Archive for delayed rejection sampling

sampling, transport, and diffusions

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , on November 18, 2022 by xi'an


This week, I am attending a very cool workshop at the Flatiron Institute (not in the Flatiron building!, but close enough) on Sampling, Transport, and Diffusions, organised by Bob Carpenter and Michael Albergo. It is quite exciting as I do not know most participants or their work! The Flatiron Institute is a private institute focussed on fundamental science funded by the Simons Foundation (in such working conditions universities cannot compete with!).

Eric Vanden-Eijden gave an introductory lecture on using optimal transport notion to improve sampling, with a PDE/ODE approach of continuously turning a base distribution into a target (formalised by the distribution at time one). This amounts to solving a velocity solution to an KL optimisation objective whose target value is zero. Velocity parameterised as a deep neural network density estimator. Using a score function in a reverse SDE inspired by Hyvärinnen (2005), with a surprising occurrence of Stein’s unbiased estimator, there for the same reasons of getting rid of an unknown element. In a lot of environments, simulating from the target is the goal and this can be achieved by MCMC sampling by normalising flows, learning the transform / pushforward map.

At the break, Yuling Yao made a very smart remark that testing between two models could also be seen as an optimal transport, trying to figure an optimal transform from one model to the next, rather than the bland mixture model we used in our mixtestin paper. At this point I have no idea about the practical difficulty of using / inferring the parameters of this continuum but one could start from normalising flows. Because of time continuity, one would need some driving principle.

Esteban Tabak gave another interest talk on simulating from a conditional distribution, which sounds like a no-problem when the conditional density is known but a challenge when only pairs are observed. The problem is seen as a transport problem to a barycentre obtained as a distribution independent from the conditioning z and then inverting. Constructing maps through flows. Very cool, even possibly providing an answer for causality questions.

Many of the transport talks involved normalizing flows. One by [Simons Fellow] Christopher Jazynski about adding to the Hamiltonian (in HMC) an artificial flow field  (Vaikuntanathan and Jarzynski, 2009) to make up for the Hamiltonian moving too fast for the simulation to keep track. Connected with Eric Vanden-Eijden’s talk in the end.

An interesting extension of delayed rejection for HMC by Chirag Modi, with a manageable correction à la Antonietta Mira. Johnatan Niles-Weed provided a nonparametric perspective on optimal transport following Hütter+Rigollet, 21 AoS. With forays into the Sinkhorn algorithm, mentioning Aude Genevay’s (Dauphine graduate) regularisation.

Michael Lindsey gave a great presentation on the estimation of the trace of a matrix by the Hutchinson estimator for sdp matrices using only matrix multiplication. Solution surprisingly relying on Gibbs sampling called thermal sampling.

And while it did not involve optimal transport, I gave a short (lightning) talk on our recent adaptive restore paper: although in retrospect a presentation of Wasserstein ABC could have been more suited to the audience.

general perspective on the Metropolis–Hastings kernel

Posted in Books, Statistics with tags , , , , , , , , , , , , , on January 14, 2021 by xi'an

[My Bristol friends and co-authors] Christophe Andrieu, and Anthony Lee, along with Sam Livingstone arXived a massive paper on 01 January on the Metropolis-Hastings kernel.

“Our aim is to develop a framework making establishing correctness of complex Markov chain Monte Carlo kernels a purely mechanical or algebraic exercise, while making communication of ideas simpler and unambiguous by allowing a stronger focus on essential features (…) This framework can also be used to validate kernels that do not satisfy detailed balance, i.e. which are not reversible, but a modified version thereof.”

A central notion in this highly general framework is, extending Tierney (1998), to see an MCMC kernel as a triplet involving a probability measure μ (on an extended space), an involution transform φ generalising the proposal step (i.e. þ²=id), and an associated acceptance probability ð. Then μ-reversibility occurs for

\eth(\xi)\mu(\text{d}\xi)= \eth(\phi(\xi))\mu^{\phi}(\text{d}\xi)

with the rhs involving the push-forward measure induced by μ and φ. And furthermore there is always a choice of an acceptance probability ð ensuring for this equality to happen. Interestingly, the new framework allows for mostly seamless handling of more complex versions of MCMC such as reversible jump and parallel tempering. But also non-reversible kernels, incl. for instance delayed rejection. And HMC, incl. NUTS. And pseudo-marginal, multiple-try, PDMPs, &c., &c. it is remarkable to see such a general theory emerging a this (late?) stage of the evolution of the field (and I will need more time and attention to understand its consequences).

more multiple proposal MCMC

Posted in Books, Statistics with tags , , , , , , , on July 26, 2018 by xi'an

Luo and Tjelmeland just arXived a paper on a new version of multiple-try Metropolis Hastings, the addendum being in defining the additional proposed copies via a dependence graph like (a) above, with one version from the target and all others from operational and conditional proposal kernels. Respecting the dependence graph, as in (b). As I did, you may then wonder where both the graph and the conditional do come from. Which reminds me of the pseudo-posteriors of Carlin and Chib (1995), even though this is not terribly connected. Green (1995).) (But not disconnected either since the authors mention But, given the graph, following a Gibbs scheme, one of the 17 nodes is chosen as generated from the target, using the proper conditional on that index [which is purely artificial from the point of view of the original simulation problem!]. As noted above, the graph is an issue, but since it is artificial, it can be devised to either allow for quasi-independence between the proposed values or on the opposite to induce long range dependence, which corresponds to conducting multiple MCMC steps before reaching the end nodes, a feature that is very appealing in my opinion. And reminds me of prefetching. (As I am listening to Nicolas Chopin’s lecture in Warwick at the moment, there also seems to be a connection with pMCMC.) Still, I remain unclear as to the devising of the graph of dependent proposals, as its depth should be somehow connected with the mixing properties of the original proposal. Gains in convergence may thus come at a high cost at the construction stage.

computational methods for statistical mechanics [day #4]

Posted in Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , on June 7, 2014 by xi'an

Arthur Seat, Edinburgh, Sep. 7, 2011

My last day at this ICMS workshop on molecular simulation started [with a double loop of Arthur’s Seat thankfully avoiding the heavy rains of the previous night and then] Chris Chipot‘s magistral entry to molecular simulation for proteins with impressive slides and simulation movies, even though I could not follow the details to really understand the simulation challenges therein, just catching a few connections with earlier talks. A typical example of a cross-disciplinary gap, where the other discipline always seems to be stressing the ‘wrong” aspects. Although this is perfectly unrealistic, it would immensely to prepare talks in pairs for such interdisciplinary workshops! Then Gersende Fort presented results about convergence and efficiency for the Wang-Landau algorithm. The idea is to find the optimal rate for updating the weights of the elements of the partition towards reaching the flat histogram in minimal time. Showing massive gains on toy examples. The next talk went back to molecular biology with Jérôme Hénin‘s presentation on improved adaptive biased sampling. With an exciting notion of orthogonality aiming at finding the slowest directions in the target and putting the computational effort. He also discussed the tension between long single simulations and short repeated ones, echoing a long-going debate in the MCMC community. (He also had a slide with a picture of my first 1983 Apple IIe computer!) Then Antonietta Mira gave a broad perspective on delayed rejection and zero variance estimates. With impressive variance reductions (although some physicists then asked for reduction of order 10¹⁰!). Johannes Zimmer gave a beautiful maths talk on the connection between particle and diffusion limits (PDEs) and Wasserstein geometry and large deviations. (I did not get most of the talk, but it was nonetheless beautiful!) Bert Kappen concluded the day (and the workshop for me) by a nice introduction to control theory. Making connection between optimal control and optimal importance sampling. Which made me idly think of the following problem: what if control cannot be completely… controlled and hence involves a stochastic part? Presumably of little interest as the control would then be on the parameters of the distribution of the control.

“The alanine dipeptide is the fruit fly of molecular simulation.”

The example of this alanine dipeptide molecule was so recurrent during the talks that it justified the above quote by Michael Allen. Not that I am more proficient in the point of studying this protein or using it as a benchmark. Or in identifying the specifics of the challenges of molecular dynamics simulation. Not a criticism of the ICMS workshop obviously, but rather of my congenital difficulty with continuous time processes!!! So I do not return from Edinburgh with a new research collaborative project in molecular dynamics (if with more traditional prospects), albeit with the perception that a minimal effort could bring me to breach the vocabulary barrier. And maybe consider ABC ventures in those (new) domains. (Although I fear my talk on ABC did not impact most of the audience!)

Andrew gone NUTS!

Posted in pictures, R, Statistics, University life with tags , , , , , , , , , , on November 24, 2011 by xi'an

Matthew Hoffman and Andrew Gelman have posted a paper on arXiv entitled “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo” and developing an improvement on the Hamiltonian Monte Carlo algorithm called NUTS (!). Here is the abstract:

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive to two user-specified parameters: a step size ε and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS perform at least as efficiently as and sometimes more efficiently than a well tuned standard HMC method, without requiring user intervention or costly tuning runs. We also derive a method for adapting the step size parameter ε on the fly based on primal-dual averaging. NUTS can thus be used with no hand-tuning at all. NUTS is also suitable for applications such as BUGS-style automatic inference engines that require efficient “turnkey” sampling algorithms.

Now, my suspicious and pessimistic nature always makes me wary of universality claims! I completely acknowledge the difficulty in picking the number of leapfrog steps L in the Hamiltonian algorithm, since the theory behind does not tell anything useful about L. And the paper is quite convincing in its description of the NUTS algorithm, being moreover beautifully written.  As indicated in the paper, the “doubling” solution adopted by NUTS is reminding me of Radford Neal’s procedure in his Annals of Statistics paper on slice sampling. (The NUTS algorithm also relies on a slice sampling step.) However, besides its ability to work as an automatic Hamiltonian methodology, I wonder about the computing time (and the “unacceptably large amount of memory”, p.12) required by the doubling procedure: 2j is growing fast with j! (If my intuition is right, the computing time should increase rather quickly with the dimension. And I do not get the argument within the paper that the costly part is the gradient computation: it seems to me the gradient must be computed for all of the 2j points.) The authors also mention delayed rejection à la Tierney and Mira (1999) and the scheme reminded me a wee of the pinball sampler we devised a while ago with Kerrie Mengersen. Choosing the discretisation step ε is more “traditional”, using the stochastic approximation approach we set in our unpublished-yet-often-quoted tech report with Christophe Andrieu. (I do not think I got the crux of the “dual averaging” for optimal calibration on p.11) The illustration through four benchmarks [incl. a stochastic volatility model!] is quite convincing as well, with (unsurprisingly) great graphical tools. A final grumble: that the code is “only” available in the proprietary language Matlab! Now, I bet some kind of Rao-Blackwellisation is feasible with all the intermediate simulations!

%d bloggers like this: