Archive for Hamiltonian

6th Workshop on Sequential Monte Carlo Methods

Posted in Mountains, pictures, Statistics, Travel, University life on May 16, 2024 by xi'an

Very glad to be back at an SMC workshop, as it has been nine years since I attended SMC 2015 in Malakoff! All the more so for the workshop taking place in Edinburgh, at the Bayes Centre. It is one of those places where I feel I am returning to familiar grounds, with accumulated memories. Like my last visit there, when I had tea with Mike Titterington…

The overall pace of the workshop was quite nice, with long breaks for informal discussions (and time for ‘oggin’!) and interesting late-afternoon poster sessions, helped by the small number of posters at each instance, incl. one on reversible jump HMC. Here are a few scribbled entries about some of the talks over the first two days.

After my opening talk (!), Joaquín Míguez talked about the impact of a sequential (Euler-Maruyama) discretisation scheme for stochastic differential equations on Bayesian filtering, with control of the approximation effect. Axel Finke (in joint work with Adrien Corenflos, now an ERC Ocean postdoc in Warwick) built a sequence of particle filter algorithms targeting good performances (high expected jumping distance) against both large dimensions and long time horizons, exploiting MALA-like gradient shifts as well as the impact of the prior, with the conclusion that their jack-of-all-trades solutions, Particle-MALA and Particle-mGRAD, enjoyed this resistance in nearly normal models. Interesting reminder of the auxiliary particle trick and good insights on using the smoothing target, even when accounting for the computing time, but too many versions to digest in a single talk without checking against the preprint.
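As a reminder of the discretisation at play, here is a minimal one-dimensional sketch of the Euler-Maruyama scheme for an SDE dX_t = a(X_t)dt + b(X_t)dW_t, the kind of approximation whose effect on the filter the talk quantified. The function names and the Ornstein-Uhlenbeck example are mine, for illustration only.

```python
import numpy as np

def euler_maruyama(drift, diffusion, x0, T, n_steps, rng=None):
    """Simulate one path of dX_t = drift(X_t) dt + diffusion(X_t) dW_t
    on [0, T] with the Euler-Maruyama scheme."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt))
        x[k + 1] = x[k] + drift(x[k]) * dt + diffusion(x[k]) * dW
    return x

# e.g. an Ornstein-Uhlenbeck signal, to be observed with noise in a filtering model
path = euler_maruyama(lambda x: -x, lambda x: 0.5, x0=1.0, T=1.0, n_steps=100)
```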

The SMC sampler-like algorithm involves propagating N “seed” particles z(i), with a mutation mechanism consisting of the generation of N integrator snippets 𝗓 := (z, ψ(z), ψ²(z), …) started at every seed particle z(i), resulting in N×(T+1) particles which are then whittled down to a set of N seed particles using a standard resampling scheme. Andrieu et al., 2024

Christophe Andrieu talked about Monte Carlo sampling with integrator snippets, starting with recycling solutions for the leapfrog integrator in HMC and unfolding Hamiltonians for moving more easily. With snippets representing discretised paths along the level sets being used as particles, and zero, one, or more particles picked along each path, since the importance weights are in connection with multinomial HMC.
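A minimal sketch of how I understood the snippet mechanism, for a one-dimensional target with potential U: unfold one seed into a snippet of T+1 leapfrog states and pick the next seed with probability proportional to exp(-H), as in multinomial HMC. The function names are mine and the SMC resampling across the N seeds is omitted.

```python
import numpy as np

def leapfrog(q, p, grad_U, eps):
    """One leapfrog step for H(q, p) = U(q) + p**2 / 2."""
    p = p - 0.5 * eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    return q, p

def integrator_snippet(q0, p0, U, grad_U, eps, T, rng):
    """Unfold a seed (q0, p0) into a snippet of T+1 states and return a new
    seed drawn with probability proportional to exp(-H), plus the weights."""
    qs, ps = [q0], [p0]
    q, p = q0, p0
    for _ in range(T):
        q, p = leapfrog(q, p, grad_U, eps)
        qs.append(q)
        ps.append(p)
    logw = np.array([-(U(q) + 0.5 * p**2) for q, p in zip(qs, ps)])
    w = np.exp(logw - logw.max())
    w /= w.sum()
    k = rng.choice(T + 1, p=w)
    return qs[k], ps[k], w

rng = np.random.default_rng(0)
U = lambda q: 0.5 * q**2            # standard normal target
grad_U = lambda q: q
q, p, w = integrator_snippet(0.0, rng.standard_normal(), U, grad_U, 0.1, 20, rng)
```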

This relatively small algorithmic modification of the conditional particle filter, which we call the conditional backward sampling particle filter, has a dramatically improved performance over the conditional particle filter. Karjalainen et al., 2024

Anthony Lee looked at mixing times for backward sampling SMC (CBPF/ancestor sampling), cf. Lee et al. (2020), where the backward step consists in computing the weight of a randomly drawn backward or ancestral history. Improving on earlier results to reach a mixing time of O(log T) and a complexity of O(T log T) (with T the time horizon), thanks to maximal coupling arguments and boundedness assumptions on the prior and likelihood functions.
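For concreteness, a rough sketch of the backward-sampling step in a conditional particle filter, under my own notation (stored particles, filtering log-weights, and a log transition density); this is only an illustration, not the authors' code.

```python
import numpy as np

def backward_sample(particles, logw, log_trans, rng):
    """Draw one trajectory backwards through stored particles: particles[t][i]
    are the stored states at time t, logw[t] their filtering log-weights, and
    log_trans(x, y) the log transition density from x to y."""
    T = len(particles)
    wT = np.exp(logw[-1] - logw[-1].max())
    wT /= wT.sum()
    idx = rng.choice(len(wT), p=wT)          # final state from terminal weights
    path = [particles[-1][idx]]
    for t in range(T - 2, -1, -1):
        lw = logw[t] + np.array([log_trans(x, path[0]) for x in particles[t]])
        w = np.exp(lw - lw.max())
        w /= w.sum()
        idx = rng.choice(len(w), p=w)        # backward index at time t
        path.insert(0, particles[t][idx])
    return path
```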

Neil Chada presented a work on Bayesian multilevel Monte Carlo on deep networks. À la Giles, with a telescoping identity. Always puzzling to envision a prior on all parameters of a neural network. Achieving, at best, a computational cost of the order of the inverse MSE. With a useful reminder that pushing the size of the NN to infinity results in a (poor) Gaussian process prior (Sell et al., 2023).
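For reference, the telescoping identity at the core of multilevel Monte Carlo (à la Giles) writes, for a sequence of increasingly accurate approximations $P_0,\ldots,P_L$ of the quantity of interest,

$$\mathbb{E}[P_L] = \mathbb{E}[P_0] + \sum_{\ell=1}^{L} \mathbb{E}[P_\ell - P_{\ell-1}],$$

with each expectation on the right estimated by an independent Monte Carlo average, using many cheap samples at the coarse levels and fewer expensive samples at the fine levels.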

On my first evening, I stopped with a friend at my favourite Blonde [restaurant], as on almost every other visit to Edinburgh, enjoyable as always, but I also found the huge offering of Asian minimarkets in the area too tempting to resist, between Indian, Korean, and Chinese products. (Although with a disappointing hojicha!) As I could not reach any new Munro by train or bus within a reasonable time range, I resorted to the nearer Pentland Hills, with a stop by Rosslyn Chapel (mostly of Da Vinci Code fame, if classic enough!). And experienced some delays in finding a bus getting there (misled by Google Maps!) and a trail up the actual hills (misled by my poor map reading skills). The mist did not help either.

robustified Hamiltonian

Posted in Books, Statistics, University life on April 1, 2022 by xi'an

In Gregynog, last week, Lionel Riou-Durand (Warwick) presented his recent work with Jure Vogrinc on Metropolis Adjusted Langevin Trajectories, which I had also heard in the Séminaire Parisien de Statistique two weeks ago. Starting with a nice exposition of Hamiltonian Monte Carlo, highlighting its drawbacks. These include the potentially damaging impact of poorly tuning the integration time. Their proposal is to act upon the velocity in the Hamiltonian through Langevin (positive) damping, which also preserves the stationarity. (And connects with randomised HMC.) One theoretical result in the paper is that the Langevin diffusion achieves the fastest mixing rate among randomised HMCs. From a practical perspective, there exists a version of the leapfrog integrator that adapts to this setting and can be implemented as a Metropolis adjustment. (Hence the MALT connection.) An interesting feature is that the process as such is ergodic, which avoids renewal steps (and U-turns). (There are still calibration parameters to adjust, obviously.)
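To illustrate the damping mechanism (in my own words and code), here is a minimal sketch of a trajectory alternating leapfrog steps with Ornstein-Uhlenbeck refreshments of the velocity; this is not the exact integrator of the paper, and the Metropolis adjustment over the whole trajectory is omitted.

```python
import numpy as np

def malt_style_trajectory(q, v, grad_U, eps, gamma, L, rng):
    """One damped trajectory in the spirit of Metropolis Adjusted Langevin
    Trajectories: leapfrog steps for the Hamiltonian part interleaved with
    Ornstein-Uhlenbeck (Langevin) damping of the velocity. q and v are numpy
    arrays; the final accept/reject step over the whole path is omitted."""
    a = np.exp(-gamma * eps)
    for _ in range(L):
        # partial velocity refreshment (the Langevin damping step)
        v = a * v + np.sqrt(1.0 - a * a) * rng.standard_normal(v.shape)
        # one leapfrog step
        v = v - 0.5 * eps * grad_U(q)
        q = q + eps * v
        v = v - 0.5 * eps * grad_U(q)
    return q, v
```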

mean field Langevin system & neural networks

Posted in Statistics on February 4, 2020 by xi'an

A colleague of mine in Paris Dauphine, Zhenjie Ren, recently gave a talk on recent papers of his connecting neural nets and Langevin. Estimating the parameters of the NNs by mean-field Langevin dynamics. Following from an earlier paper on the topic by Mei, Montanari & Nguyen in 2018. Here are some notes I took during the seminar, not necessarily coherent as I was a bit under the weather that day. And had no previous exposure to most notions.

Fitting a one-layer network is turned into a minimisation programme over a measure space (when using loads of data). A reformulation that makes the problem convex. Adding a regularisation by the entropy and introducing derivatives of a functional against the measure. With a necessary and sufficient condition for the solution to be unique when the functional is convex. This reformulation leads to a Fokker-Planck equation, itself related to a Langevin diffusion. Except that there is a measure in the Langevin equation, whose stationary version is the solution of the original regularised minimisation programme.
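To fix notation (mine, not necessarily the speaker's), the entropy-regularised programme and the associated mean-field Langevin dynamics can be written as

$$\min_{m} \; F(m) + \frac{\sigma^2}{2}\,\mathrm{Ent}(m), \qquad \mathrm{d}\theta_t = -\nabla_\theta \frac{\delta F}{\delta m}(m_t,\theta_t)\,\mathrm{d}t + \sigma\,\mathrm{d}W_t, \quad m_t = \mathrm{Law}(\theta_t),$$

where $\delta F/\delta m$ denotes the derivative of the functional against the measure; the invariant distribution of this McKean-Vlasov diffusion solves the regularised minimisation programme.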

A second paper contains an extension to deep NNs, re-expressed as a problem in a random environment. Or with a marginal constraint (one marginal distribution being constrained). With a partial derivative w.r.t. the marginal measure. Turning into a Langevin diffusion with an extra random element. Using optimal control produces a new Hamiltonian. Eventually producing the mean-field Langevin system as backward propagation. Coefficients being computed by the chain rule, equivalent to an Euler scheme for Langevin dynamics.

This approach has consequences for GANs, with the discriminator as a one-layer NN and the generator minimised over two measures. The discriminator is the invariant measure of the mean-field Langevin dynamics. Mentioning Metropolis-Hastings GANs, which seem to require one full run of an MCMC algorithm at each iteration of the mean-field Langevin.

BimPressioNs [BNP11]

Posted in Books, pictures, Statistics, Travel, University life, Wines on June 29, 2017 by xi'an

While my participation in BNP 11 has so far been more at the janitor level [although not gaining George Casella’s reputation on NPR!] than at the scientific one, since we had decided in favour of the least expensive and unstaffed option for coffee breaks, to keep the registration fees at a minimum [although I would have gladly gone all the way to removing all coffee breaks, if only because such breaks produce much garbage!], I had fairly good chats at the second poster session, in particular around empirical likelihoods and HMC for discrete parameters, the first one based on the general Cressie-Read formulation and the second around the recently arXived paper of Nishimura et al., which I wanted to read. Plus many other good chats full stop, around terrific cheese platters!

This morning, the coffee breaks were much more under control and I managed to enjoy [and chair] the entire session on empirical likelihood, with absolutely fantastic talks from Nils Hjort and Art Owen (the third speaker having gone AWOL, possibly a direct consequence of Trump’s travel ban).

asymptotically exact inference in likelihood-free models [a reply from the authors]

Posted in R, Statistics on December 1, 2016 by xi'an

[Following my post of last Tuesday, Matt Graham commented on the paper in great detail. Here are those comments. A nicer HTML version of the Markdown reply below is also available on Github.]

Thanks for the comments on the paper!

A few additional replies to augment what Amos wrote:

This however sounds somewhat intense in that it involves a quasi-Newton resolution at each step.

The method is definitely computationally expensive. If the constraint function maps from an M-dimensional space to an N-dimensional space, with M > N, for large N the dominant costs at each timestep are usually the constraint Jacobian (∂c/∂u) evaluation (with reverse-mode automatic differentiation this can be evaluated at a cost of O(N) generator / constraint evaluations) and the Cholesky decomposition of the Jacobian product (∂c/∂u)(∂c/∂u)ᵀ with O(N³) cost (though in many cases, e.g. i.i.d. or Markovian simulated data, structure in the generator Jacobian can be exploited to give a significantly reduced cost). Each inner Quasi-Newton update involves a pair of triangular solve operations which have a O(N²) cost, two matrix-vector multiplications with O(MN) cost, and a single constraint / generator function evaluation; the number of Quasi-Newton updates required for convergence in the numerical experiments tended to be much less than N, hence the Quasi-Newton iteration tended not to be the main cost.
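A minimal sketch of one such quasi-Newton projection step, with my own naming and not the paper's code: the Jacobian and the Cholesky factor of its Gram matrix are computed once at the previous state and reused, so each inner update of the sketch costs two triangular solves, one matrix-vector product, and one constraint evaluation.

```python
import numpy as np
from scipy.linalg import cho_solve

def quasi_newton_project(u, constr, J, L, tol=1e-9, max_iter=50):
    """Move a proposal u back onto the constraint manifold c(u) = 0.
    J is the N x M constraint Jacobian dc/du and L the lower Cholesky factor
    of the Gram matrix J @ J.T, both evaluated once at the previous state and
    reused across iterations."""
    for _ in range(max_iter):
        c = constr(u)                     # current constraint violation, length N
        if np.max(np.abs(c)) < tol:
            return u, True
        lam = cho_solve((L, True), c)     # solve (J J^T) lam = c, two triangular solves
        u = u - J.T @ lam                 # O(MN) step back towards the manifold
    return u, False
```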

The high computation cost per update is traded off however with often being able to make much larger proposed moves in high-dimensional state spaces with a high chance of acceptance compared to ABC MCMC approaches. Even in the relatively small Lotka-Volterra example we provide which has an input dimension of 104 (four inputs which map to ‘parameters’, and 100 inputs which map to ‘noise’ variables), the ABC MCMC chains using the coarse ABC kernel radius ϵ=100 with comparably very cheap updates were significantly less efficient in terms of effective sample size / computation time than the proposed constrained HMC approach. This was in large part due to the elliptical slice sampling updates in the ABC MCMC chains generally collapsing down to very small moves even for this relatively coarse ϵ. Performance was even worse using non-adaptive ABC MCMC methods and for smaller ϵ, and for higher input dimensions (e.g. using a longer sequence with correspondingly more random inputs) the comparison becomes even more favourable for the constrained HMC approach.