BimPressioNs [BNP11]

Posted in Books, pictures, Statistics, Travel, University life, Wines with tags , , , , , , , , , on June 29, 2017 by xi'an

While my participation to BNP 11 has so far been more at the janitor level [although not gaining George Casella’s reputation on NPR!] than at the scientific one, since we had decided in favour of the least expensive and unstaffed option for coffee breaks, to keep the registration fees at a minimum [although I would have gladly gone all the way to removing all coffee breaks!, if only because such breaks produce much garbage], I had fairly good chats at the second poster session, in particular around empirical likelihoods and HMC for discrete parameters, the first one based on the general Cressie-Read formulation and the second around the recently arXived paper of Nishimura et al., which I wanted to read. Plus many other good chats full stop, around terrific cheese platters!

View this post on Instagram

A post shared by Shane Jensen (@tastierkakes) on

This morning, the coffee breaks were much more under control and I managed to enjoy [and chair] the entire session on empirical likelihood, with absolutely fantastic talks from Nils Hjort and Art Owen (the third speaker having gone AWOL, possibly a direct consequence of Trump’s travel ban).

asymptotically exact inference in likelihood-free models [a reply from the authors]

Posted in R, Statistics with tags , , , , , , , , , , , , , , , , , on December 1, 2016 by xi'an

[Following my post of lastTuesday, Matt Graham commented on the paper with force détails. Here are those comments. A nicer HTML version of the Markdown reply below is also available on Github.]

Thanks for the comments on the paper!

A few additional replies to augment what Amos wrote:

This however sounds somewhat intense in that it involves a quasi-Newton resolution at each step.

The method is definitely computationally expensive. If the constraint function is of the form of a function from an M-dimensional space to an N-dimensional space, with MN, for large N the dominant costs at each timestep are usually the constraint Jacobian (c/u) evaluation (with reverse-mode automatic differentiation this can be evaluated at a cost of O(N) generator / constraint evaluations) and Cholesky decomposition of the Jacobian product (c/u)(c/u) with O(N³) cost (though in many cases e.g. i.i.d. or Markovian simulated data, structure in the generator Jacobian can be exploited to give a significantly reduced cost). Each inner Quasi-Newton update involves a pair of triangular solve operations which have a O(N²) cost, two matrix-vector multiplications with O(MN) cost, and a single constraint / generator function evaluation; the number of Quasi-Newton updates required for convergence in the numerical experiments tended to be much less than N hence the Quasi-Newton iteration tended not to be the main cost.

The high computation cost per update is traded off however with often being able to make much larger proposed moves in high-dimensional state spaces with a high chance of acceptance compared to ABC MCMC approaches. Even in the relatively small Lotka-Volterra example we provide which has an input dimension of 104 (four inputs which map to ‘parameters’, and 100 inputs which map to ‘noise’ variables), the ABC MCMC chains using the coarse ABC kernel radius ϵ=100 with comparably very cheap updates were significantly less efficient in terms of effective sample size / computation time than the proposed constrained HMC approach. This was in large part due to the elliptical slice sampling updates in the ABC MCMC chains generally collapsing down to very small moves even for this relatively coarse ϵ. Performance was even worse using non-adaptive ABC MCMC methods and for smaller ϵ, and for higher input dimensions (e.g. using a longer sequence with correspondingly more random inputs) the comparison becomes even more favourable for the constrained HMC approach. Continue reading

Posted in Books, Mountains, pictures, Statistics, University life with tags , , , , , on May 26, 2015 by xi'an

[Here are comments made by Matt Graham that I thought would be more readable in a post format. The beautiful picture of the Alps above is his as well. I do not truly understand what Matt’s point is, as I did not cover continuous time processes in my discussion…]

In terms of interpretation of the diffusion with non-reversible drift component, I think this can be generalised from the Gaussian invariant density case by

dx = [ – (∂E/∂x) dt + √2 dw ] + S’ (∂E/∂x) dt

where ∂E/∂x is the usual gradient of the negative log (unnormalised) density / energy and S=-S’ is a skew symmetric matrix. In this form it seems the dynamic can be interpreted as the composition of an energy and volume conserving (non-canonical) Hamiltonian dynamic

dx/dt = S’ ∂E/∂x

and a (non-preconditioned) Langevin diffusion

dx = – (∂E/∂x) dt + √2 dw

As an alternative to discretising the combined dynamic, it might be interesting to compare to sequential alternation between ‘Hamiltonian’ steps either using a simple Euler discretisation

x’ = x + h S’ ∂E/∂x

or a symplectic method like implicit midpoint to maintain reversibility / volume preservation and then a standard MALA step

x’ = x – h (∂E/∂x) + √2 h w, w ~ N(0, I)

plus MH accept. If only one final MH accept step is done this overall dynamic will be reversible, however if a an intermediary MH accept was done after each Hamiltonian step (flipping the sign / transposing S on a rejection to maintain reversibility), the composed dynamic would in general be non-longer reversible and it would be interesting to compare performance in this case to that using a non-reversible MH acceptance on the combined dynamic (this alternative sidestepping the issues with finding an appropriate scale ε to maintain the non-negativity condition on the sum of the vorticity density and joint density on a proposed and current state).

With regards to your point on the positivity of g(x,y)+π(y)q(y,x), I’m not sure if I have fully understood your notation correctly or not, but I think you may have misread the definition of g(x,y) for the discretised Ornstein-Uhlenbeck case (apologies if instead the misinterpretation is mine!). The vorticity density is defined as the skew symmetric component of the density f of F(dx, dy) = µ(dx) Q(x, dy) with respect to the Lebesgue measure, where µ(dx) is the true invariant distribution of the Euler-Maruyama discretised diffusion based proposal density Q(x, dy) rather than g(x, y) being defined in terms of the skew-symmetric component of π(dx) Q(x, dy) which in general would lead to a vorticity density which does not meet the zero integral requirement as the target density π is not invariant in general with respect to Q.

Pre-ordinary meeting

Posted in R, Statistics, Travel, University life with tags , , , , , on October 8, 2010 by xi'an

Those are the slides for the (basic) introduction of the paper by Mark Girolami and Ben Calderhead at the RSS next week. Not to be confused with my comments on the paper.

Riemann, Langevin & Hamilton

Posted in Statistics, University life with tags , , , , , on September 27, 2010 by xi'an

In preparation for the Read Paper session next month at the RSS, our research group at CREST has collectively read the Girolami and Calderhead paper on Riemann manifold Langevin and Hamiltonian Monte Carlo methods and I hope we will again produce a joint arXiv preprint out of our comments. (The above picture is reproduced from Radford Neal’s talk at JSM 1999 in Baltimore, talk that I remember attending…) Although this only represents my preliminary/early impression on the paper, I have trouble with the Physics connection. Because it involves continuous time events that are not transcribed directly into the simulation process.

Overall, trying to take advantage of second order properties of the target—just like the Langevin improvement takes advantage of the first order—is a natural idea which, when implementable, can obviously speed up convergence. This is the Langevin part, which may use a fixed metric M or a local metric defining a Riemann manifold, G(θ). So far, so good, assuming the derivation of an observed or expected information G(θ) is feasible up to some approximation level. The Hamiltonian part that confuses me introduces a dynamic on level sets of

$\mathscr{H}(\theta,\mathbf{p})=-\mathcal{L}(\theta)+\frac{1}{2}\log\{(2\pi)^D|\mathbf{G}(\theta)|\}+\frac{1}{2}\mathbf{p}^\text{T}\mathbf{G}(\theta)^{-1}\mathbf{p}$

where p is an auxiliary vector of dimension D. Namely,

$\dot{\mathbf{p}} = \dfrac{\partial \mathscr{H}}{\partial \mathbf{p}}(\theta,\mathbf{p})\,,\qquad\dot{\theta}=\dfrac{\partial \mathscr{H}}{\partial \theta}(\theta,\mathbf{p})\,.$

While I understand the purpose of the auxiliary vector, namely to speed up the exploration of the posterior surface by taking advantage of the additional energy provided by p, I fail to understand why the fact that the discretised (Euler) approximation to Hamilton’s equations is not available in closed form is such an issue…. The fact that the (deterministic?) leapfrog integrator is not exact should not matter since this can be corrected by a Metropolis-Hastings step.

While the logistic example is mostly a toy problem (where importance sampling works extremely well, as shown in our survey with Jean-Michel Marin), the stochastic volatility is more challenging and the fact that the Hamiltonian scheme applies to the missing data (volatility) as well as to the three parameters of the model is quite interesting. I however wonder at the appeal of this involved scheme when considering that the full conditional of the volatility can be simulated exactly