## the HMC algorithm meets the exchange algorithm

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , on July 26, 2017 by xi'an

Julien Stoehr (now in Dublin, soon to join us as a new faculty in Paris-Dauphine!), Alan Benson and Nial Friel (both at UCD) arXived last week a paper entitled Noisy HMC for doubly-intractable distributions. Which considers solutions for adapting Hamiltonian Monte Carlo to target densities that involve a missing constant. In the sense of our workshop last year in Warwick. And in the theme pursued by Nial in the past years. The notion is thus to tackle a density π(θ)∞exp(V(X|θ)/Z(θ) when Z(θ) is intractable. In that case the gradient of log Z(θ) can be estimated as the expectation of the gradient of V(X|θ) [as a standard exponential family identity]. And the ratio of the Z(θ)’s appearing in the Metropolis ratio can be derived by Iain Murray’s exchange algorithm, based on simulations from the sampling distribution attached to the parameter in the denominator.

The resulting algorithm proposed by the authors thus uses N simulations of auxiliary variables at each step þ of the leapfrog part, towards an approximation of the gradient term, plus another N simulations for approximating the ratio of the normalising constants Z(θ)/Z(θ’). While justified from an importance sampling perspective, this approximation is quite poor when θ and θ’ differ. A better solution [as shown in the paper] is to take advantage of all leapfrog steps and of associated auxiliary simulations to build a telescopic product of ratios where the parameter values θ and θ’ are much closer. The main difficulty is in drawing a comparison with the exchange algorithm, since the noisy HMC version is computationally more demanding. (A secondary difficulty is in having an approximate algorithm that no longer leaves the target density stationary.)

## HMC sampling in Bayesian empirical likelihood computation

Posted in Statistics with tags , , , , , , , on March 31, 2017 by xi'an

While working on the Series B’log the other day I noticed this paper by Chauduri et al. on Hamiltonian Monte Carlo and empirical likelihood: how exciting!!! Here is the abstract of the paper:

We consider Bayesian empirical likelihood estimation and develop an efficient Hamiltonian Monte Car lo method for sampling from the posterior distribution of the parameters of interest.The method proposed uses hitherto unknown properties of the gradient of the underlying log-empirical-likelihood function. We use results from convex analysis to show that these properties hold under minimal assumptions on the parameter space, prior density and the functions used in the estimating equations determining the empirical likelihood. Our method employs a finite number of estimating equations and observations but produces valid semi-parametric inference for a large class of statistical models including mixed effects models, generalized linear models and hierarchical Bayes models. We overcome major challenges posed by complex, non-convex boundaries of the support routinely observed for empirical likelihood which prevent efficient implementation of traditional Markov chain Monte Car lo methods like random-walk Metropolis–Hastings sampling etc. with or without parallel tempering. A simulation study confirms that our method converges quickly and draws samples from the posterior support efficiently. We further illustrate its utility through an analysis of a discrete data set in small area estimation.

[The comment is reposted from Series B’log, where I wrote it first.]

It is of particular interest for me [disclaimer: I was not involved in the review of this paper!] as we worked on ABC thru empirical likelihood, which is about the reverse of the current paper in terms of motivation: when faced with a complex model, we substitute an empirical likelihood version for the real thing, run simulations from the prior distribution and use the empirical likelihood as a proxy. With possible intricacies when the data is not iid (an issue we also met with Wasserstein distances.) In this paper the authors instead consider working on an empirical likelihood as their starting point and derive an HMC algorithm to do so. The idea is striking in that, by nature, an empirical likelihood is not a very smooth object and hence does not seem open to producing gradients and Hessians. As illustrated by Figure 1 in the paper . Which is so spiky at places that one may wonder at the representativity of such graphs.

I have always had a persistent worry about the ultimate validity of treating the empirical likelihood as a genuine likelihood, from the fact that it is the result of an optimisation problem to the issue that the approximate empirical distribution has a finite (data-dependent) support, hence is completely orthogonal to the true distribution. And to the one that the likelihood function is zero outside the convex hull of the defining equations…(For one thing, this empirical likelihood is always bounded by one but this may be irrelevant after all!)

The computational difficulty in handling the empirical likelihood starts with its support. Eliminating values of the parameter for which this empirical likelihood is zero amounts to checking whether zero belongs to the above convex hull. A hard (NP hard?) problem. (Although I do not understand why the authors dismiss the token observations of Owen and others. The argument that Bayesian analysis does more than maximising a likelihood seems to confuse the empirical likelihood as a product of a maximisation step with the empirical likelihood as a function of the parameter that can be used as any other function.)

In the simple regression example (pp.297-299), I find the choice of the moment constraints puzzling, in that they address the mean of the white noise (zero) and the covariance with the regressors (zero too). Puzzling because my definition of the regression model is conditional on the regressors and hence does not imply anything on their distribution. In a sense this is another model. But I also note that the approach focus on the distribution of the reconstituted white noises, as we did in the PNAS paper. (The three examples processed in the paper are all simple and could be processed by regular MCMC, thus making the preliminary step of calling for an empirical likelihood somewhat artificial unless I missed the motivation. The paper also does not seem to discuss the impact of the choice of the moment constraints or the computing constraints involved by a function that is itself the result of a maximisation problem.)

A significant part of the paper is dedicated to the optimisation problem and the exclusion of the points on the boundary. Which sounds like a non-problem in continuous settings. However, this appears to be of importance for running an HMC as it cannot evade the support (without token observations). On principle, HMC should not leave this support since the gradient diverges at the boundary, but in practice the leapfrog approximation may lead the path outside. I would have (naïvely?) suggested to reject moves when this happens and start again but the authors consider that proper choices of the calibration factors of HMC can avoid this problem. Which seems to induce a practical issue by turning the algorithm into an adaptive version.

As a last point, I would have enjoyed seeing a comparison of the performances against our (A)BCel version, which would have been straightforward to implement in the simple examples handled by the paper. (This could be a neat undergraduate project for next year!)

## the fundamental incompatibility of HMC and data subsampling

Posted in Books, Statistics, University life with tags , , , , , , on February 23, 2015 by xi'an

Last week, Michael Betancourt, from WarwickarXived a neat wee note on the fundamental difficulties in running HMC on a subsample of the original data. The core message is that using only one fraction of the data to run an HMC with the hope that it will preserve the stationary distribution does not work. The only way to recover from the bias is to use a Metropolis-Hastings step using the whole data, a step that both kills most of the computing gain and has very low acceptance probabilities. Even the strategy that subsamples for each step in a single trajectory fails: there cannot be a significant gain in time without a significant bias in the outcome. Too bad..! Now, there are ways of accelerating HMC, for instance by parallelising the computation of gradients but, just as in any other approach (?), the information provided by the whole data is only available when looking at the whole data.

## Monte Carlo workshop (Tage 1 & 2)

Posted in Statistics, Travel, University life with tags , , , , , , , , , , on February 21, 2013 by xi'an

Gathering with simulators from other fields (mostly [quantum] physicists) offers both the appeal of seeing different perspectives on simulation and the diffiulty of having to filter alien vocabulary and presentation styles (generally assuming too much background from the audience). For instance; while the first talk on Tuesday by Gergely Barnaföldi about using GPUs for simulation was quite accessible, showing poor performances of the (CPU based) Mersenne twister., when using Dieharder as the evaluator. (This was in comparison with GPU-based solutions.) This provided an interesting contrapoint to the (later) seminar by Frederik James on random generators. (Of course, I did have some preliminary background on the topic.)

On the opposite, the second talk by Stefan Schäfer involved hybrid Monte Carlo methods but it took a lot of efforts (for me) to translate back to my understanding of the notion, gathered from this earlier Read Paper of Girolami and Calderhead, with the heat-bath and leapfrog algorithms. One extreme talk in this regard was William Lester’s talk on Wednesday morning on quantum Monte Carlo and its applications in computational chemistry where I could not get past the formulas! Too bad because it sounded quite innovative with notions like variational Monte Carlo and diffusion Monte Carlo… Nice movies, though. On the other hand, the final talk of the morning by Gabor Molnar-Saska on option pricing was highly pedagogical, defining everything and using simple examples as illustrations. (It certainly did not cure my misgivings about modelling the evolution of stock prices via pre-defined diffusions like Black-and-Scholes’, but the introduction was welcome, given the heterogeneity of the audience.) Both talks on transportation problems were also more accessible (maybe because they involved no pysics!)

The speakers in the afternoon sessions of Wednesday also made a huge effort to bring the whole audience up-to-date about their topic, like protein folding and high-energy particle physics (although everyone knows about the Higgs boson nowadays!). And ensemble Kalman filters (x2). In particular, Andrew Stuart did a great job with his simulation movies. Even the final talk about path-sampling for quantum simulation was mostly understandable, at least the problematic of it.  Sadly, at this stage, I still cannot put a meaning on “quantum Monte Carlo”… (Incidentally, I do not think my own talk reached much of the audience, missing convincing examples I did not have time to present:)