If only someone had written a paper about exactly the kind of geometry you get in hierarchical models with different parameterizations and how they interact with various HMC schemes…

]]>Using KMC on a joint space in a hierarchical model will break it: too many dimensions! But why would one do that? If HMC is available, use it! We include HMC results in some of the experiments and it naturally outperforms KMC or other energy surrogates (modulo some potential speed-up on “tall data” or easy models, as in Babak’s paper). But what to do when gradients are not available? Filippone & Girolami, PAMI, 2014, for example, use a random walk on the GPC marginal posterior, and KMC for sure can do better than that.

]]>Thanks for the flowers on the name, Dan ;)

Off topic: http://www.reddit.com/r/Physics/comments/915mc/yes_kamiltonian_is_a_real_term_in_physics/

Adaptation and ergodicity.

We certainly agree that the naive approach of using a non-parametric kernel density estimator on the chain history (as in [Christian’s book, Example 8.8](http://goo.gl/vMHEpy)) as a *proposal* fails spectacularly on simple examples: the probability of proposing in unexplored regions is extremely small, independent of the current position of the MCMC trajectory. This is not what we do though. Instead, we use the gradient of a density estimator, and not the density itself, for our HMC proposal. Just like [KAMH](http://arxiv.org/abs/1307.5302), KMC lite in fact falls back to Random Walk Metropolis in previously unexplored regions and therefore inherits geometric ergodicity properties. This in particular includes the ability to explore previously “unseen” regions, even if adaptation has stopped. I implemented a simple illustration and comparison [here](http://nbviewer.ipython.org/gist/karlnapf/da0089726c43ed52a899).
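To make the fall-back behaviour concrete, here is a minimal sketch, assuming a Gaussian kernel and a surrogate of the form $f(x) = \sum_i \alpha_i k(z_i, x)$. The names `surrogate_grad` and `leapfrog` are made up for this illustration, and the weights `alpha` are placeholders rather than coefficients fitted by score matching. Far from the chain history the kernel terms vanish, the surrogate gradient is numerically zero, and the leapfrog trajectory reduces to a straight-line move along the Gaussian momentum, i.e. a random-walk proposal.

```python
import numpy as np

def surrogate_grad(x, Z, alpha, sigma):
    """Gradient of f(x) = sum_i alpha_i * exp(-||x - z_i||^2 / (2 sigma^2))."""
    diff = x - Z                                              # shape (n, d)
    k = np.exp(-np.sum(diff**2, axis=1) / (2.0 * sigma**2))   # shape (n,)
    return -(alpha * k) @ diff / sigma**2                     # shape (d,)

def leapfrog(x, p, grad_log_density, step, n_steps):
    """Standard leapfrog integrator driven by a (surrogate) log-density gradient."""
    x, p = x.copy(), p.copy()
    p += 0.5 * step * grad_log_density(x)
    for i in range(n_steps):
        x += step * p
        if i < n_steps - 1:
            p += step * grad_log_density(x)
    p += 0.5 * step * grad_log_density(x)
    return x, p

rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 2))   # stand-in for the chain history
alpha = np.full(50, 1.0 / 50)      # placeholder weights (assumption, not fitted)
sigma = 1.0
grad = lambda x: surrogate_grad(x, Z, alpha, sigma)

x_near = np.array([0.5, -0.5])     # inside the explored region
x_far = np.array([30.0, 30.0])     # far away from every z_i
p0 = rng.standard_normal(2)

grad_norm_near = np.linalg.norm(grad(x_near))
grad_norm_far = np.linalg.norm(grad(x_far))
x_end, _ = leapfrog(x_far, p0, grad, step=0.1, n_steps=20)
# With a vanishing gradient, x_end is just x_far + step * n_steps * p0.
```

The accept/reject step (not shown) still uses the true target, so the proposal can carry the chain into regions the estimator has never seen.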

ABC example.

The main point of the ABC example is that our method does not suffer from the additional bias from Gaussian synthetic likelihoods when confronted with skewed models. But there is also a computational efficiency aspect. The scheme by [Meeds et al.](https://xianblog.wordpress.com/2015/03/13/hamiltonian-abc/) relies on finite differences and requires $2D$ simulations from the likelihood *every time* the gradient is evaluated (i.e. every leapfrog iteration), and H-ABC subsequently discards this valuable information. In contrast, KMC accumulates gradient information from simulations: it only needs to simulate from the likelihood *once*, in the accept/reject step after the leapfrog integration (where gradients are available in closed form). The density is only updated then, and not during the leapfrog integration. Similar work on speeding up HMC via energy surrogates can be applied in the [tall data scenario](http://arxiv.org/abs/1504.01418).
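The $2D$ count can be checked with a small sketch. It assumes plain central finite differences, which is one way to arrive at $2D$ evaluations per gradient and not necessarily the exact estimator used in H-ABC. `CountingTarget` is a hypothetical stand-in for a simulation-based log-likelihood; here it is a deterministic quadratic so the result can be verified.

```python
import numpy as np

class CountingTarget:
    """Toy log-density that counts how often it is evaluated."""
    def __init__(self):
        self.calls = 0

    def __call__(self, theta):
        self.calls += 1
        return -0.5 * np.sum(theta**2)

def central_fd_gradient(f, theta, h=1e-4):
    """Central finite differences: two evaluations of f per dimension."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for d in range(theta.size):
        e = np.zeros_like(theta)
        e[d] = h
        grad[d] = (f(theta + e) - f(theta - e)) / (2.0 * h)
    return grad

target = CountingTarget()
D = 5
g = central_fd_gradient(target, np.ones(D))
# target.calls is now 2 * D. A trajectory of L leapfrog steps would need
# roughly 2 * D * L such evaluations, versus a single simulation per
# proposal when the gradient comes from a surrogate.
```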

Monte Carlo gradients.

Approximating HMC when gradients aren’t available is in general a difficult problem. One approach (like surrogate models) may work well in some scenarios while a different approach (e.g. Monte Carlo) may work better in others, and the ABC example showcases such a case. We very much doubt that one size will fit all; rather, we claim that it is of interest to find and document these scenarios.

Michael raised the concern that intractable gradients in the Pseudo-Marginal case can be avoided by running an MCMC chain on the joint space (e.g. $(f,\theta)$ for the GP classifier). To us, however, the situation is not that clear. In many cases, the correlations between variables can cause convergence problems for the MCMC (see e.g. [here](http://arxiv.org/pdf/1301.2878.pdf)) and have to be addressed by de-correlation schemes (as [here](http://papers.nips.cc/paper/4114-slice-sampling-covariance-hyperparameters-of-latent-gaussian-models)), or e.g. by incorporating [geometric information](http://www.dcs.gla.ac.uk/inference/rmhmc/), which also needs fixes such as [Michael’s very own one](http://arxiv.org/abs/1212.4693). Which is the method of choice for a particular statistical problem at hand? Which method gives the smallest estimation error (if that is the goal) for a given problem? Estimation error per time? A thorough comparison of these different classes of algorithms in terms of performance related to problem class would help here. Most papers (including ours) only show experiments favouring their own method.

GP estimator quality.

Finally, to address Michael’s point on the consistency of the GP estimator of the density gradient: this is discussed in the [original paper on the infinite dimensional exponential family](http://arxiv.org/pdf/1312.3516v3.pdf). As Michael points out, higher dimensional problems are unavoidably harder; however, the specific details are rather involved. First, in terms of theory: in both the well-specified case (when the natural parameter is in the RKHS, Section 4) and the ill-specified case (the natural parameter is in a “reasonable”, larger class of functions, Section 5), the estimate is consistent. Consistency is obtained in various metrics, including the $L^2$ error on gradients. The rates depend on how smooth the natural parameter is (and indeed a poor choice of hyper-parameter will mean slower convergence). The key point, in regards to Michael’s question, is that the smoothness requirement becomes more restrictive as the dimension increases: see Section 4.2, “range space assumption”.
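For readers without the paper at hand, the setting is roughly the following. The symbols $q_0$ (base density), $A$ (log-partition function), $p_0$ (true density) and $\mathcal{H}$ (the RKHS) are not defined in this comment and are only meant as a sketch of that paper's notation:

$$p_f(x) = \exp\big(f(x) - A(f)\big)\, q_0(x), \qquad A(f) = \log \int \exp\big(f(x)\big)\, q_0(x)\, dx, \qquad f \in \mathcal{H},$$

with $f$ the natural parameter. The estimator minimises the score-matching (Fisher divergence) objective

$$J(f) = \frac{1}{2} \int p_0(x)\, \big\| \nabla_x \log p_0(x) - \nabla_x \log p_f(x) \big\|^2 \, dx .$$

Since $\nabla_x \log p_f(x) = \nabla_x f(x) + \nabla_x \log q_0(x)$, the objective never touches $A(f)$, which is why the gradient can be estimated without the normalising constant and why the $L^2$ error on gradients is a natural metric for consistency.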

Second, in terms of practice: we have found in experiments that the infinite dimensional exponential family does perform considerably better than a kernel density estimator when the dimension increases (Section 6). In other words, our density estimator can take advantage of smoothness properties of the “true” target density to get good convergence rates. As a practical strategy for hyper-parameter choice, we cross-validate, which works well empirically despite being distasteful to Bayesians. Experiments in the KMC paper also indicate that we can scale these estimators up to dimensions in the 100s on laptop computers (unlike most other gradient estimation techniques in HMC, e.g. the ones in your [HMC & sub-sampling note](https://goo.gl/1a1Q8v), or the finite differences in [Meeds et al.](https://xianblog.wordpress.com/2015/03/13/hamiltonian-abc/)).

All the best

]]>Incidentally, the ridge in the likelihood makes the performance of RWM for theta | y highly prior dependent, so it is a default, but it’s not a good one…

]]>Arbitrary, no. Three level hierarchy with a big Gaussian bit at the second level is classical (I’m aware of versions since the late 90s). See, for example, all disease mapping, spatial statistics and some point process literature.

]]>Not a Gibbs scheme. theta | y (marginalising out latent x), then x | theta, y.

We did the sampler for the theta | y step. Clearly this is a reasonable way to go as it breaks the posterior dependence between theta and x that can cause a lot of trouble.

The state of the art suggested in previous papers was RWM for this bit, so we show that KMC did a little better.
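The structure of the two-step scheme can be sketched on a toy model. This is an assumption made purely for illustration and not the GP classifier from the paper: theta ~ N(0, 1), x | theta ~ N(theta, 1), y | x ~ N(x, 1). Here x can be integrated out exactly, so no pseudo-marginal estimate is needed. The theta | y step below uses plain RWM, which is the step KMC would replace, and `sample_two_step` is a made-up name.

```python
import numpy as np

def log_marginal_posterior(theta, y):
    """log p(theta | y) up to a constant, with the latent x integrated out.

    Prior: theta ~ N(0, 1); marginal likelihood: y | theta ~ N(theta, 2).
    """
    return -0.5 * theta**2 - 0.25 * (y - theta)**2

def sample_two_step(y, n_iter, rw_scale=1.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.0
    thetas = np.empty(n_iter)
    xs = np.empty(n_iter)
    for t in range(n_iter):
        # Step 1: theta | y, random-walk Metropolis on the marginal posterior.
        proposal = theta + rw_scale * rng.standard_normal()
        log_ratio = (log_marginal_posterior(proposal, y)
                     - log_marginal_posterior(theta, y))
        if np.log(rng.uniform()) < log_ratio:
            theta = proposal
        # Step 2: x | theta, y, exact draw from N((theta + y) / 2, 1 / 2).
        x = rng.normal(0.5 * (theta + y), np.sqrt(0.5))
        thetas[t] = theta
        xs[t] = x
    return thetas, xs

thetas, xs = sample_two_step(y=3.0, n_iter=20000)
# Analytically, E[theta | y] = y / 3 and E[x | y] = 2 * y / 3 for this toy model.
```

Because theta is updated without conditioning on x, the theta chain is unaffected by the posterior dependence between theta and x.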

I’m not making the argument that you can do joint sampling. We’ll have to agree to disagree on whether that means I can’t make the previous claim.

Personally I have never seen a Markov chain Monte Carlo scheme (including Riemann Manifold Hamiltonian Monte Carlo, and there are actually several contributions on this issue in the discussions of the read paper) that can handle joint sampling of hyperparameters and latent data in an arbitrarily complex hierarchical model. There are so many complex dependencies that the posterior will look very peculiar, and constructing a general purpose sampler that can handle all of those and navigate to regions of interest seems to me to be an extremely difficult task. So I would raise my eyebrows at any MCMC scheme claiming to do that. If you know one, though, please let me know!

]]>Thinking a little bit more about it, if you’ve used a pseudomarginal method to marginalise out part of the hierarchy, I’m pretty sure you can’t argue that it works on a hierarchical model. To make an argument that it works for this as a hierarchical model, you’d have to use KMC to *jointly* estimate the function and the parameters.

]]>It’s not clear to me how you did that example. Is it a Gibbs type thing i.e.

sample theta | x, y with KMC

sample x | y, theta with KMC

You mention pseudomarginal methods, but I can’t for the life of me work out why you’d use that sledgehammer for such a simple problem…

A joint accept/reject scheme is fine.

]]>