## MCMC, variational inference, invertible flows… bridging the gap?

Posted in Books, Mountains, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , on October 2, 2020 by xi'an

Two weeks ago, my friend [see here when climbing Pic du Midi d’Ossau in 2005!] and coauthor Éric Moulines gave a very interesting on-line talk entitled MCMC, Variational Inference, Invertible Flows… Bridging the gap?, which was merging MCMC, variational autoencoders, and variational inference. I paid close attention as I plan to teach an advanced course on acronyms next semester in Warwick. (By acronyms, I mean ABC+GAN+VAE!)

The notion in this work is that variational autoencoders are based on over-simple mean-field variational distributions, that usually produce a poor approximation of the target distribution. Éric and his coauthors propose to introduce a Metropolis step in the VAE. This leads to a more general notion of Markov transitions and a global balance condition. Hamiltonian Monte Carlo can be used as well and it improves the latent distribution approximation, namely the encoder, which is surprising to me. The steps of the Markov kernel produce a manageable transform of the initial mean field approximation, a random version of the original VAE. Manageable provided not too many MCMC steps are implemented. (Now, the flow of slides was much too fast for me to get a proper understanding of the implementation of the method, of the degree of its calibration, and of the computing cost. I need to read the associated papers.)

Once the talk was over, I went back to changing tires and tubes, as two bikes of mine had flat tires, the latest being a spectacular explosion (!) that seemingly went through the tire (although I believe the opposite happened, namely the tire got slashed and induced the tube to blow out very quickly). Blame the numerous bits of broken glass over bike paths.

## adaptive and delayed MCMC for expensive likelihoods [reply from the authors]

Posted in Books, Statistics with tags , , , , , , , , on October 29, 2015 by xi'an

[Chris Sherlock, Andrew Golightly and Daniel Henderson have written a reply about my earlier comments on their arXived paper which works better as a post than as a comment:]

Thank you for the constructive criticism of our paper. Our approach uses a simple weighted average of nearest neighbours and we agree that GPs offer a useful alternative. Both methods have pros and cons, however we first note a similarity: Kriging using a GP also leads to a weighted average of values.

The two most useful pros of the GP are that, (i) by estimating the parameters of the GP one may represent the scales of variability more accurately than a simple nearest neighbour approach with weighting according to Euclidean distance, and (ii) one obtains a distribution for the uncertainty in the Kriging estimate of the log-likelihood.

Both the papers in the blog entry (as well as other recent papers which use GPs), in one way or another take advantage of the second point. However, as acknowledged in Richard Wilkinson’s paper, estimating the parameters of a GP is computationally very costly, and this estimation must be repeated as the training data set grows. Probably for this reason and because of the difficulty in identifying p(p+1)/2 kernel range parameters, Wilkinson’s paper uses a diagonal covariance structure for the kernel. We can find no description of the structure of the covariance function that is used for each statistic in the Meeds & Welling paper but this issue is difficult to avoid.

Our initial training run is used to transform the parameters so that they are approximately orthogonal with unit variance and Euclidean distance is a sensible metric. This has two consequences: (i) the KD-tree is easier to set up and use, and (ii) the nearest neighbours in a KD-tree that is approximately balanced can be found in O(log N) operations, where N is the number of training points. Both (i) and (ii) only require Euclidean distance to be a reasonable measure, not perfect, so there is no need for the training run to have “properly converged”, just for it to represent the gross relationships in the posterior and for the transformation to be 1-1. We note a parallel between our approximate standardisation using training data, and the need to estimate a symmetric matrix of distance parameters from training data to obtain a fully representative GP kernel.

The GP approach might lead to a more accurate estimate of the posterior than a nearest neighbour approach (for a fixed number of training points), but this is necessary for the algorithms in the papers mentioned above since they sample from an approximation to the posterior. As noted in the blog post the delayed-acceptance step (which also could be added to GP-based algorithms) ensures that our algorithm samples from the true posterior so accuracy is helpful for efficiency rather than essential for validity.

We have made the kd-tree C code available and put some effort into making the interface straightforward to use. Our starting point is an existing simple MCMC algorithm; as it is already evaluating the posterior (or an unbiased approximation) then why not store this and take advantage of it within the existing algorithm? We feel that our proposal offers a relatively cheap and straightforward route for this.

## adaptive and delayed MCMC for expensive likelihoods

Posted in Books, Statistics with tags , , , , , , , , on October 26, 2015 by xi'an

Chris Sherlock, Andrew Golightly and Daniel Henderson recently arXived a paper on a new kind of delayed acceptance.

“With simplicity in mind, we focus on a k-nearest neighbour regression model as the cheap surrogate.”

The central notion in the paper is to extrapolate from values of the likelihoods at a few points in the parameter space towards the whole space through a k-nearest neighbour estimate. While this solution is simple and relatively cheap to compute, it is unclear it is a good surrogate because it does not account for the structure of the model while depending on the choice of a distance. Recent works on Gaussian process approximations seem more relevant. See e.g. papers by Ed Meeds and Max Welling, or by Richard Wilkinson for ABC versions. Obviously, because this is a surrogate only for the first stage delayed acceptance (while the second stage is using the exact likelihood, as in our proposal), the approximation does not have to be super-tight. It should also favour the exploration of tails since (a) any proposal θ outside the current support of the chain is allocated a surrogate value that is the average of its k neighbours, hence larger than the true value in the tails, and (b) due to the delay a larger scale can be used in the random walk proposal. As the authors acknowledge, the knn method deteriorates quickly with the dimension. And computing the approximation grows with the number of MCMC iterations, given that the algorithm is adaptive and uses the exact likelihood values computed so far. Only for the first stage approximation, though, which explains “why” the delayed acceptance algorithm converges. I wondered for a short while whether this was enough to justify convergence, given that the original Metropolis-Hastings probability is just broken into two parts. Since the second stage compensates for the use of a surrogate on the first step, it should not matter in the end. However, the rejection of a proposal still depends on this approximation, i.e., differs from the original algorithm, and hence is turning the Markov chain into a non-Markovian process.

“The analysis sheds light on how computationally cheap the deterministic approximation needs to be to make its use worthwhile and on the relative importance of it matching the `location’ and curvature of the target.”

I had missed the “other” paper by some of the authors on the scaling of delayed acceptance, where they “assume that the error in the cheap deterministic approximation is a realisation of a random function” (p.3).  In which they provide an optimal scaling result for high dimensions à la Roberts et al. (1997), namely a scale of 2.38 (times the target scale) in the random walk proposal. The paper however does not describe the cheap approximation to the target or pseudo-marginal version.

A large chunk of the paper is dedicated to the construction and improvement of the KD-tree used to find the k nearest neighbours. In O(d log(n)) time. Algorithm on which I have no specific comment. Except maybe that the construction of a KD-tree in accordance with a Mahalanobis distance discussed in Section 2.1 requires that the MCMC algorithm has properly converged, which is unrealistic. And also that the construction of a balanced tree seems to require heavy calibrations.

The paper is somewhat harder to read than need be (?) because the authors cumulate the idea of delayed acceptance based on this knn approximation with the technique of pseudo-marginal Metropolis-Hastings. While there is an added value in doing so it complexifies the exposition. And leads to ungainly acronyms like adaptive “da-PsMMH”, which simply are un-readable (!).

I would suggest some material to be published as supplementary material and the overall length of the paper to be reduced. For instance, Section 4.2 is not particularly conclusive. See, e.g., Theorem 2. Or the description of the simulated models in Section 5, which is sometimes redundant.