Archive for adaptive MCMC methods

automatic adaptation of MCMC algorithms

Posted in pictures, Statistics with tags , , , , , , , on March 4, 2019 by xi'an

“A typical adaptive MCMC sampler will approximately optimize performance given the kind of sampler chosen in the first place, but it will not optimize among the variety of samplers that could have been chosen.”

Last February (2018), Dao Nguyen and five co-authors arXived a paper that I missed. On a new version of adaptive MCMC that aims at selecting a wider range of proposal kernels. Still requiring a by-hand selection of this collection of kernels… Among the points addressed, beyond the theoretical guarantees that the adaptive scheme does not jeopardize convergence to the proper target, are a meta-exploration of the set of combinations of samplers and integration of the computational speed in the assessment of each sampler. Including the very difficulty of assessing mixing. One could deem the index of the proposal as an extra (cyber-)parameter to its generic parameter (like the scale in the random walk), but the discreteness of this index makes the extension more delicate than expected. And justifies the distinction between internal and external parameters. The notion of a worst-mixing dimension is quite appealing and connects with the long-term hope that one could spend the maximum fraction of the sampler runtime over the directions that are poorly mixing, while still keeping the target as should be. The adaptive scheme is illustrated on several realistic models with rather convincing gains in efficiency and time.

The convergence tools are inspired from Roberts and Rosenthal (2007), with an assumption of uniform ergodicity over all kernels considered therein which is both strong and delicate to assess in practical settings. Efficiency is rather unfortunately defined in terms of effective sample size, which is a measure of correlation or lack thereof, but which does not relate to the speed of escape from the basin of attraction of the starting point. I also wonder at the pertinence of the estimation of the effective sample size when the chain is based on different successive kernels, since the lack of correlation may be due to another kernel. Another calibration issue is the internal clock that relates to the average number of iterations required to tune properly a specific kernel, which again may be difficult to assess in a realistic situation. A last query is whether or not this scheme could be compared with an asynchronous (and valid) MCMC approach that exploits parallel capacities of the computer.

Bayesian inference with intractable normalizing functions

Posted in Books, Statistics with tags , , , , , , , , , , , on December 13, 2018 by xi'an

In the latest September issue of JASA I received a few days ago, I spotted a review paper by Jaewoo Park & Murali Haran on intractable normalising constants Z(θ). There have been many proposals for solving this problem as well as several surveys, some conferences and even a book. The current survey focus on MCMC solutions, from auxiliary variable approaches to likelihood approximation algorithms (albeit without ABC entries, even though the 2006 auxiliary variable solutions of Møller et al. et of Murray et al. do simulate pseudo-observations and hence…). This includes the MCMC approximations to auxiliary sampling proposed by Faming Liang and co-authors across several papers. And the paper Yves Atchadé, Nicolas Lartillot and I wrote ten years ago on an adaptive MCMC targeting Z(θ) and using stochastic approximation à la Wang-Landau. Park & Haran stress the relevance of using sufficient statistics in this approach towards fighting computational costs, which makes me wonder if an ABC version could be envisioned.  The paper also includes pseudo-marginal techniques like Russian Roulette (once spelled Roullette) and noisy MCMC as proposed in Alquier et al.  (2016). These methods are compared on three examples: (1) the Ising model, (2) a social network model, the Florentine business dataset used in our original paper, and a larger one where most methods prove too costly, and (3) an attraction-repulsion point process model. In conclusion, an interesting survey, taking care to spell out the calibration requirements and the theoretical validation, if of course depending on the chosen benchmarks.

MCM17 snapshots

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , on July 5, 2017 by xi'an

At MCM2017 today, Radu Craiu presented a talk on adaptive Metropolis-within-Gibbs, using a family of proposals for each component of the target and weighting them by jumping distance. And managing the adaptation from the selection rate rather than from the acceptance rate as we did in population Monte Carlo. I find the approach quite interesting in that adaptation and calibration of Metropolis-within-Gibbs is quite challenging due to the conditioning, i.e., the optimality of one scale is dependent on the other components. Some of the graphs produced by Radu during the talk showed a form of local adaptivity that seemed promising. This raised a question I could not ask for lack of time, namely that with a large enough collection of proposals, it is unclear why this approach provides a gain compared with particle, sequential or population Monte Carlo algorithms. Indeed, when there are many parallel proposals, clouds of particles can be generated from all proposals in proportion to their appeal and merged together in an importance manner, leading to an easier adaptation. As it went, the notion of local scaling also reflected in Mylène Bédard’s talk on another Metropolis-within-Gibbs study of optimal rates. The other interesting sessions I attended were the ones on importance sampling with stochastic gradient optimisation, organised by Ingmar Schuster, and on sequential Monte Carlo, with a divide-and-conquer resolution through trees by Lindsten et al. I had missed.

adaptive exchange

Posted in Books, Statistics, University life with tags , , , , , , , , , , on October 27, 2016 by xi'an

In the March 2016 issue of JASA that currently sits on my desk, there is a paper by Liang, Jim, Song and Liu on the adaptive exchange algorithm, which aims at handling posteriors for sampling distributions with intractable normalising constants. The concept behind the algorithm is the exchange principle initiated by Jesper Møller and co-authors in 2006, where an auxiliary pseudo-observation is simulated for the missing constants to vanish in a Metropolis-Hastings ratio. (The name exchangeable was introduced in a subsequent paper by Iain Murray, Zoubin Ghahramani and David MacKay, also in 2006.)

 The crux of the method is to run an iteration as [where y denotes the observation]

  1. Proposing a new value θ’ of the parameter from a proposal q(θ’|θ);
  2. Generate a pseudo-observation z~ƒ(z|θ’);
  3. Accept with probability

\dfrac{\pi(\theta')f(y|\theta')}{\pi(\theta)f(y|\theta)}\dfrac{q(\theta|\theta')f(z|\theta)}{q(\theta'|\theta)f(z|\theta')}

which has the appeal to cancel all normalising constants. And the repeal of requiring an exact simulation from the very distribution with the missing constant, ƒ(.|θ). Which means that in practice a finite number of MCMC steps will be used and will bias the outcome. The algorithm is unusual in that it replaces the exact proposal q(θ’|θ) with an unbiased random version q(θ’|θ)ƒ(z|θ’), z being just an augmentation of the proposal. (The current JASA paper by Liang et al. seems to confuse augment and argument, see p.378.)

To avoid the difficulty in simulating from ƒ(.|θ), the authors draw pseudo-observations from sampling distributions with a finite number m of parameter values under the [unrealistic] assumption (A⁰) that this collection of values provides an almost complete cover of the posterior support. One of the tricks stands with an auxiliary [time-heterogeneous] chain of pseudo-observations generated by single Metropolis steps from one of these m fixed targets. These pseudo-observations are then used in the main (or target) chain to define the above exchange probability. The auxiliary chain is Markov but time-heterogeneous since the probabilities of accepting a move are evolving with time according to a simulated annealing schedule. Which produces a convergent estimate of the m normalising constants. The main chain is not Markov in that it depends on the whole history of the auxiliary chain [see Step 5, p.380]. Even jointly the collection of both chains is not Markov. The paper prefers to consider the process as an adaptive Markov chain. I did not check the rather intricate in details, so cannot judge of the validity of the overall algorithm; I simply note that one condition (A², p.383) is incredibly strong in that it assumes the Markov transition kernel to be Doeblin uniformly on any compact set of the calibration parameters. However, the major difficulty with this approach seems to be in its delicate calibration. From providing a reference set of m parameter values scanning the posterior support to picking transition kernels on both the parameter and the sample spaces, to properly cooling the annealing schedule [always a fun part!], there seems to be [from my armchair expert’s perspective, of course!] a wide range of opportunities for missing the target or running into zero acceptance problems. Both examples analysed in the paper, the auto-logistic and the auto-normal models, are actually of limited complexity in that they depend on a few parameters, 2 and 4 resp., and enjoy sufficient statistics, of dimensions 2 and 4 as well. Hence simulating (pseudo-)realisations of those sufficient statistics should be less challenging than the original approach replicating an entire vector of thousands of dimensions.

adaptive and delayed MCMC for expensive likelihoods [reply from the authors]

Posted in Books, Statistics with tags , , , , , , , , on October 29, 2015 by xi'an

[Chris Sherlock, Andrew Golightly and Daniel Henderson have written a reply about my earlier comments on their arXived paper which works better as a post than as a comment:]

Thank you for the constructive criticism of our paper. Our approach uses a simple weighted average of nearest neighbours and we agree that GPs offer a useful alternative. Both methods have pros and cons, however we first note a similarity: Kriging using a GP also leads to a weighted average of values.

The two most useful pros of the GP are that, (i) by estimating the parameters of the GP one may represent the scales of variability more accurately than a simple nearest neighbour approach with weighting according to Euclidean distance, and (ii) one obtains a distribution for the uncertainty in the Kriging estimate of the log-likelihood.

Both the papers in the blog entry (as well as other recent papers which use GPs), in one way or another take advantage of the second point. However, as acknowledged in Richard Wilkinson’s paper, estimating the parameters of a GP is computationally very costly, and this estimation must be repeated as the training data set grows. Probably for this reason and because of the difficulty in identifying p(p+1)/2 kernel range parameters, Wilkinson’s paper uses a diagonal covariance structure for the kernel. We can find no description of the structure of the covariance function that is used for each statistic in the Meeds & Welling paper but this issue is difficult to avoid.

Our initial training run is used to transform the parameters so that they are approximately orthogonal with unit variance and Euclidean distance is a sensible metric. This has two consequences: (i) the KD-tree is easier to set up and use, and (ii) the nearest neighbours in a KD-tree that is approximately balanced can be found in O(log N) operations, where N is the number of training points. Both (i) and (ii) only require Euclidean distance to be a reasonable measure, not perfect, so there is no need for the training run to have “properly converged”, just for it to represent the gross relationships in the posterior and for the transformation to be 1-1. We note a parallel between our approximate standardisation using training data, and the need to estimate a symmetric matrix of distance parameters from training data to obtain a fully representative GP kernel.

The GP approach might lead to a more accurate estimate of the posterior than a nearest neighbour approach (for a fixed number of training points), but this is necessary for the algorithms in the papers mentioned above since they sample from an approximation to the posterior. As noted in the blog post the delayed-acceptance step (which also could be added to GP-based algorithms) ensures that our algorithm samples from the true posterior so accuracy is helpful for efficiency rather than essential for validity.

We have made the kd-tree C code available and put some effort into making the interface straightforward to use. Our starting point is an existing simple MCMC algorithm; as it is already evaluating the posterior (or an unbiased approximation) then why not store this and take advantage of it within the existing algorithm? We feel that our proposal offers a relatively cheap and straightforward route for this.

adaptive and delayed MCMC for expensive likelihoods

Posted in Books, Statistics with tags , , , , , , , , on October 26, 2015 by xi'an

Chris Sherlock, Andrew Golightly and Daniel Henderson recently arXived a paper on a new kind of delayed acceptance.

“With simplicity in mind, we focus on a k-nearest neighbour regression model as the cheap surrogate.”

The central notion in the paper is to extrapolate from values of the likelihoods at a few points in the parameter space towards the whole space through a k-nearest neighbour estimate. While this solution is simple and relatively cheap to compute, it is unclear it is a good surrogate because it does not account for the structure of the model while depending on the choice of a distance. Recent works on Gaussian process approximations seem more relevant. See e.g. papers by Ed Meeds and Max Welling, or by Richard Wilkinson for ABC versions. Obviously, because this is a surrogate only for the first stage delayed acceptance (while the second stage is using the exact likelihood, as in our proposal), the approximation does not have to be super-tight. It should also favour the exploration of tails since (a) any proposal θ outside the current support of the chain is allocated a surrogate value that is the average of its k neighbours, hence larger than the true value in the tails, and (b) due to the delay a larger scale can be used in the random walk proposal. As the authors acknowledge, the knn method deteriorates quickly with the dimension. And computing the approximation grows with the number of MCMC iterations, given that the algorithm is adaptive and uses the exact likelihood values computed so far. Only for the first stage approximation, though, which explains “why” the delayed acceptance algorithm converges. I wondered for a short while whether this was enough to justify convergence, given that the original Metropolis-Hastings probability is just broken into two parts. Since the second stage compensates for the use of a surrogate on the first step, it should not matter in the end. However, the rejection of a proposal still depends on this approximation, i.e., differs from the original algorithm, and hence is turning the Markov chain into a non-Markovian process.

“The analysis sheds light on how computationally cheap the deterministic approximation needs to be to make its use worthwhile and on the relative importance of it matching the `location’ and curvature of the target.”

I had missed the “other” paper by some of the authors on the scaling of delayed acceptance, where they “assume that the error in the cheap deterministic approximation is a realisation of a random function” (p.3).  In which they provide an optimal scaling result for high dimensions à la Roberts et al. (1997), namely a scale of 2.38 (times the target scale) in the random walk proposal. The paper however does not describe the cheap approximation to the target or pseudo-marginal version.

A large chunk of the paper is dedicated to the construction and improvement of the KD-tree used to find the k nearest neighbours. In O(d log(n)) time. Algorithm on which I have no specific comment. Except maybe that the construction of a KD-tree in accordance with a Mahalanobis distance discussed in Section 2.1 requires that the MCMC algorithm has properly converged, which is unrealistic. And also that the construction of a balanced tree seems to require heavy calibrations.

The paper is somewhat harder to read than need be (?) because the authors cumulate the idea of delayed acceptance based on this knn approximation with the technique of pseudo-marginal Metropolis-Hastings. While there is an added value in doing so it complexifies the exposition. And leads to ungainly acronyms like adaptive “da-PsMMH”, which simply are un-readable (!).

I would suggest some material to be published as supplementary material and the overall length of the paper to be reduced. For instance, Section 4.2 is not particularly conclusive. See, e.g., Theorem 2. Or the description of the simulated models in Section 5, which is sometimes redundant.

debunking a (minor and personal) myth

Posted in Books, Kids, R, Statistics, University life with tags , , , , on September 9, 2015 by xi'an

diriXFor quite a while, I entertained the idea that Beta and Dirichlet proposals  were more adequate than (log-)normal random walks proposals for parameters on (0,1) and simplicia (simplices, simplexes), respectively, when running an MCMC. For instance, for p in (0,1) the value of the Markov chain at time t-1, the proposal at time t could be a Be(εp,ε{1-p}) generator, since its mean is equal to p and its variance is proportional to 1/(1+ε). (Although I cannot find track of this notion in my books.) The parameter ε can be calibrated towards a given acceptance rate, like the golden number 0.234 of Gelman, Gilks and Roberts (1996). However, when using this proposal on a mixture model, Kaniav Kamari and myself realised today that there is a catch, namely that pushing ε down to achieve an acceptance rate near 0.234 may end up in disaster, since the parameters of the Beta or of the Dirichlet may become lower than 1, which implies an infinite explosion on some boundaries of the parameter space. An explosion that gets more and more serious as ε decreases to zero. Hence is more and more likely to decrease the acceptance rate, thus to reduce ε, which in turns concentrates even more the support on the boundary and leads to a vicious circle and no convergence to the target acceptance rate… Continue reading