Archive for adaptive MCMC methods

adaptive incremental mixture MCMC

Posted in Statistics with tags , , , , , , , on August 12, 2022 by xi'an

Sadly, I missed this adaptive incremental mixture MCMC paper by my friends Florian Maire, Nial Friel, Antonietta Mira, and Adrian E. Raftery when it came out in JCGS in 2019. The core of the paper is about building a time-inhomogeneous mixture independent proposal, starting from an initial distribution and adding one component when hitting a point for which the ratio target / proposal is large, as this points out a part of the space that is not well-enough explored, while the other components do not change, except for a proportional decrease in the weights. This proposal reminded me of the inspiring paper of Gåsemyr (2003), which in some ways inspired our population Monte Carlo sampler. Obviously, there is a what-you-get-is-what-you-see drawback to the approach in that regions where this ratio is high may never be explored by the proposal, despite its adaptivity.

The added component is Normal, centred at the associated (accepted) proposed value ø and with covariance matrix a local estimate based on past iterations of the algorithm. And with weight proportional to the (powered) target density at ø, which does not require a normalising constant. The method however requires setting a certain number of calibration parameters like the power γ for the weight, the lower bound M for the ratio target to proposal, the rate of diminishing adaptation (which is also needed for ergodicity à la Roberts and Rosenthal (2007)).  And the implicit choice of a particular parameterisation for the Normal mixture to be close enough to the target. In the posted experiments, the number of components in the mixture does not grow to unmanageable figures, but a further adaption could be in removing components that are inactive or leading to systematic rejection as we did in the population Monte Carlo paper.

automatic adaptation of MCMC algorithms

Posted in pictures, Statistics with tags , , , , , , , on March 4, 2019 by xi'an

“A typical adaptive MCMC sampler will approximately optimize performance given the kind of sampler chosen in the first place, but it will not optimize among the variety of samplers that could have been chosen.”

Last February (2018), Dao Nguyen and five co-authors arXived a paper that I missed. On a new version of adaptive MCMC that aims at selecting a wider range of proposal kernels. Still requiring a by-hand selection of this collection of kernels… Among the points addressed, beyond the theoretical guarantees that the adaptive scheme does not jeopardize convergence to the proper target, are a meta-exploration of the set of combinations of samplers and integration of the computational speed in the assessment of each sampler. Including the very difficulty of assessing mixing. One could deem the index of the proposal as an extra (cyber-)parameter to its generic parameter (like the scale in the random walk), but the discreteness of this index makes the extension more delicate than expected. And justifies the distinction between internal and external parameters. The notion of a worst-mixing dimension is quite appealing and connects with the long-term hope that one could spend the maximum fraction of the sampler runtime over the directions that are poorly mixing, while still keeping the target as should be. The adaptive scheme is illustrated on several realistic models with rather convincing gains in efficiency and time.

The convergence tools are inspired from Roberts and Rosenthal (2007), with an assumption of uniform ergodicity over all kernels considered therein which is both strong and delicate to assess in practical settings. Efficiency is rather unfortunately defined in terms of effective sample size, which is a measure of correlation or lack thereof, but which does not relate to the speed of escape from the basin of attraction of the starting point. I also wonder at the pertinence of the estimation of the effective sample size when the chain is based on different successive kernels, since the lack of correlation may be due to another kernel. Another calibration issue is the internal clock that relates to the average number of iterations required to tune properly a specific kernel, which again may be difficult to assess in a realistic situation. A last query is whether or not this scheme could be compared with an asynchronous (and valid) MCMC approach that exploits parallel capacities of the computer.

Bayesian inference with intractable normalizing functions

Posted in Books, Statistics with tags , , , , , , , , , , , on December 13, 2018 by xi'an

In the latest September issue of JASA I received a few days ago, I spotted a review paper by Jaewoo Park & Murali Haran on intractable normalising constants Z(θ). There have been many proposals for solving this problem as well as several surveys, some conferences and even a book. The current survey focus on MCMC solutions, from auxiliary variable approaches to likelihood approximation algorithms (albeit without ABC entries, even though the 2006 auxiliary variable solutions of Møller et al. et of Murray et al. do simulate pseudo-observations and hence…). This includes the MCMC approximations to auxiliary sampling proposed by Faming Liang and co-authors across several papers. And the paper Yves Atchadé, Nicolas Lartillot and I wrote ten years ago on an adaptive MCMC targeting Z(θ) and using stochastic approximation à la Wang-Landau. Park & Haran stress the relevance of using sufficient statistics in this approach towards fighting computational costs, which makes me wonder if an ABC version could be envisioned.  The paper also includes pseudo-marginal techniques like Russian Roulette (once spelled Roullette) and noisy MCMC as proposed in Alquier et al.  (2016). These methods are compared on three examples: (1) the Ising model, (2) a social network model, the Florentine business dataset used in our original paper, and a larger one where most methods prove too costly, and (3) an attraction-repulsion point process model. In conclusion, an interesting survey, taking care to spell out the calibration requirements and the theoretical validation, if of course depending on the chosen benchmarks.

MCM17 snapshots

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , on July 5, 2017 by xi'an

At MCM2017 today, Radu Craiu presented a talk on adaptive Metropolis-within-Gibbs, using a family of proposals for each component of the target and weighting them by jumping distance. And managing the adaptation from the selection rate rather than from the acceptance rate as we did in population Monte Carlo. I find the approach quite interesting in that adaptation and calibration of Metropolis-within-Gibbs is quite challenging due to the conditioning, i.e., the optimality of one scale is dependent on the other components. Some of the graphs produced by Radu during the talk showed a form of local adaptivity that seemed promising. This raised a question I could not ask for lack of time, namely that with a large enough collection of proposals, it is unclear why this approach provides a gain compared with particle, sequential or population Monte Carlo algorithms. Indeed, when there are many parallel proposals, clouds of particles can be generated from all proposals in proportion to their appeal and merged together in an importance manner, leading to an easier adaptation. As it went, the notion of local scaling also reflected in Mylène Bédard’s talk on another Metropolis-within-Gibbs study of optimal rates. The other interesting sessions I attended were the ones on importance sampling with stochastic gradient optimisation, organised by Ingmar Schuster, and on sequential Monte Carlo, with a divide-and-conquer resolution through trees by Lindsten et al. I had missed.

adaptive exchange

Posted in Books, Statistics, University life with tags , , , , , , , , , , on October 27, 2016 by xi'an

In the March 2016 issue of JASA that currently sits on my desk, there is a paper by Liang, Jim, Song and Liu on the adaptive exchange algorithm, which aims at handling posteriors for sampling distributions with intractable normalising constants. The concept behind the algorithm is the exchange principle initiated by Jesper Møller and co-authors in 2006, where an auxiliary pseudo-observation is simulated for the missing constants to vanish in a Metropolis-Hastings ratio. (The name exchangeable was introduced in a subsequent paper by Iain Murray, Zoubin Ghahramani and David MacKay, also in 2006.)

 The crux of the method is to run an iteration as [where y denotes the observation]

  1. Proposing a new value θ’ of the parameter from a proposal q(θ’|θ);
  2. Generate a pseudo-observation z~ƒ(z|θ’);
  3. Accept with probability

\dfrac{\pi(\theta')f(y|\theta')}{\pi(\theta)f(y|\theta)}\dfrac{q(\theta|\theta')f(z|\theta)}{q(\theta'|\theta)f(z|\theta')}

which has the appeal to cancel all normalising constants. And the repeal of requiring an exact simulation from the very distribution with the missing constant, ƒ(.|θ). Which means that in practice a finite number of MCMC steps will be used and will bias the outcome. The algorithm is unusual in that it replaces the exact proposal q(θ’|θ) with an unbiased random version q(θ’|θ)ƒ(z|θ’), z being just an augmentation of the proposal. (The current JASA paper by Liang et al. seems to confuse augment and argument, see p.378.)

To avoid the difficulty in simulating from ƒ(.|θ), the authors draw pseudo-observations from sampling distributions with a finite number m of parameter values under the [unrealistic] assumption (A⁰) that this collection of values provides an almost complete cover of the posterior support. One of the tricks stands with an auxiliary [time-heterogeneous] chain of pseudo-observations generated by single Metropolis steps from one of these m fixed targets. These pseudo-observations are then used in the main (or target) chain to define the above exchange probability. The auxiliary chain is Markov but time-heterogeneous since the probabilities of accepting a move are evolving with time according to a simulated annealing schedule. Which produces a convergent estimate of the m normalising constants. The main chain is not Markov in that it depends on the whole history of the auxiliary chain [see Step 5, p.380]. Even jointly the collection of both chains is not Markov. The paper prefers to consider the process as an adaptive Markov chain. I did not check the rather intricate in details, so cannot judge of the validity of the overall algorithm; I simply note that one condition (A², p.383) is incredibly strong in that it assumes the Markov transition kernel to be Doeblin uniformly on any compact set of the calibration parameters. However, the major difficulty with this approach seems to be in its delicate calibration. From providing a reference set of m parameter values scanning the posterior support to picking transition kernels on both the parameter and the sample spaces, to properly cooling the annealing schedule [always a fun part!], there seems to be [from my armchair expert’s perspective, of course!] a wide range of opportunities for missing the target or running into zero acceptance problems. Both examples analysed in the paper, the auto-logistic and the auto-normal models, are actually of limited complexity in that they depend on a few parameters, 2 and 4 resp., and enjoy sufficient statistics, of dimensions 2 and 4 as well. Hence simulating (pseudo-)realisations of those sufficient statistics should be less challenging than the original approach replicating an entire vector of thousands of dimensions.