Archive for tempering

parallel tempering on optimised paths

Posted in Statistics with tags , , , , , , , , , , , , , , , on May 20, 2021 by xi'an

Saifuddin Syed, Vittorio Romaniello, Trevor Campbell, and Alexandre Bouchard-Côté, whom I met and discussed with on my “last” trip to UBC, on December 2019, just arXived a paper on parallel tempering (PT), making the choice of tempering path an optimisation problem. They address the touchy issue of designing a sequence of tempered targets when the starting distribution π⁰, eg the prior, and the final distribution π¹, eg the posterior, are hugely different, eg almost singular.

“…theoretical analysis of reversible variants of PT has shown that adding too many intermediate chains can actually deteriorate performance (…) [while] on non reversible regime adding more chains is guaranteed to improve performances.”

The above applies to geometric combinations of π⁰ and π¹. Which “suffers from an arbitrarily suboptimal global communication barrier“, according to the authors (although the counterexample is not completely convincing since π⁰ and π¹ share the same variance). They propose a more non-linear form of tempering with constraints on the dependence of the powers on the temperature t∈(0,1).  Defining the global communication barrier as an average over temperatures of the rejection rate, the path characteristics (e.g., the coefficients of a spline function) can then be optimised in terms of this objective. And the temperature schedule is derived from the fact that the non-asymptotic round trip rate is maximized when the rejection rates are all equal. (As a side item, the technique exposed in the earlier tempering paper by Syed et al. was recently exploited for a night high resolution imaging of a black hole from the M87 galaxy.)

computational statistics and molecular simulation [18w5023]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , , on November 14, 2018 by xi'an

On Day 2, Carsten Hartmann used a representation of the log cumulant as solution to a minimisation problem over a collection of importance functions (by the Vonsker-Varadhan principle), with links to X entropy and optimal control, a theme also considered by Alain Dunmus when considering the uncorrected discretised Langevin diffusion with a decreasing sequence of discretisation scale factors (Jordan, Kinderlehrer and Otto) in the spirit of convex regularisation à la Rockafellar. Also representing ULA as an inexact gradient descent algorithm. Murray Pollock (Warwick) presented a new technique called fusion to simulate from products of d densities, as in scalable MCMC (but not only). With an (early) starting and startling remark that when simulating one realisation from each density in the product and waiting for all of them to be equal means simulating from the product, in a strong link to the (A)BC fundamentals. This is of course impractical and Murray proposes to follow d Brownian bridges all ending up in the average of these simulations, constructing an acceptance probability that is computable and validating the output.

The second “hand-on” lecture was given by Gareth Roberts (Warwick) on the many aspects of scaling MCMC algorithms, which started with the famous 0.234 acceptance rate paper in 1996. While I was aware of some of these results (!), the overall picture was impressive, including a notion of complexity I had not seen before. And a last section on PDMPs where Gareth presented very recent on the different scales of convergence of Zigzag and bouncy particle samplers, mostly to the advantage of Zigzag.In the afternoon, Jeremy Heng presented a continuous time version of simulated tempering by adding a drift to the Langevin diffusion with time-varying energy, which must be solution to the Liouville pde \text{div} \pi_t f = \partial_t \pi_t. Which connects to a flow transport problem when solving the pde under additional conditions. Unclear to me was the creation of the infinite sequence. This talk was very much at the interface in the spirit of the workshop! (Maybe surprisingly complex when considering the endpoint goal of simulating from a given target.) Jonathan Weare’s talk was about quantum chemistry which translated into finding eigenvalues of an operator. Turning in to a change of basis in a inhumanly large space (10¹⁸⁰ dimensions!). Matt Moore presented the work on Raman spectroscopy he did while a postdoc at Warwick, with an SMC based classification of the peaks of a spectrum (to be used on Mars?) and Alessandra Iacobucci (Dauphine) showed us the unexpected thermal features exhibited by simulations of chains of rotors subjected to both thermal and mechanical forcings, which we never discussed in Dauphine beyond joking on her many batch jobs running on our cluster!

And I remembered today that there is currently and in parallel another BIRS workshop on statistical model selection [and a lot of overlap with our themes] taking place in Banff! With snow already there! Unfair or rather #unfair, as someone much too well-known would whine..! Not that I am in a position to complain about the great conditions here in Oaxaca (except for having to truly worry about stray dogs rather than conceptually about bears makes running more of a challenge, if not the altitude since both places are about the same).

nested sampling when prior and likelihood clash

Posted in Books, Statistics with tags , , , , , , , , , on April 3, 2018 by xi'an

A recent arXival by Chen, Hobson, Das, and Gelderblom makes the proposal of a new nested sampling implementation when prior and likelihood disagree, making simulations from the prior inefficient. The paper holds the position that a single given prior is used over and over all datasets that come along:

“…in applications where one wishes to perform analyses on many thousands (or even millions) of different datasets, since those (typically few) datasets for which the prior is unrepresentative can absorb a large fraction of the computational resources.” Chen et al., 2018

My reaction to this situation, provided (a) I want to implement nested sampling and (b) I realise there is a discrepancy, would be to resort to an importance sampling resolution, as we proposed in our Biometrika paper with Nicolas. Since one objection [from the authors] is that identifying outlier datasets is complicated (it should not be when the likelihood function can be computed) and time-consuming, sequential importance sampling could be implemented.

“The posterior repartitioning (PR) method takes advantage of the fact that nested sampling makes use of the likelihood L(θ) and prior π(θ) separately in its exploration of the parameter space, in contrast to Markov chain Monte Carlo (MCMC) sampling methods or genetic algorithms which typically deal solely in terms of the product.” Chen et al., 2018

The above salesman line does not ring a particularly convincing chime in that nested sampling is about as myopic as MCMC since based on the similar notion of a local proposal move, starting from the lowest likelihood argument (the minimum likelihood estimator!) in the nested sample.

“The advantage of this extension is that one can choose (π’,L’) so that simulating from π’ under the constraint L'(θ) > l is easier than simulating from π under the constraint L(θ) > l. For instance, one may choose an instrumental prior π’ such that Markov chain Monte Carlo steps adapted to the instrumental constrained prior are easier to implement than with respect to the actual constrained prior. In a similar vein, nested importance sampling facilitates contemplating several priors at once, as one may compute the evidence for each prior by producing the same nested sequence, based on the same pair (π’,L’), and by simply modifying the weight function.” Chopin & Robert, 2010

Since the authors propose to switch to a product (π’,L’) such that π’.L’=π.L, the solution appears like a special case of importance sampling, with the added drwaback that when π’ is not normalised, its normalised constant must be estimated as well. (With an extra nested sampling implementation?) Furthermore, the advocated solution is to use tempering, which is not so obvious as it seems in small dimensions. As the mass does not always diffuse to relevant parts of the space. A more “natural” tempering would be to use a subsample in the (sub)likelihood for nested sampling and keep the remainder of the sample for weighting the evaluation of the evidence.

accelerating MCMC

Posted in Statistics with tags , , , , , , , , , , , , on May 29, 2017 by xi'an

I have recently [well, not so recently!] been asked to write a review paper on ways of accelerating MCMC algorithms for the [review] journal WIREs Computational Statistics and would welcome all suggestions towards the goal of accelerating MCMC algorithms. Besides [and including more on]

  • coupling strategies using different kernels and switching between them;
  • tempering strategies using flatter or lower dimensional targets as intermediary steps, e.g., à la Neal;
  • sequential Monte Carlo with particle systems targeting again flatter or lower dimensional targets and adapting proposals to this effect;
  • Hamiltonian MCMC, again with connections to Radford (and more generally ways of avoiding rejections);
  • adaptive MCMC, obviously;
  • Rao-Blackwellisation, just as obviously (in the sense that increasing the precision in the resulting estimates means less simulations).

ABC by population annealing

Posted in Statistics, University life with tags , , , , , , , , on January 6, 2015 by xi'an

The paper “Bayesian Parameter Inference and Model Selection by Population Annealing in System Biology” by Yohei Murakami got published in PLoS One last August but I only became aware of it when ResearchGate pointed it out to me [by mentioning one of our ABC papers was quoted there].

“We are recommended to try a number of annealing schedules to check the influence of the schedules on the simulated data (…) As a whole, the simulations with the posterior parameter ensemble could, not only reproduce the data used for parameter inference, but also capture and predict the data which was not used for parameter inference.”

Population annealing is a notion introduced by Y Iba, the very same IBA who introduced the notion of population Monte Carlo that we studied in subsequent papers. It reproduces the setting found in many particle filter papers of a sequence of (annealed or rather tempered) targets ranging from an easy (i.e., almost flat) target to the genuine target, and of an update of a particle set by MCMC moves and reweighing. I actually have trouble perceiving the difference with other sequential Monte Carlo schemes as those exposed in Del Moral, Doucet and Jasra (2006, Series B). And the same is true of the ABC extension covered in this paper. (Where the annealed intermediate targets correspond to larger tolerances.) This sounds like a traditional ABC-SMC algorithm. Without the adaptive scheme on the tolerance ε found e.g. in Del Moral et al., since the sequence is set in advance. [However, the discussion about the implementation includes the above quote that suggests a vague form of cross-validated tolerance construction]. The approximation of the marginal likelihood also sounds standard, the marginal being approximated by the proportion of accepted pseudo-samples. Or more exactly by the sum of the SMC weights at the end of the annealing simulation. This actually raises several questions: (a) this estimator is always between 0 and 1, while the marginal likelihood is not restricted [but this is due to a missing 1/ε in the likelihood estimate that cancels from both numerator and denominator]; (b) seeing the kernel as a non-parametric estimate of the likelihood led me to wonder why different ε could not be used in different models, in that the pseudo-data used for each model under comparison differs. If we were in a genuine non-parametric setting the bandwidth would be derived from the pseudo-data.

“Thus, Bayesian model selection by population annealing is valid.”

The discussion about the use of ABC population annealing somewhat misses the point of using ABC, which is to approximate the genuine posterior distribution, to wit the above quote: that the ABC Bayes factors favour the correct model in the simulation does not tell anything about the degree of approximation wrt the original Bayes factor. [The issue of non-consistent Bayes factors does not apply here as there is no summary statistic applied to the few observations in the data.] Further, the magnitude of the variability of the values of this Bayes factor as ε varies, from 1.3 to 9.6, mostly indicates that the numerical value is difficult to trust. (I also fail to explain the huge jump in Monte Carlo variability from 0.09 to 1.17 in Table 1.) That this form of ABC-SMC improves upon the basic ABC rejection approach is clear. However it needs to build some self-control to avoid arbitrary calibration steps and reduce the instability of the final estimates.

“The weighting function is set to be large value when the observed data and the simulated data are ‘‘close’’, small value when they are ‘‘distant’’, and constant when they are ‘‘equal’’.”

The above quote is somewhat surprising as the estimated likelihood f(xobs|xobs,θ) is naturally constant when xobs=xsim… I also failed to understand how the model intervened in the indicator function used as a default ABC kernel