Archive for Barker’s algorithm

transport, diffusions, and sampling

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , on November 19, 2022 by xi'an

At the Sampling, Transport, and Diffusions workshop at the Flatiron Institute, on Day #2, Marilou Gabrié (École Polytechnique) gave the second introductory lecture on merging sampling and normalising flows targeting the target distribution, when driven by a divergence criterion like KL, that only requires the shape of the target density. I first wondered about ergodicity guarantees in simultaneous MCMC and map training due to the adaptation of the flow but the update of the map only depends on the current particle cloud in (8). From an MCMC perspective, it sounds somewhat paradoxical to see the independent sampler making such an unexpected come-back when considering that no insider information is available about the (complex) posterior to drive the [what-you-get-is-what-you-see] construction of the transport map. However, the proposed approach superposed local (random-walk like) and global (transport) proposals in Algorithm 1.

Qiang Liu followed on learning transport maps, with the  Interesting notion of causalizing a graph by removing intersections (which are impossible for an ODE, as discussed by Eric Vanden-Eijden’s talk yesterday) through  coupling. Which underlies his notion of rectified flows. Possibly connecting with the next lightning talk by Jonathan Weare on spurious modes created by a variational Monte Carlo sampler and the use of stochastic gradient, corrected by (case-dependent?) regularisation.

Then came a whole series of MCMC talks!

Sam Livingstone spoke on Barker’s proposal (an incoming Biometrika paper!) as part of a general class of transforms g of the MH ratio, using jump processes based on a nasty normalising constant related with g (tractable for the original Barker algorithm). I then realised I had missed his StatSci paper on how to speak to statistical physics researchers!

Charles Margossian spoke about using a massive number of short parallel runs (many-short-chain regime) from a recent paper written with Aki,  Andrew, and Lionel Riou-Durand (Warwick) among others. Which brings us back to the challenge of producing convergence diagnostics and precisely the Gelman-Rubin R statistic or its recent nR avatar (with its linear limitations and dependence on parameterisation, as opposed to fuller distributional criteria). The core of the approach is in using blocks of GPUs to improve and speed-up the estimation of the between-chain variance. (D for R².) I still wonder at a waste of simulations / computing power resulting from stopping the runs almost immediately after warm-up is over, since reaching the stationary regime or an approximation thereof should be exploited more efficiently. (Starting from a minimal discrepancy sample would also improve efficiency.)

Lu Zhang also talked on the issue of cutting down warmup, presenting a paper co-authored with Bob, Andrew, and Aki, recommending Laplace / variational approximations for reaching faster high-posterior-density regions, using an algorithm called Pathfinder that relies on ELBO checks to counter poor performances of Laplace approximations. In the spirit of the workshop, it could be profitable to further transform / push-forward the outcome by a transport map.

Yuling Yao (of stacking and Pareto smoothing fame!) gave an original and challenging (in a positive sense) talk on the many ways of bridging densities [linked with the remark he shared with me the day before] and their statistical significance. Questioning our usual reliance on arithmetic or geometric mixtures. Ignoring computational issues, selecting a bridging pattern sounds not different from choosing a parameterised family of embedding distributions. This new typology of models can then be endowed with properties that are more or less appealing. (Occurences of the Hyvärinen score and our mixtestin perspective in the talk!)

Miranda Holmes-Cerfon talked about MCMC on stratification (illustrated by this beautiful picture of nanoparticle random walks). Which means sampling under varying constraints and dimensions with associated densities under the respective Hausdorff measures. This sounds like a perfect setting for reversible jump and in a sense it is, as mentioned in the talks. Except that the moves between manifolds are driven by the proximity to said manifold, helping with a higher acceptance rate, and making the proposals easier to construct since projections (or the reverses) have a physical meaning. (But I could not tell from the talk why the approach was seemingly escaping the symmetry constraint set by Peter Green’s RJMCMC on the reciprocal moves between two given manifolds).

maximal couplings of the Metropolis-Hastings algorithm

Posted in Statistics, University life with tags , , , , , , , , , on November 17, 2020 by xi'an

As a sequel to their JRSS B paper, John O’Leary, Guanyang Wang, and [my friend, co-author and former student!] Pierre E. Jacob have recently posted a follow-up paper on maximal coupling for Metropolis-Hastings algorithms, where maximal is to be understood in terms of the largest possible probability for the coupled chains to be equal, according to the bound set by the coupling inequality. It made me realise that there is a heap of very recent works in this area.

A question that came up when reading the paper with our PhD students is whether or not the coupled chains stay identical after meeting once. When facing two different targets this seems inevitable and indeed Lemma 2 seems to show that no. A strong lemma that does not [need to] state what happens outside the diagonal Δ.

One of the essential tricks is to optimise several kinds of maximal coupling, incl. one for the Bernoullesque choice of moving, as given on p.3.

Algorithm 1 came as a novelty to me as it first seemed (to me!) the two chains may never meet, but this was before I read the small prints of the transition (proposal) kernel being maximally coupled with itself. While Algorithm 2 may be the earliest example of Metropolis-Hastings coupling I have seen, namely in 1999 in Crete, in connection with a talk by Laird Breyer and Gareth Roberts at a workshop of our ESSS network. As explained by the authors, this solution is not always a maximal coupling for the reason that

min(q¹.q²) min(α¹,α²) ≤ min(q¹α¹,q²α²)

(with q for the transition kernel and α for the acceptance probability). Lemma 1 is interesting in that it describes the probability to un-meet (!) as the surface between one of the move densities and the minimum of the two.

The first solution is to couple by plain Accept-Reject with the first chain being the proposed value and if rejected [i.e. not in C] to generate from the remainder or residual of the second target, in a form of completion of acceptance-rejection (accept when above rather than below, i.e. in A or A’). This can be shown to be a maximal coupling. Another coupling using reflection residuals works better but requires some spherical structure in the kernel. A further coupling on the acceptance of the Metropolis-Hastings move seems to bring an extra degree of improvement.

In the introduction, the alternatives about the acceptance probability α(·,·), e.g. Metropolis-Hastings versus Barker, are mentioned but would it make a difference to the preferred maximal coupling when using one or the other?

A further comment is that, in larger dimensions, I mean larger than one!, a Gibbsic form of coupling could be considered. In which case it would certainly decrease the coupling probability but may still speed up the overall convergence by coupling more often. See “maximality is sometimes less important than other properties of a coupling, such as the contraction behavior when a meeting does not occur.” (p.8)

As a final pun, I noted that Vaserstein is not a typo, as Leonid Vaseršteĭn is a Russian-American mathematician, currently at Penn State.

informed proposals for local MCMC in discrete spaces

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , on April 17, 2020 by xi'an

Last year Giacomo Zanella published a paper entitled informed proposals for local MCMC in discrete spaces in JASA. Which I had missed somehow and only discovered through another paper, and which we recently discussed at Paris-Dauphine with graduate students, marooned by COVID-19 . Probability targets in discrete spaces are intrinsically hard[er] to simulate in my opinion if only because there is no natural distance, hence no natural neighbourhood. A random walk proposal like the reference kernel in the paper is not directly calibrated. Without demarginalisation there is neither a clear version of calculus for implementing MALA or HMC. What indeed is HMC on a discrete space? If this requires “embedding the binary space in a continuous space”, it does not sound very enticing if the construct is context dependent.

“This would allow for more moves to be accepted and longer moves to be performed, thus improving the algorithm’s efficiency.”

A interesting aspect of the paper is that for near atomic transition kernels K, informally for small σ’s, the proposal switch to Q finds target x normalising constant as new stationary and close to the actual target. Which incidentally reminded me of our vanilla Rao-Blackwellisation with Randal Douc. This however begets the worry that it may prove unwieldy in continuous cases, as except for Gaussian kernels, the  proposal switch to Q may prove intractable and requires further MCMC steps, in a form of infinite regress. Plus a musing that, were the original kernel K to be replaced with the new Q, another informed proposal transform could be applied to Q. Further infinite regress…

“[The optimality of the Metropolis-Hastings choice of acceptance probability] does not translate to the context of balancing functions.”

The paper indeed exhibits a setting that is rehabilitating Barker’ (1965) version of the acceptance probability, but I never  was very much convinced there was a significant difference in using one or the other. During our virtual (?) discussion, we also wondered at the adaptive abilities of the approach, e.g., selecting among a finite family of g’s (according to which criterion) or parameterising g towards an optimal choice of its parameter. And at the capacity for Rao-Blackwellisation since the proposal have to consider the entire set of neighbours prior to moving to a likely one.

Monte Carlo calculations of the radial distribution functions for a proton-electron plasma

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , on October 11, 2017 by xi'an

“In conclusion, the Monte Carlo method of calculating radial distribution functions in a plasma is a feasible approach if significant computing time is available (…) The results indicate that at least 10000 iterations must be completed before the system can be considered near to its equilibrium state, and for a badly chosen starting configuration, the run would need to be considerably longer (…) for more conclusive results a longer run is needed so that the energy of the system can settle into an equilibrium pattern and steady-state radial distribution functions can be obtained.” A.A. Barker

Looking for the history behind Barker’s formula the other day made me look for the original 1965 paper. Which got published in the Australian Journal of Physics at the beginning of Barker’s PhD at the University of Adelaide.

As shown in the above screenshot, the basis  of Barker’s algorithm is indeed Barker’s acceptance probability, albeit written in a somewhat confusing way since the current value of the chain is kept if a Uniform variate is smaller than what is actually the rejection probability. No mistake there! And more interestingly, Barker refers to Wood and Parker (1957) for the “complete and rigorous theory” behind the method. (Both Wood and Parker being affiliated with Los Alamos Scientific Laboratory, while Barker acknowledges support from both the Australian Institute of Nuclear Science and Engineering and the Weapons Research Establishment, Salisbury… This were times when nuclear weapon research was driving MCMC. Hopefully we will not come back to such times. Or, on the pessimistic side, we will not have time to come back to such times!)

As in Metropolis et al. (1953), the analysis is made on a discretised (finite) space, building the Markov transition matrix, stating the detailed balance equation (called microscopic reversibility). Interestingly, while Barker acknowledges that there are other ways of assigning the transition probability, his is the “most rapid” in terms of mixing. And equally interestingly, he discusses the scale of the random walk in the [not-yet-called] Metropolis-within-Gibbs move as major, targetting 0.5 as the right acceptance rate, and suggesting to adapt this scale on the go. There is also a side issue that is apparently not processed with all due rigour, namely the fact that the particles in the system cannot get arbitrarily close from one another. It is unclear how a proposal falling below this distance is processed by Barker’s algorithm. When implemented on 32 particles, this algorithm took five hours to execute 6100 iterations. With a plot of the target energy function that does not shout convergence, far from it! As acknowledged by Barker himself (p.131).

The above quote is from the conclusion and its acceptance of the need for increased computing times comes as a sharp contrast with this week when one of our papers was rejected based on this very feature..!

Barker at the Bernoulli factory

Posted in Books, Statistics with tags , , , , , , , on October 5, 2017 by xi'an

Yesterday, Flavio Gonçalves, Krzysztof Latuszýnski, and Gareth Roberts (Warwick) arXived a paper on Barker’s algorithm for Bayesian inference with intractable likelihoods.

“…roughly speaking Barker’s method is at worst half as good as Metropolis-Hastings.”

Barker’s acceptance probability (1965) is a smooth if less efficient version of Metropolis-Hastings. (Barker wrote his thesis in Adelaide, in the Mathematical Physics department. Most likely, he never interacted with Ronald Fisher, who died there in 1962) This smoothness is exploited by devising a Bernoulli factory consisting in a 2-coin algorithm that manages to simulate the Bernoulli variable associated with the Barker probability, from a coin that can simulate Bernoulli’s with probabilities proportional to [bounded] π(θ). For instance, using a bounded unbiased estimator of the target. And another coin that simulates another Bernoulli on a remainder term. Assuming the bound on the estimate of π(θ) is known [or part of the remainder term]. This is a neat result in that it expands the range of pseudo-marginal methods (and resuscitates Barker’s formula from oblivion!). The paper includes an illustration in the case of the far-from-toyish Wright-Fisher diffusion. [Making Fisher and Barker meeting, in the end!]

%d bloggers like this: