informed proposals for local MCMC in discrete spaces

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , on April 17, 2020 by xi'an

Last year Giacomo Zanella published a paper entitled informed proposals for local MCMC in discrete spaces in JASA. Which I had missed somehow and only discovered through another paper, and which we recently discussed at Paris-Dauphine with graduate students, marooned by COVID-19 . Probability targets in discrete spaces are intrinsically hard[er] to simulate in my opinion if only because there is no natural distance, hence no natural neighbourhood. A random walk proposal like the reference kernel in the paper is not directly calibrated. Without demarginalisation there is neither a clear version of calculus for implementing MALA or HMC. What indeed is HMC on a discrete space? If this requires “embedding the binary space in a continuous space”, it does not sound very enticing if the construct is context dependent.

“This would allow for more moves to be accepted and longer moves to be performed, thus improving the algorithm’s efficiency.”

A interesting aspect of the paper is that for near atomic transition kernels K, informally for small σ’s, the proposal switch to Q finds target x normalising constant as new stationary and close to the actual target. Which incidentally reminded me of our vanilla Rao-Blackwellisation with Randal Douc. This however begets the worry that it may prove unwieldy in continuous cases, as except for Gaussian kernels, the  proposal switch to Q may prove intractable and requires further MCMC steps, in a form of infinite regress. Plus a musing that, were the original kernel K to be replaced with the new Q, another informed proposal transform could be applied to Q. Further infinite regress…

“[The optimality of the Metropolis-Hastings choice of acceptance probability] does not translate to the context of balancing functions.”

The paper indeed exhibits a setting that is rehabilitating Barker’ (1965) version of the acceptance probability, but I never  was very much convinced there was a significant difference in using one or the other. During our virtual (?) discussion, we also wondered at the adaptive abilities of the approach, e.g., selecting among a finite family of g’s (according to which criterion) or parameterising g towards an optimal choice of its parameter. And at the capacity for Rao-Blackwellisation since the proposal have to consider the entire set of neighbours prior to moving to a likely one.

thinning a Markov chain, statistically

Posted in Books, pictures, R, Statistics with tags , , , , , , on June 13, 2017 by xi'an

Art Owen has arXived a new version of his thinning MCMC paper, where he studies how thinning or subsampling can improve computing time in MCMC chains. I remember quite well the message set by Mark Berliner and Steve MacEachern in an early 1990’s paper that subsampling was always increasing the variance of the resulting estimators. We actually have this result in our Monte Carlo Statistical Methods book. Now, there are other perspectives on this, as for instance cases when thinning can be hard-wired by simulating directly a k-step move, delaying rejection or acceptance, prefetching, or simulating directly the accepted values as in our vanilla Rao-Blackwellisation approach. Here, Art considers the case when there is a cost θ of computing a transform of the simulation [when the transition cost a unit] and when those transforms are positively correlated with correlation ρ. Somewhat unsurprisingly, when θ is large enough, thinning becomes worth implementing. But requires extra computations in evaluating the correlation ρ and the cost θ, which is rarely comparable with the cost of computing the likelihood itself, a requirement for the Metropolis-Hastings or Hamiltonian Monte Carlo step(s).  Subsampling while keeping the right target (which is a hard constraint!) should thus have a much more effective impact on computing budgets.

MCMskv #4 [house with a vision]

Posted in Statistics with tags , , , , , , , , , , , , on January 9, 2016 by xi'an

Last day at MCMskv! Not yet exhausted by this exciting conference, but this was the toughest day with one more session and a tutorial by Art Own on quasi Monte-Carlo. (Not even mentioning the night activities that I skipped. Or the ski break that I did not even consider.) Krys Latunszynski started with a plenary on exact methods for discretised diffusions, with a foray in Bernoulli factory problems. Then a neat session on adaptive MCMC methods that contained a talk by Chris Sherlock on delayed acceptance, where the approximation to the target was built by knn trees. (The adaptation was through the construction of the tree by including additional evaluations of the target density. Another paper sitting in my to-read list for too a long while: the exploitation of the observed values of π towards improving an MCMC sampler has always be “obvious” to me even though I could not see any practical way of doing so. )

It was wonderful that Art Owen accepted to deliver a tutorial at MCMskv on quasi-random Monte Carlo. Great tutorial, with a neat coverage of the issues most related to Monte Carlo integration. Since quasi-random sequences have trouble with accept/reject methods, a not-even-half-baked idea that came to me during Art’s tutorial was that the increased computing power granted by qMC could lead to a generic integration of the Metropolis-Hastings step in a Rao-Blackwellised manner. Art mentioned he was hoping that in a near future one could switch between pseudo- and quasi-random in an almost automated manner when running standard platforms like R. This would indeed be great, especially since quasi-random sequences seem to be available at the same cost as their pseudo-random counterpart. During the following qMC session, Art discussed the construction of optimal sequences on sets other than hypercubes (with the surprising feature that projecting optimal sequences from the hypercube does not work). Mathieu Gerber presented the quasi-random simulated annealing algorithm he developed with Luke Bornn that I briefly discussed a while ago. Or thought I did as I cannot trace a post on that paper! While the fact that annealing also works with quasi-random sequences is not astounding, the gain over random sequences shown on two examples is clear. The session also had a talk by Lester Mckey who relies Stein’s discrepancy to measure the value of an approximation to the true target. This was quite novel, with a surprising connection to Chris Oates’ talk and the use of score-based control variates, if used in a dual approach.

Another great session was the noisy MCMC one organised by Paul Jenkins (Warwick), with again a coherent presentation of views on the quality or lack thereof of noisy (or inexact) versions, with an update from Richard Everitt on inexact MCMC, Felipe Medina Aguayo (Warwick) on sufficient conditions for noisy versions to converge (and counterexamples), Jere Koskela (Warwick) on a pseudo-likelihood approach to the highly complex Kingman’s coalescent model in population genetics (of ABC fame!), and Rémi Bardenet on the tall data approximations techniques discussed in a recent post. Having seen or read most of those results previously did not diminish the appeal of the session.

locally weighted MCMC

Posted in Books, Statistics, University life with tags , , , , , , , , on July 16, 2015 by xi'an

Last week, on arXiv, Espen Bernton, Shihao Yang, Yang Chen, Neil Shephard, and Jun Liu (all from Harvard) proposed a weighting scheme to associated MCMC simulations, in connection with the parallel MCMC of Ben Calderhead discussed earlier on the ‘Og. The weight attached to each proposal is either the acceptance probability itself (with the rejection probability being attached to the current value of the MCMC chain) or a renormalised version of the joint target x proposal, either forward or backward. Both solutions are unbiased in that they have the same expectation as the original MCMC average, being some sort of conditional expectation. The proof of domination in the paper builds upon Calderhead’s formalism.

This work reminded me of several reweighting proposals we made over the years, from the global Rao-Blackwellisation strategy with George Casella, to the vanilla Rao-Blackwellisation solution we wrote with Randal Douc a few years ago, both of whom also are demonstrably improving upon the standard MCMC average. By similarly recycling proposed but rejected values. Or by diminishing the variability due to the uniform draw. The slightly parallel nature of the approach also connects with our parallel MCM version with Pierre Jacob (now Harvard as well!) and Murray Smith (who now leaves in Melbourne, hence the otherwise unrelated picture).

parallelising MCMC algorithms

Posted in Books, Statistics, University life with tags , , , on December 23, 2014 by xi'an

This paper, A general construction for parallelizing Metropolis-Hastings algorithms, written by Ben Calderhead, was first presented at MCMSki last January and has now appeared in PNAS. It is somewhat related to the recycling idea of Tjelmeland (2004, unpublished) and hence to our 1996 Rao-Blackwellisation paper with George. Although there is no recycling herein.

At each iteration of Ben’s algorithm, N proposed values are generated conditional on the “current” value of the Markov chain, which actually consists of (N+1) components and from which one component is drawn at random to serve as a seed for the next proposal distribution and the simulation of N other values. In short, this is a data-augmentation scheme with the index I on the one side and the N modified components on the other side. The neat trick in the proposal [and the reason for the jump in efficiency] is that the stationary distribution of the auxiliary variable can be determined and hence used (N+1) times in updating the vector of (N+1) components. (Note that picking the index at random means computing all (N+1) possible transitions from one component to the N others. Or even all (N+1)! if the proposals differ. Hence a potential increase in the computing cost, even though what costs the most is usually the likelihood computation, dispatched on the parallel processors.) While there are (N+1) terms involved at each step, the genuine Markov chain is truly over a single chain and the N other proposed values are not recycled. Even though they could be [for Monte Carlo integration purposes], as shown e.g. in our paper with Pierre Jacob and Murray Smith. Something that took a few iterations for me to understand is why Ben rephrases the original Metropolis-Hastings algorithm as a finite state space Markov chain on the set of indices {1,…,N+1} (Proposition 1). Conditionally on the values of the (N+1) vector, the stationary of that sub-chain is no longer uniform. Hence, picking (N+1) indices from the stationary helps in selecting the most appropriate images, which explains why the rejection rate decreases.

The paper indeed evaluates the impact of increasing the number of proposals in terms of effective sample size (ESS), acceptance rate, and mean squared jump distance, based two examples. As often in parallel implementations, the paper suggests an “N-fold increase in computational speed” even though this is simply the effect of running the same algorithm on a single processor and on N parallel processors. If the comparison is between a single proposal Metropolis-Hastings algorithm on a single processor and an N-fold proposal on N processors, I would say the latter is slower because of the selection of the index I that forces all pairs of reverse move.  Nonetheless, since this is an almost free bonus resulting from using N processors, when compared with more complex coupled chains, it sounds worth investigating and comparing with those more complex parallel schemes.

understanding the Hastings algorithm

Posted in Books, Statistics with tags , , , , , on August 26, 2014 by xi'an

David Minh and Paul Minh [who wrote a 2001 Applied Probability Models] have recently arXived a paper on “understanding the Hastings algorithm”. They revert to the form of the acceptance probability suggested by Hastings (1970):

$\rho(x,y) = s(x,y) \left(1+\dfrac{\pi(x) q(y|x)}{\pi(y) q(x|y)}\right)^{-1}$

where s(x,y) is a symmetric function keeping the above between 0 and 1, and q is the proposal. This obviously includes the standard Metropolis-Hastings form of the ratio, as well as Barker’s (1965):

$\rho(x,y) = \left(1+\dfrac{\pi(x) q(y|x)}{\pi(y) q(x|y)}\right)^{-1}$

which is known to be less efficient by accepting less often (see, e.g., Antonietta Mira’s PhD thesis). The authors also consider the alternative

$\rho(x,y) = \min(\pi(y)/ q(y|x),1)\,\min(q(x|y)/\pi(x),1)$

which I had not seen earlier. It is a rather intriguing quantity in that it can be interpreted as (a) a simulation of y from the cutoff target corrected by reweighing the previous x into a simulation from q(x|y); (b) a sequence of two acceptance-rejection steps, each concerned with a correspondence between target and proposal for x or y. There is an obvious caveat in this representation when the target is unnormalised since the ratio may then be arbitrarily small… Yet another alternative could be proposed in this framework, namely the delayed acceptance probability of our paper with Marco and Clara, one special case being

$\rho(x,y) = \min(\pi_1(y)q(x|y)/\pi_1(x) q(y|x),1)\,\min(\pi_2(y)/\pi_1(x),1)$

where

$\pi(x)\propto\pi_1(x)\pi_2(x)$

is an arbitrary decomposition of the target. An interesting remark in the paper is that any Hastings representation can alternatively be written as

$\rho(x,y) = \min(\pi(y)/k(x,y)q(y|x),1)\,\min(k(x,y)q(x|y)/\pi(x),1)$

where k(x,y) is a (positive) symmetric function. Hence every single Metropolis-Hastings is also a delayed acceptance in the sense that it can be interpreted as a two-stage decision.

The second part of the paper considers an extension of the accept-reject algorithm where a value y proposed from a density q(y) is accepted with probability

$\min(\pi(y)/ Mq(y),1)$

and else the current x is repeated, where M is an arbitrary constant (incl. of course the case where it is a proper constant for the original accept-reject algorithm). Curiouser and curiouser, as Alice would say! While I think I have read some similar proposal in the past, I am a wee intrigued at the appear of using only the proposed quantity y to decide about acceptance, since it does not provide the benefit of avoiding generations that are rejected. In this sense, it appears as the opposite of our vanilla Rao-Blackwellisation. (The paper however considers the symmetric version called the independent Markovian minorizing algorithm that only depends on the current x.) In the extension to proposals that depend on the current value x, the authors establish that this Markovian AR is in fine equivalent to the generic Hastings algorithm, hence providing an interpretation of the “mysterious” s(x,y) through a local maximising “constant” M(x,y). A possibly missing section in the paper is the comparison of the alternatives, albeit the authors mention Peskun’s (1973) result that exhibits the Metropolis-Hastings form as the optimum.

non-negative unbiased estimators

Posted in Books, Kids, Statistics, University life with tags , , , , , on October 3, 2013 by xi'an

Pierre Jacob and Alexandre Thiéry just arXived a highly pertinent paper on the most debated issue of non-negative unbiased estimators (of positive quantities). If you remember that earlier post of mine, I mentioned the issue in connection with the Russian roulette estimator(s) of Mark Girolami et al. And, as Pierre and Alexandre point out in the paper, there is also a clear and direct connection with the Bernoulli factory problem. And with our Vanilla Rao-Blackwellisation technique (sadly overlooked, once more!).

The first thing I learned from the paper is how to turn a converging sequence into an unbiased estimator. If (En) is this converging sequence, with limit μ, then

$\sum_{n=0}^N (E_n-E_{n-1}) / \mathbb{P}(N\ge n)$

is unbiased..! Amazing. Even though the choice of the distribution of N matters towards getting a finite variance estimator, this transform is simply amazing. (Of course, once one looks at it, one realises it is the “old” trick of turning a series into a sequence and vice-versa. Still…!) And then you can reuse it into getting an unbiased estimator for almost any transform of μ.

The second novel thing in the paper is the characterisation of impossible cases for non-negative unbiased estimators. For instance, if the original sequence has an unbounded support, there cannot be such an estimator. If the support is an half-line, the transform must be monotonous monotonic. If the support is a bounded interval (a,b), then the transform must be bounded from below by a polynomial bound

$\epsilon\,\min\{(x-a)^m,(b-x)^n\}$

(where the extra-parameters obviously relate to the transform). (In this later case, the authors also show how to derive a Bernoulli estimator from the original unbiased estimator.)