Archive for Metropolis-Hastings algorithms

ergodicity of approximate MCMC chains with applications to large datasets

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , on August 31, 2015 by xi'an

bhamAnother arXived paper I read on my way to Warwick! And yet another paper written by my friend Natesh Pillai (and his co-author Aaron Smith, from Ottawa). The goal of the paper is to study the ergodicity and the degree of approximation of the true posterior distribution of approximate MCMC algorithms that recently flourished as an answer to “Big Data” issues… [Comments below are about the second version of this paper.] One of the most curious results in the paper is the fact that the approximation may prove better than the original kernel, in terms of computing costs! If asymptotically in the computing cost. There also are acknowledged connections with the approximative MCMC kernel of Pierre Alquier, Neal Friel, Richard Everitt and A Boland, briefly mentioned in an earlier post.

The paper starts with a fairly theoretical part, to follow with an application to austerity sampling [and, in the earlier version of the paper, to the Hoeffding bounds of Bardenet et al., both discussed earlier on the ‘Og, to exponential random graphs (the paper being rather terse on the description of the subsampling mechanism), to stochastic gradient Langevin dynamics (by Max Welling and Yee-Whye Teh), and to ABC-MCMC]. The assumptions are about the transition kernels of a reference Markov kernel and of one associated with the approximation, imposing some bounds on the Wasserstein distance between those kernels, K and K’. Results being generic, there is no constraint as to how K is chosen or on how K’ is derived from K. Except in Lemma 3.6 and in the application section, where the same proposal kernel L is used for both Metropolis-Hastings algorithms K and K’. While I understand this makes for an easier coupling of the kernels, this also sounds like a restriction to me in that modifying the target begs for a similar modification in the proposal, if only because the tails they are a-changin’

In the case of subsampling the likelihood to gain computation time (as discussed by Korattikara et al. and by Bardenet et al.), the austerity algorithm as described in Algorithm 2 is surprising as the average of the sampled data log-densities and the log-transform of the remainder of the Metropolis-Hastings probability, which seem unrelated, are compared until they are close enough.  I also find hard to derive from the different approximation theorems bounding exceedance probabilities a rule to decide on the subsampling rate as a function of the overall sample size and of the computing cost. (As a side if general remark, I remain somewhat reserved about the subsampling idea, given that it requires the entire dataset to be available at every iteration. This makes parallel implementations rather difficult to contemplate.)

an extension of nested sampling

Posted in Books, Statistics, University life with tags , , , , , , , on December 16, 2014 by xi'an

I was reading [in the Paris métro] Hastings-Metropolis algorithm on Markov chains for small-probability estimation, arXived a few weeks ago by François Bachoc, Lionel Lenôtre, and Achref Bachouch, when I came upon their first algorithm that reminded me much of nested sampling: the following was proposed by Guyader et al. in 2011,

To approximate a tail probability P(H(X)>h),

  • start from an iid sample of size N from the reference distribution;
  • at each iteration m, select the point x with the smallest H(x)=ξ and replace it with a new point y simulated under the constraint H(y)≥ξ;
  • stop when all points in the sample are such that H(X)>h;
  • take


as the unbiased estimator of P(H(X)>h).

Hence, except for the stopping rule, this is the same implementation as nested sampling. Furthermore, Guyader et al. (2011) also take advantage of the bested sampling fact that, if direct simulation under the constraint H(y)≥ξ is infeasible, simulating via one single step of a Metropolis-Hastings algorithm is as valid as direct simulation. (I could not access the paper, but the reference list of Guyader et al. (2011) includes both original papers by John Skilling, so the connection must be made in the paper.) What I find most interesting in this algorithm is that it even achieves unbiasedness (even in the MCMC case!).

another instance of ABC?

Posted in Statistics with tags , , , , , on December 2, 2014 by xi'an

“These characteristics are (1) likelihood is not available; (2) prior information is available; (3) a portion of the prior information is expressed in terms of functionals of the model that cannot be converted into an analytic prior on model parameters; (4) the model can be simulated. Our approach depends on an assumption that (5) an adequate statistical model for the data are available.”

A 2009 JASA paper by Ron Gallant and Rob McCulloch, entitled “On the Determination of General Scientific Models With Application to Asset Pricing”, may have or may not have connection with ABC, to wit the above quote, but I have trouble checking whether or not this is the case.

The true (scientific) model parametrised by θ is replaced with a (statistical) substitute that is available in closed form. And parametrised by g(θ). [If you can get access to the paper, I’d welcome opinions about Assumption 1 therein which states that the intractable density is equal to a closed-form density.] And the latter is over-parametrised when compared with the scientific model. As in, e.g., a N(θ,θ²) scientific model versus a N(μ,σ²) statistical model. In addition, the prior information is only available on θ. However, this does not seem to matter that much since (a) the Bayesian analysis is operated on θ only and (b) the Metropolis approach adopted by the authors involves simulating a massive number of pseudo-observations, given the current value of the parameter θ and the scientific model, so that the transform g(θ) can be estimated by maximum likelihood over the statistical model. The paper suggests using a secondary Markov chain algorithm to find this MLE. Which is claimed to be a simulated annealing resolution (p.121) although I do not see the temperature decreasing. The pseudo-model is then used in a primary MCMC step.

Hence, not truly an ABC algorithm. In the same setting, ABC would use a simulated dataset the same size as the observed dataset, compute the MLEs for both and compare them. Faster if less accurate when Assumption 1 [that the statistical model holds for a restricted parametrisation] does not stand.

Another interesting aspect of the paper is about creating and using a prior distribution around the manifold η=g(θ). This clearly relates to my earlier query about simulating on measure zero sets. The paper does not bring a definitive answer, as it never simulates exactly on the manifold, but this constitutes another entry on this challenging problem…

efficient exploration of multi-modal posterior distributions

Posted in Books, Statistics, University life with tags , , , , on September 1, 2014 by xi'an

The title of this recent arXival had potential appeal, however the proposal ends up being rather straightforward and hence  anti-climactic! The paper by Hu, Hendry and Heng proposes to run a mixture of proposals centred at the various modes of  the target for an efficient exploration. This is a correct MCMC algorithm, granted!, but the requirement to know beforehand all the modes to be explored is self-defeating, since the major issue with MCMC is about modes that are  omitted from the exploration and remain undetected throughout the simulation… As provided, this is a standard MCMC algorithm with no adaptive feature and I would rather suggest our population Monte Carlo version, given the available information. Another connection with population Monte Carlo is that I think the performances would improve by Rao-Blackwellising the acceptance rate, i.e. removing the conditioning on the (ancillary) component of the index. For PMC we proved that using the mixture proposal in the ratio led to an ideally minimal variance estimate and I do not see why randomising the acceptance ratio in the current case would bring any improvement.

understanding the Hastings algorithm

Posted in Books, Statistics with tags , , , , , on August 26, 2014 by xi'an

David Minh and Paul Minh [who wrote a 2001 Applied Probability Models] have recently arXived a paper on “understanding the Hastings algorithm”. They revert to the form of the acceptance probability suggested by Hastings (1970):

\rho(x,y) = s(x,y) \left(1+\dfrac{\pi(x) q(y|x)}{\pi(y) q(x|y)}\right)^{-1}

where s(x,y) is a symmetric function keeping the above between 0 and 1, and q is the proposal. This obviously includes the standard Metropolis-Hastings form of the ratio, as well as Barker’s (1965):

\rho(x,y) = \left(1+\dfrac{\pi(x) q(y|x)}{\pi(y) q(x|y)}\right)^{-1}

which is known to be less efficient by accepting less often (see, e.g., Antonietta Mira’s PhD thesis). The authors also consider the alternative

\rho(x,y) = \min(\pi(y)/ q(y|x),1)\,\min(q(x|y)/\pi(x),1)

which I had not seen earlier. It is a rather intriguing quantity in that it can be interpreted as (a) a simulation of y from the cutoff target corrected by reweighing the previous x into a simulation from q(x|y); (b) a sequence of two acceptance-rejection steps, each concerned with a correspondence between target and proposal for x or y. There is an obvious caveat in this representation when the target is unnormalised since the ratio may then be arbitrarily small… Yet another alternative could be proposed in this framework, namely the delayed acceptance probability of our paper with Marco and Clara, one special case being

\rho(x,y) = \min(\pi_1(y)q(x|y)/\pi_1(x) q(y|x),1)\,\min(\pi_2(y)/\pi_1(x),1)



is an arbitrary decomposition of the target. An interesting remark in the paper is that any Hastings representation can alternatively be written as

\rho(x,y) = \min(\pi(y)/k(x,y)q(y|x),1)\,\min(k(x,y)q(x|y)/\pi(x),1)

where k(x,y) is a (positive) symmetric function. Hence every single Metropolis-Hastings is also a delayed acceptance in the sense that it can be interpreted as a two-stage decision.

The second part of the paper considers an extension of the accept-reject algorithm where a value y proposed from a density q(y) is accepted with probability

\min(\pi(y)/ Mq(y),1)

and else the current x is repeated, where M is an arbitrary constant (incl. of course the case where it is a proper constant for the original accept-reject algorithm). Curiouser and curiouser, as Alice would say! While I think I have read some similar proposal in the past, I am a wee intrigued at the appear of using only the proposed quantity y to decide about acceptance, since it does not provide the benefit of avoiding generations that are rejected. In this sense, it appears as the opposite of our vanilla Rao-Blackwellisation. (The paper however considers the symmetric version called the independent Markovian minorizing algorithm that only depends on the current x.) In the extension to proposals that depend on the current value x, the authors establish that this Markovian AR is in fine equivalent to the generic Hastings algorithm, hence providing an interpretation of the “mysterious” s(x,y) through a local maximising “constant” M(x,y). A possibly missing section in the paper is the comparison of the alternatives, albeit the authors mention Peskun’s (1973) result that exhibits the Metropolis-Hastings form as the optimum.

a day for comments

Posted in Mountains, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , , , , on April 21, 2014 by xi'an

As I was flying over Skye (with [maybe] a first if hazy perspective on the Cuillin ridge!) to Iceland, three long sets of replies to some of my posts appeared on the ‘Og:

Thanks to them for taking the time to answer my musings…


adaptive subsampling for MCMC

Posted in pictures, Statistics, Travel with tags , , , , , , , , , , , on April 15, 2014 by xi'an

Oxford to Coventry, Feb. 25, 2012

“At equilibrium, we thus should not expect gains of several orders of magnitude.”

As was signaled to me several times during the MCqMC conference in Leuven, Rémi Bardenet, Arnaud Doucet and Chris Holmes (all from Oxford) just wrote a short paper for the proceedings of ICML on a way to speed up Metropolis-Hastings by reducing the number of terms one computes in the likelihood ratio involved in the acceptance probability, i.e.


The observations appearing in this likelihood ratio are a random subsample from the original sample. Even though this leads to an unbiased estimator of the true log-likelihood sum, this approach is not justified on a pseudo-marginal basis à la Andrieu-Roberts (2009). (Writing this in the train back to Paris, I am not convinced this approach is in fact applicable to this proposal as the likelihood itself is not estimated in an unbiased manner…)

In the paper, the quality of the approximation is evaluated by Hoeffding’s like inequalities, which serves as the basis for a stopping rule on the number of terms eventually evaluated in the random subsample. In fine, the method uses a sequential procedure to determine if enough terms are used to take the decision and the probability to take the same decision as with the whole sample is bounded from below. The sequential nature of the algorithm requires to either recompute the vector of likelihood terms for the previous value of the parameter or to store all of them for deriving the partial ratios. While the authors adress the issue of self-evaluating whether or not this complication is worth the effort, I wonder (from my train seat) why they focus so much on recovering the same decision as with the complete likelihood ratio and the same uniform. It would suffice to get the same distribution for the decision (an alternative that is easier to propose than to create of course). I also (idly) wonder if a Gibbs version would be manageable, i.e. by changing only some terms in the likelihood ratio at each iteration, in which case the method could be exact… (I found the above quote quite relevant as, in an alternative technique we are constructing with Marco Banterle, the speedup is particularly visible in the warmup stage.) Hence another direction in this recent flow of papers attempting to speed up MCMC methods against the incoming tsunami of “Big Data” problems.


Get every new post delivered to your Inbox.

Join 945 other followers