Archive for auxiliary variable

estimating constants [survey]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , on February 2, 2017 by xi'an

A new survey on Bayesian inference with intractable normalising constants was posted on arXiv yesterday by Jaewoo Park and Murali Haran. A rather massive work of 58 pages, almost handy for a short course on the topic! In particular, it goes through the most common MCMC methods with a detailed description, followed by comments on components to be calibrated and the potential theoretical backup. This includes for instance the method of Liang et al. (2016) that I reviewed a few months ago. As well as the Wang-Landau technique we proposed with Yves Atchadé and Nicolas Lartillot. And the noisy MCMC of Alquier et al. (2016), also reviewed a few months ago. (The Russian Roulette solution is only mentioned very briefly as” computationally very expensive”. But still used in some illustrations. The whole area of pseudo-marginal MCMC is also missing from the picture.)

“…auxiliary variable approaches tend to be more efficient than likelihood approximation approaches, though efficiencies vary quite a bit…”

The authors distinguish between MCMC methods where the normalizing constant is approximated and those where it is omitted by an auxiliary representation. The survey also distinguishes between asymptotically exact and asymptotically inexact solutions. For instance, using a finite number of MCMC steps instead of the associated target results in an asymptotically inexact method. The question that remains open is what to do with the output, i.e., whether or not there is a way to correct for this error. In the illustration for the Ising model, the double Metropolis-Hastings version of Liang et al. (2010) achieves for instance massive computational gains, but also exhibits a persistent bias that would go undetected were it the sole method implemented. This aspect of approximate inference is not really explored in the paper, but constitutes a major issue for modern statistics (and machine learning as well, when inference is taken into account.)

In conclusion, this survey provides a serious exploration of recent MCMC methods. It begs for a second part involving particle filters, which have often proven to be faster and more efficient than MCMC methods, at least in state space models. In that regard, Nicolas Chopin and James Ridgway examined further techniques when calling to leave the Pima Indians [dataset] alone.

auxiliary variable methods as ABC

Posted in Books, pictures, Statistics, University life with tags , , , , , on May 9, 2016 by xi'an

ruins of the abbey at Tynemouth, Sept. 03, 2013Dennis Prangle and Richard Everitt arXived a note today where they point out the identity between the auxiliary variable approach of Møller et al. (2006) [or rather its multiple or annealed version à la Murray] and [exact] ABC (as in our 2009 paper) in the case of Markov random fields. The connection between the two appears when using an importance sampling step in the ABC algorithm and running a Markov chain forward and backward the same number of steps as there are levels in the annealing scheme of MAV. Maybe more a curiosity than an indicator of a large phenomenon, since it is so rare that ABC can be use in its exact form.

Bayesian model comparison with intractable constants

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , on February 8, 2016 by xi'an

abcIRichard Everitt, Adam Johansen (Warwick), Ellen Rowing and Melina Evdemon-Hogan have updated [on arXiv] a survey paper on the computation of Bayes factors in the presence of intractable normalising constants. Apparently destined for Statistics and Computing when considering the style. A great entry, in particular for those attending the CRiSM workshop Estimating Constants in a few months!

A question that came to me from reading the introduction to the paper is why a method like Møller et al.’s (2006) auxiliary variable trick should be considered more “exact” than the pseudo-marginal approach of Andrieu and Roberts (2009) since the later can equally be seen as an auxiliary variable approach. The answer was on the next page (!) as it is indeed a special case of Andrieu and Roberts (2009). Murray et al. (2006) also belongs to this group with a product-type importance sampling estimator, based on a sequence of tempered intermediaries… As noted by the authors, there is a whole spectrum of related methods in this area, some of which qualify as exact-approximate, inexact approximate and noisy versions.

Their main argument is to support importance sampling as the method of choice, including sequential Monte Carlo (SMC) for large dimensional parameters. The auxiliary variable of Møller et al.’s (2006) is then part of the importance scheme. In the first toy example, a Poisson is opposed to a Geometric distribution, as in our ABC model choice papers, for which a multiple auxiliary variable approach dominates both ABC and Simon Wood’s synthetic likelihood for a given computing cost. I did not spot which artificial choice was made for the Z(θ)’s in both models, since the constants are entirely known in those densities. A very interesting section of the paper is when envisioning biased approximations to the intractable density. If only because the importance weights are most often biased due to the renormalisation (possibly by resampling). And because the variance derivations are then intractable as well. However, due to this intractability, the paper can only approach the impact of those approximations via empirical experiments. This leads however to the interrogation on how to evaluate the validity of the approximation in settings where truth and even its magnitude are unknown… Cross-validation and bootstrap type evaluations may prove too costly in realistic problems. Using biased solutions thus mostly remains an open problem in my opinion.

The SMC part in the paper is equally interesting if only because it focuses on the data thinning idea studied by Chopin (2002) and many other papers in the recent years. This made me wonder why an alternative relying on a sequence of approximations to the target with tractable normalising constants could not be considered. A whole sequence of auxiliary variable completions sounds highly demanding in terms of computing budget and also requires a corresponding sequence of calibrations. (Now, ABC fares no better since it requires heavy simulations and repeated calibrations, while further exhibiting a damning missing link with the target density. ) Unfortunately, embarking upon a theoretical exploration of the properties of approximate SMC is quite difficult, as shown by the strong assumptions made in the paper to bound the total variation distance to the true target.

love-hate Metropolis algorithm

Posted in Books, pictures, R, Statistics, Travel with tags , , , , , , , , , on January 28, 2016 by xi'an

Hyungsuk Tak, Xiao-Li Meng and David van Dyk just arXived a paper on a multiple choice proposal in Metropolis-Hastings algorithms towards dealing with multimodal targets. Called “A repulsive-attractive Metropolis algorithm for multimodality” [although I wonder why XXL did not jump at the opportunity to use the “love-hate” denomination!]. The proposal distribution includes a [forced] downward Metropolis-Hastings move that uses the inverse of the target density π as its own target, namely 1/{π(x)+ε}. Followed by a [forced] Metropolis-Hastings upward move which target is {π(x)+ε}. The +ε is just there to avoid handling ratios of zeroes (although I wonder why using the convention 0/0=1 would not work). And chosen as 10⁻³²³ by default in connection with R smallest positive number. Whether or not the “downward” move is truly downwards and the “upward” move is truly upwards obviously depends on the generating distribution: I find it rather surprising that the authors consider the same random walk density in both cases as I would have imagined relying on a more dispersed distribution for the downward move in order to reach more easily other modes. For instance, the downward move could have been based on an anti-Langevin proposal, relying on the gradient to proceed further down…

This special choice of a single proposal however simplifies the acceptance ratio (and keeps the overall proposal symmetric). The final acceptance ratio still requires a ratio of intractable normalising constants that the authors bypass by Møller et al. (2006) auxiliary variable trick. While the authors mention the alternative pseudo-marginal approach of Andrieu and Roberts (2009), they do not try to implement it, although this would be straightforward here: since the normalising constants are the probabilities of accepting a downward and an upward move, respectively. Those can easily be evaluated at a cost similar to the use of the auxiliary variables. That is,

– generate a few moves from the current value and record the proportion p of accepted downward moves;
– generate a few moves from the final proposed value and record the proportion q of accepted downward moves;

and replace the ratio of intractable normalising constants with p/q. It is not even clear that one needs those extra moves since the algorithm requires an acceptance in the downward and upward moves, hence generate Geometric variates associated with those probabilities p and q, variates that can be used for estimating them. From a theoretical perspective, I also wonder if forcing the downward and upward moves truly leads to an improved convergence speed. Considering the case when the random walk is poorly calibrated for either the downward or upward move, the number of failed attempts before an acceptance may get beyond the reasonable.

As XXL and David pointed out to me, the unusual aspect of the approach is that here the proposal density is intractable, rather than the target density itself. This makes using Andrieu and Roberts (2009) seemingly less straightforward. However, as I was reminded this afternoon at the statistics and probability seminar in Bristol, the argument for the pseudo-marginal based on an unbiased estimator is that w Q(w|x) has a marginal in x equal to π(x) when the expectation of w is π(x). In thecurrent problem, the proposal in x can extended into a proposal in (x,w), w P(w|x) whose marginal is the proposal on x.

If we complement the target π(x) with the conditional P(w|x), the acceptance probability would then involve

{π(x’) P(w’|x’) / π(x) P(w|x)} / {w’ P(w’|x’) / w P(w|x)} = {π(x’) / π(x)} {w/w’}

so it seems the pseudo-marginal (or auxiliary variable) argument also extends to the proposal. Here is a short experiment that shows no discrepancy between target and histogram:

#love-hate move
  while (runif(1)>{pi(x)+nozero}/{pi(prop1)+nozero}){ 
  while (runif(1)>{pi(prop2)+nozero}/{pi(prop1)+nozero}) 
  if (runif(1)<pi(prop2)*bacwa/pi(x)/fowa){ 
#arbitrary bimodal target
#running the chain
fowa=1;prop1=rnorm(1,x,2) #initial estimate
  while (runif(1)>{pi(x)+nozero}/{pi(prop1)+nozero}){
for (t in 2:T)

pseudo slice sampling

Posted in Books, Statistics, University life with tags , , , , , on November 26, 2015 by xi'an

The workshop in Warwick last week made me aware of (yet) another arXiv posting I had missed: Pseudo-marginal slice sampling by Iain Murray and Matthew Graham. The idea is to mix the pseudo-marginal approach of Andrieu and Roberts (2009) with a noisy slice sampling scheme à la Neal (2003). The auxiliary random variable u used in the (pseudo-marginal) unbiased estimator of the target I(θ), Î(θ,u), and with distribution q(u) is merged with the random variable of interest so that the joint is


and a Metropolis-Hastings proposal on that target simulating from k(θ,θ’)q(u’) [meaning the auxiliary is simulated independently] recovers the pseudo-marginal Metropolis-Hastings ratio


(which is a nice alternative proof that the method works!). The novel idea in the paper is that the proposal on the auxiliary u can be of a different form, while remaining manageable. For instance, as a two-block Gibbs sampler. Or an elliptical slice sampler for the u component. The argument being that an independent update of u may lead the joint chain to get stuck. Among the illustrations in the paper, an Ising model (with no phase transition issue?) and a Gaussian process applied to the Pima Indian data set (despite a recent prohibition!). From the final discussion, I gather that the modification should be applicable to every (?) case when a pseudo-marginal approach is available, since the auxiliary distribution q(u) is treated as a black box. Quite an interesting read and proposal!

data augmentation with divergence

Posted in Books, Kids, Statistics, University life with tags , , , , , on November 18, 2015 by xi'an

Another (!) Cross Validated question that shed some light on the difficulties of explaining the convergence of MCMC algorithms. Or in understanding conditioning and hierarchical models. The author wanted to know why a data augmentation of his did not converge: In a simplified setting, given an observation y that he wrote as y=h(x,θ), he had built a Gibbs sampler by reconstructing x=g(y,θ) and simulating θ given x: at each iteration t,

  1. compute xt=g(y,θt-1)
  2. simulate θt~π(θ|xt)

and he attributed the lack of convergence to a possible difficulty with the Jacobian. My own interpretation of the issue was rather that condition on the unobserved x was not the same as conditioning on the observed y and hence that y was missing from step 2. And that the simulation of x is useless. Unless one uses it in an augmented scheme à la Xiao-Li… Nonetheless, I like the problem, if only because my very first reaction was to draw a hierarchical dependence graph and to conclude this should be correct, before checking on a toy example that it was not!

Hamming Ball Sampler

Posted in Books, Statistics, University life with tags , , , , , on May 7, 2015 by xi'an

yauMichalis Titsias and Christopher Yau just arXived a paper entitled the Hamming Ball sampler. Aimed at large and complex discrete latent variable models. The completion method is called after Richard Hamming, who is associated with code correcting methods (reminding me of one of the Master courses I took on coding, 30 years ago…), because it uses the Hamming distance in a discrete version of the slice sampler. One of the reasons for this proposal is that conditioning upon the auxiliary slice variable allows for the derivation of normalisation constants otherwise unavailable. The method still needs some calibration in the choice of blocks that partition the auxiliary variable and in the size of the ball. One of the examples assessed in the paper is a variable selection problem with 1200 covariates, out of which only 2 are relevant, while another example deals with a factorial HMM, involving 10 hidden chains. Since the paper compares each example with the corresponding block Gibbs sampling solution, it means this Gibbs sampling version is not intractable. It would be interesting to see a case where the alternative is not available…