## Markov chain importance sampling

Posted in Books, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , on May 31, 2018 by xi'an

Ingmar Schuster (formerly a postdoc at Dauphine and now in Freie Universität Berlin) and Ilja Klebanov (from Berlin) have recently arXived a paper on recycling proposed values in [a rather large class of] Metropolis-Hastings and unadjusted Langevin algorithms. This means using the proposed variates of one of these algorithms as in an importance sampler, with an importance weight going from the target over the (fully conditional) proposal to the target over the marginal stationary target. In the Metropolis-Hastings case, since the later is not available in most setups, the authors suggest using a Rao-Blackwellised nonparametric estimate based on the entire MCMC chain. Or a subset.

“Our estimator refutes the folk theorem that it is hard to estimate [the normalising constant] with mainstream Monte Carlo methods such as Metropolis-Hastings.”

The paper thus brings an interesting focus on the proposed values, rather than on the original Markov chain,  which naturally brings back to mind the derivation of the joint distribution of these proposed values we made in our (1996) Rao-Blackwellisation paper with George Casella. Where we considered a parametric and non-asymptotic version of this distribution, which brings a guaranteed improvement to MCMC (Metropolis-Hastings) estimates of integrals. In subsequent papers with George, we tried to quantify this improvement and to compare different importance samplers based on some importance sampling corrections, but as far as I remember, we only got partial results along this way, and did not cover the special case of the normalising constant Þ… Normalising constants did not seem such a pressing issue at that time, I figure. (A Monte Carlo 101 question: how can we be certain the importance sampler offers a finite variance?)

I think this is interesting future work. My intuition is that for Metropolis-Hastings importance sampling with random walk proposals, the variance is guaranteed to be finite because the importance distribution ρ_θ is a convolution of your target ρ with the random walk kernel q. This guarantees that the tails of ρ_θ are no lighter than those of ρ. What other forms of q mean for the tails of ρ_θ I have less intuition about.

When considering the Langevin alternative with transition (4), I was first confused and thought it was incorrect for moving from one value of Y (proposal) to the next. But that’s what unadjusted means in “unadjusted Langevin”! As pointed out in the early Langevin literature, e.g., by Gareth Roberts and Richard Tweedie, using a discretised Langevin diffusion in an MCMC framework means there is a risk of non-stationarity & non-ergodicity. Obviously, the corrected (MALA) version is more delicate to approximate (?) but at the very least it ensures the Markov chain does not diverge. Even when the unadjusted Langevin has a stationary regime, its joint distribution is likely quite far from the joint distribution of a proper discretisation. Now this also made me think about a parameterised version in the 1996 paper spirit, but there is nothing specific about MALA that would prevent the implementation of the general principle. As for the unadjusted version, the joint distribution is directly available.  (But not necessarily the marginals.)

Personally, I think the most interesting part is the practical performance gain in terms of estimation accuracy for fixed CPU time, combined with the convergence guarantee from the CLT. ULA was particularly important to us because of the papers of Arnak Dalalyan, Alain Durmus & Eric Moulines and recently from Mike Jordan’s group, which all look at an unadjusted Langevin diffusion (and unimodal target distributions). But MALA admits a Metropolis-Hastings importance sampling estimator, just as Random Walk Metropolis does – we didn’t include MALA in the experiments to not get people confused with MALA and ULA. But there is no delicacy involved whatsoever in approximating the marginal MALA proposal distribution. The beauty of our approach is that it works for almost all Metropolis-Hastings algorithms where you can evaluate the proposal density q, there is no constraint to use random walks at all (we will emphasize this more in the paper).

## ABC for missing data

Posted in Statistics, University life with tags , , , , , , on February 2, 2012 by xi'an

I received this email a few days ago:

I am an hard-core reader of your blog and thanks to your posts I have recently started being interested to ABC (and Bayesian methods in general). I am writing you to ask for suggestions on the application of the semi-automatic ABC à la Fearnhead & Prangle. The reason why I am contacting you instead of addressing the authors is because (i) you have been involved in the RSS reading of their paper and (ii) you are an authority on ABC, and therefore you are probably best suited and less biased on such issue. I am applying ABC with the semi-automatic statistics selection provided in Fearnhead and Prangle (2012) to a problem which can be formalized as a hidden Markov model. However I am unsure of whether I am making a huge mistake on the following point: let’s suppose we have an unobserved (latent) system state X (depending on an unknown parameter θ) and a corresponding “corrupted” version which is observed with some measurement error, e.g.

Y = X + ε,

where ε is the measurement error independent of X, ε is N(0, σ²), say. Now their setup does not consider measurement error, so I wonder if the following is correct. Since I can simulate n times some X’ from p(X|θ) am I allowed to use the corresponding “simulated” n corrupted measurements

Y’ = X’ + ε’

(where ε’ is a draw from p(ε|σ)) into their regression approach to determine a (vector of) summary statistic S=(S1,S2) for (θ,σ)? I mean the Y’ are draws from a p(y|X’,θ,σ) which is conditional on X’. Is this allowed? Wilkinson (2008) is the only reference I have found considering ABC with measurement-error (the ones by Dean et al (2011) and Jasra et al (2011) being too technical in my opinion to allow a practical implementation) and since he does not consider a summary statistics-based approach (e.g. Algorithm D, page 10) of course he is not in need to simulate the corrupted measurements but only the latent ones. Therefore I am a bit unsure on whether it is statistically ok to simulate Y’ conditionally on X’ or if there is some theoretical issue that makes this nonsense.

to which I replied

…about your model and question, there is no theoretical difficulty in simulating x’ then y’given x’ and a value of the parameters. The reason is that

$y' \sim \int f(y',x'|\theta) \text{d}x' = f(y'|\theta)$

.the proper marginal as defined by the model. Using the intermediate x’ is a way to bypass the integral but this is 100% correct!…

a reply followed by a further request for precision

Although your equation is clearly true, I am not sure I fully grasp the issue, so I am asking for confirmation. Yes, as you noticed I need a

y’ ~ f(y’|θ,σ)

Now it’s certainly true that I can generate a draw x’ from f(x’|θ,σ) and then plug such x’ into f(y’|x’,θ,σ) to generate y’. By proceeding this way I obtain a draw (y’,x’) from f(y’,x’|θ,σ). I think I understood your reasoning, on how by using the procedure above I am actually skipping the computation of the integral in:

$\int f(y',x'|\theta, \sigma) \text{d}x'.$

Is it basically the case that the mechanism above is just a classic simulation from a bivariate distribution, where since I am interested in the marginal f(y’|θ,σ) I simulate from the joint density f(y’,x’|θ,σ) and then discard the x’ output?

which is indeed a correct interpretation. When simulating from a joint, the marginals are by-products of the simulation.