yet more questions about Monte Carlo Statistical Methods

Posted in Books, Statistics, University life with tags , , , , , , , , , , on December 8, 2011 by xi'an

As a coincidence, here is the third email I this week about typos in Monte Carlo Statistical Method, from Peng Yu this time. (Which suits me well in terms of posts as  I am currently travelling to Provo, Utah!)

I’m reading the section on importance sampling. But there are a few cases in your book MCSM2 that are not clear to me.

On page 96: “Theorem 3.12 suggests looking for distributions g for which |h|f/g is almost constant with finite variance.”

What is the precise meaning of “almost constant”? If |h|f/g is almost constant, how come its variance is not finite?

“Almost constant” is not a well-defined property, I am afraid. By this sentence on page 96 we meant using densities g that made |h|f/g as little varying as possible while being manageable. Hence the insistence on the finite variance. Of course, the closer |h|f/g is to a constant function the more likely the variance is to be finite.

“It is important to note that although the finite variance constraint is not necessary for the convergence of (3.8) and of (3.11), importance sampling performs quite poorly when (3.12) ….”

It is not obvious to me why when (3.12) importance sampling performs poorly. I might have overlooked some very simple facts. Would you please remind me why it is the case? From the previous discussion in the same section, it seems that h(x) is missing in (3.12). I think that (3.12) should be (please compare with the first equation in section 3.3.2)

$\int h^2(x) f^2(x) / g(x) \text{d}x = + \infty$

The preference for a finite variance of f/g and against (3.12) is that we would like the importance function g to work well for most integrable functions h. Hence a requirement that the importance weight f/g itself behaves well. It guarantees some robustness across the h‘s and also avoids checking for the finite variance (as in your displayed equation) for all functions h that are square-integrable against g, by virtue of the Cauchy-Schwarz inequality.

Alternative evidence estimation in phylogenetics

Posted in Statistics with tags , , , , , on January 21, 2010 by xi'an

An alternative marginal likelihood estimator for phylogenetic models is the title of the paper recently posted by Arima and Tardella on arXiv. While working on phylogenetic trees, it does not consider an ABC approach for computing the evidence but instead relies on harmonic mean estimators (since thermodynamic alternatives take too long). It is based upon the harmonic mean estimator of the marginal (already discussed in this post), namely which

$\mathbb{E}\left[ \dfrac{\varphi(\theta)}{\pi(\theta) L(\theta|x)} \mid x \right] = \dfrac{1}{m(x)}$

holds for any density φ. As Arima and Tardella point out, the special case when φ is the prior (Newton and Raftery, 1994, JRSS Series B) is still the one used in phylogenetic softwares like MrBayes, PHASE and BEAST, despite clear warnings about its unreliability! In An alternative marginal likelihood estimator for phylogenetic models, the authors propose a choice of φ that avoids the infinite variance problem of the original harmonic mean estimator. They perturbate the unormalised posterior distribution, of the kind (in one dimension, when the mode is near zero)

$\tilde\pi(\theta|x)$ into $\tilde\pi_P(\theta|x)=\begin{cases}\tilde\pi(\theta+\delta) &\text{if }\theta<-\delta\\ \tilde\pi(\theta-\delta) &\text{if }\theta>-\delta\\ \tilde\pi(0) &\text{otherwise}\end{cases}$

by pulling the density up by a controlled amount of $2\delta\tilde\pi(0|x)$. I find it interesting if only because it goes against my own intuition (and solution) of removing mass from the posterior by concentrating around the mode. The variance of the corresponding harmonic mean estimator is finite if the tails of $\tilde\pi(\theta|x)$ do not decrease too rapidly, I think. The remainder of the paper studies the impact of this approximation technique in the setting of phylogenic models for comparing trees.

A vanilla Rao-Blackwellisation (2)

Posted in Statistics with tags , , , , on April 16, 2009 by xi'an

Following yesterday’s post on our vanilla Rao-Blackwellisation, Nicolas Chopin pointed out to me today a related paper by Hugo Hammer and Hakon Tjelmeland, Control Variates for the Metropolis–Hastings Algorithm, that was published last year in the Scandinavian Journal of Statistics. The approach they adopt is to introduce control variates based on the Metropolis-Hastings Markov chain as well as on the proposed moves. Those control variates are based on zero-mean functions

$g(x_t,y_t) = w_1(x_t,y_t) f(x_t) +w_2(x_t,y_t) f(y_t)$

with

$w_1(x,y)\pi(x)q(x,y) = -w_2(x,y)\pi(y)q(y,x)$

which leads to both the Barker (1965) and the traditional $\alpha(x,y)$ acceptance probabilities. The main connection with our vanilla Rao-Blackwellisation paper is that the computation burden of using control variates is also of order $\text{O}(N)$, because the variance improvement brought by the control variate technique [as opposed to Rao-Blackwellisation] is difficult to assess theoretically when the optimal regression coefficient in the control variate estimate is approximated by the empirical correlation across MCMC iterations. Further, the control variate improvement is “local” in that it only depends on $(x_t,y_t)$ at iteration t rather than on the sequence of proposed values as in Rao-Blackwellization of sampling schemes or on the sequence of accepted values and rejected proposals as in the vanilla Rao-Blackwellisation paper. An interesting application of the paper deals with reversible jump settings since both the control variate and the vanilla Rao-Blackwellisation techniques apply to functions of the Markov chain that remain defined across models.

Arnaud Guillin also kindly pointed out a typo in the Metropolis-Hastings acceptance probability, typo that now stands corrected on arXiv!

A vanilla Rao-Blackwellisation

Posted in Statistics with tags , , , , on April 15, 2009 by xi'an

When we wrote Rao-Blackwellization of sampling schemes with George Casella, in 1996, which still is one of my favourite papers!, we used a [latent variable] representation of the standard average as a function of the proposed moves

$\frac{1}{N} \sum_{t=1}^N \sum_{j=1}^t h(y_j) \mathbb{I}(x_t=y_j)$

that could be integrated in the accepted $x_t$‘s given the proposed $y_j$‘s. Taking this conditional expectation is always possible, which made the result so exciting!, but also costly, since all possible histories of acceptance must be considered, unfortunately leading to a tree-like complexity of order $\text{O}(N^2)$. Hence a lack of followers, despite a high number of citations (248 recorded by Google Scholar).

In this new version, written with Randal Douc and now posted on arXiv, we found—during this inspiring MCMC meeting at the University of Warwick—a different representation of the Metropolis-Hastings estimators that allows for a conditioning only on the accepted moves

$\frac{1}{N} \sum_{j=1}^M h(z_j) \sum_{t=0}^\infty \prod_{m=1}^{t-1}\mathbb{I}\{u_{jm}\ge \alpha(z_j,y_{jm})\}$

where the $z_j$‘s are the accepted $y_t$‘s, $M$ is the number of accepted $y_t$‘s, the $y_{jt}$‘s are simulated from the proposal $q(z_j,y)$ and $\alpha(x,y)$ is the Metropolis-Hastings acceptance probability. While both representations give identical estimators, integrating out the $u_{jt}$‘s in the above leads to a different and much simpler Rao-Blackwellised version,

$\frac{1}{N} \sum_{j=1}^M h(z_j) \sum_{t=0}^\infty \prod_{m=1}^{t-1}\{1- \alpha(z_j,y_{jm})\}$

with a computing cost that remains of the same order (and which, further, can be controlled as explained in the paper). While the derivation of this Rao-Blackwellised estimator is straightforward, establishing the (asymptotic) domination of the Rao-Blackwellised version is harder than in the 1996 Biometrika paper, since the conditional expectations are conducted conditional on all $z_j$‘s, which means the Rao-Blackwellised estimator is a sum of conditional expecttations and this makes the integrated terms correlated which requires much more complex convergence theorems… Note also that, since we condition at an earlier stage, the practical improvement brought by this “vanilla” version is unsurprisingly more limited than in the Biometrika paper. But the nice thing about this paper is that it is basically hassle-free: it applies in every situation the Metropolis-Hastings algorithm is used and only requires to keep track of the proposed moves (plus some more)…