Barker (from the lovely city of Dunedin) and Link published a paper in the American Statistician last September that I missed, as I missed their earlier email about the paper since it arrived The Day After… The paper is about a new specification of RJMCMC, almost twenty years after Peter Green’s (1995) introduction of the method. The authors use the notion of a palette, “from which all model specific parameters can be calculated” (in a deterministic way). One can see the palette ψ as an intermediary step in the move between two models. This reduces the number of bijections, if not the construction of the dreaded Jacobians!, but forces the construction of pseudo-priors on the unessential parts of ψ for every model. Because the dimension of ψ is fixed, a Gibbs sampling interleaving model index and palette value is then implementable. The conditional of the model index given the palette is available provided there are not too many models under competitions, with the probabilities recyclable towards a Rao-Blackwell approximation of the model probability. I wonder at whether or not another Rao-Blackwellisation is possible, namely to draw from all the simulated palettes a sample for the parameter of an arbitrarily chosen model.
Archive for Rao-Blackwellisation
Perrakis, Ntzoufras, and Tsionas just arXived a paper on marginal likelihood (evidence) approximation (with the above title). The idea behind the paper is to base importance sampling for the evidence on simulations from the product of the (block) marginal posterior distributions. Those simulations can be directly derived from an MCMC output by randomly permuting the components. The only critical issue is to find good approximations to the marginal posterior densities. This is handled in the paper either by normal approximations or by Rao-Blackwell estimates. the latter being rather costly since one importance weight involves B.L computations, where B is the number of blocks and L the number of samples used in the Rao-Blackwell estimates. The time factor does not seem to be included in the comparison studies run by the authors, although it would seem necessary when comparing scenarii.
After a standard regression example (that did not include Chib’s solution in the comparison), the paper considers 2- and 3-component mixtures. The discussion centres around label switching (of course) and the deficiencies of Chib’s solution against the current method and Neal’s reference. The study does not include averaging Chib’s solution over permutations as in Berkoff et al. (2003) and Marin et al. (2005), an approach that does eliminate the bias. Especially for a small number of components. Instead, the authors stick to the log(k!) correction, despite it being known for being quite unreliable (depending on the amount of overlap between modes). The final example is Diggle et al. (1995) longitudinal Poisson regression with random effects on epileptic patients. The appeal of this model is the unavailability of the integrated likelihood which implies either estimating it by Rao-Blackwellisation or including the 58 latent variables in the analysis. (There is no comparison with other methods.)
In the latest issue of Statistics and Computing (2013, Issue 23, pages 577-587), Iliopoulos and Malefaki published a paper that relates to our vanilla Rao-Blackwellisation paper with Randal Douc. The idea is to derive another approximation to the ideal importance sampling weight using the “accepted” Markov chain. (With Randal, we had a Bernoulli factory representation.) The density g(x) of the accepted chain being unknown; it is represented as the expectation under π of the function
and hence estimated by a self-normalised average based on the whole Markov chain. This means the resulting importance estimate uses twice the output of the algorithm and that it is biased and of order O(n²), thus the same order as our original Rao-Blackwellised estimator (Robert & Casella, 1996)… This also means convergence and CLT are very hard to establish: the main convergence theorem of the paper holds only for finite state spaces, with a surprising smaller asymptotic variance for this self-normalised average than for the ideal importance sampling estimator in the independent Metropolis-Hastings case. (Both are biased by being self-normalised and the paper does not consider the magnitude of those biases.)
Interestingly, the authors also ran a comparison with our parallelised Rao-Blackwellised version (with Pierre Jacob and Murray Smith), but conclude (P.58) at a larger CPU (should be GPU!!) required by the parallelisation, which does not really make sense: when compared with the plain Metropolis-Hastings implementation, run on a single processor, the parallel version only requires an extra random permutation per thread or per processor. I thus suspect a faulty implementation that induces this CPU being linear in the size of the blocks, like maybe only saving one output per block… Also interestingly, the paper re-analyses the Pima Indian probit model Jean-Michel Marin and I (and many others) used as benchmark in several of our papers. As in the most standard examples, the outcome shows a mild reduction in variance when using this estimated importance sampling version. Maybe a comparison with the ideal importance sampler (i.e. the one that does not divide by the sum of the weights since using normalised versions of the target and importance densities) would have helped in the comparison.
Here is the new version of the talk:
And I had a fairly interesting day at the conference, from Randal’s talk on hidden Markov models with finite valued observables to the two Terrys invited session (Terry Lyons vs. Terry Speed) to the machine learning session organised by a certain Michal Jordan (on the program) that turned out to be Michael Jordan (with a talk on the fusion between statistics and computability). A post-session chat with Terry Lyons also opened (to me) new perspectives on data summarisation. (And we at last managed to get a convergence result using a Rao-Blackwellisation argument!) Plus, we ended up the day in a nice bistrot called Zeller with an awfully friendly staff cooking family produces and serving fruity family wines and not yet spoiled by being ranked #1 on tripadvisor (but visibly attracting a lot of tourists like us).
In the train to Annecy, I read the recently arXived paper by my former PhD student Pierre Jacob (now at NUS), along with Lawrence Murray (Perth), and Sylvain Rubenthaler (Nice), where they obtain precise degeneracy rates of the regular particle filter applied to hidden Markov models with a compact observation space, precise enough to consider storing the entire paths at a linear occupancy rate. Interestingly, the distance to the most common ancestor is of order N log N, if N is the number of particles. And the number of nodes is O(N log N) as well. This means indeed that the whole paths can be stored, which offers a lot of potential in terms of Rao-Blackwellisation and parallelisation. I was first bemused by a silly misunderstanding about the purpose of the algorithm: it is directed at inference at the current time index, not over the whole past and not over the parameters of the model for, else how could we consider the algorithm has converged when it degenerates to a single path at some finite horizon the past? Pierre has also written a further comment of the paper on Statistfaction.