## Initializing adaptive importance sampling with Markov chains

Posted in Statistics on May 6, 2013 by xi'an

Another paper recently arXived by Beaujean and Caldwell elaborates on our population Monte Carlo papers (Cappé et al., 2005, Douc et al., 2007, Wraith et al., 2010) to design a more thorough starting distribution. Interestingly, the authors mention the fact that PMC is an EM-type algorithm to emphasize the importance of the starting distribution, as with a “poor proposal, PMC fails as proposal updates lead to a consecutively poorer approximation of the target” (p.2). I had not thought of this possible feature of PMC, which indeed proceeds along integrated EM steps, and thus could converge to a local optimum (albeit one no poorer than the starting point, since the Kullback-Leibler divergence decreases along the iterations).
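For readers unfamiliar with the mechanics, here is a minimal numpy sketch of one PMC iteration for a Gaussian mixture proposal, in the Rao-Blackwellised spirit of the papers cited above: importance weights from the current mixture feed an integrated EM-style update of the mixture parameters. The log_target function and the starting mixture (means, covs, weights) are placeholders of my own, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def pmc_update(log_target, means, covs, weights, n_samples=1000, rng=None):
    """One (sketched) PMC step: sample from the current Gaussian mixture
    proposal, compute normalised importance weights, and update the mixture
    by Rao-Blackwellised moment matching (an integrated EM step)."""
    rng = np.random.default_rng() if rng is None else rng
    K, d = means.shape
    # sample mixture components, then points from the selected Gaussians
    comp = rng.choice(K, size=n_samples, p=weights)
    x = np.array([rng.multivariate_normal(means[k], covs[k]) for k in comp])
    # per-component densities and mixture importance weights
    dens = np.column_stack([weights[k] * multivariate_normal(means[k], covs[k]).pdf(x)
                            for k in range(K)])
    log_w = np.array([log_target(xi) for xi in x]) - np.log(dens.sum(axis=1))
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Rao-Blackwellised responsibilities (E-step) and weighted updates (M-step)
    rho = dens / dens.sum(axis=1, keepdims=True)
    new_weights = (w[:, None] * rho).sum(axis=0)
    new_means = (w[:, None] * rho).T @ x / new_weights[:, None]
    new_covs = np.empty_like(covs)
    for k in range(K):
        diff = x - new_means[k]
        weighted = (w * rho[:, k])[:, None] * diff
        new_covs[k] = diff.T @ weighted / new_weights[k]
    return new_means, new_covs, new_weights / new_weights.sum()
```

If the starting mixture misses a relevant region of the target, the importance weights there are essentially zero and the update can only reinforce the mismatch, which is the point the authors make about poor proposals.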

The solution proposed in this paper is similar to the one we developed in our AMIS paper. An important part of the simulation effort is dedicated to the construction of the starting distribution, which is a mixture deduced from multiple Metropolis-Hastings runs. I find the method spends an unnecessarily long time refining this mixture by culling the number of components: off-the-shelf clustering techniques should be sufficient, especially if one considers that the value of the target is available at every simulated point. This has been my pet (if idle) theory for a long while: we do not take (enough) advantage of this informative feature in our simulation methods… I also find the Student’s t versus Gaussian kernel debate (p.6) somewhat superfluous: as we showed in Douc et al. (2007), we can process Student’s t distributions, so we may as well work with those. And rather worry about the homogeneity assumption this choice implies: working with any elliptically symmetric kernel assumes a local Euclidean structure on the parameter space, for all components, and does not properly model highly curved spaces. Another pet theory of mine. As for picking the necessary number of simulations at each PMC iteration, I would add to the ESS and the survival rate of the components a measure of the Kullback-Leibler divergence, as it should decrease at each iteration (with an infinite number of particles).
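As a concrete (and purely illustrative) version of those monitoring quantities, here is a short sketch computing the ESS, the component survival rate, and the normalised perplexity, the latter being a monotone transform of an estimate of the Kullback-Leibler divergence between target and proposal. It assumes the normalised log importance weights and the per-sample component responsibilities (the ρ matrix of a Rao-Blackwellised PMC update, as in the earlier sketch) are available; names and threshold are mine.

```python
import numpy as np

def pmc_diagnostics(log_w, responsibilities, survival_threshold=0.01):
    """Diagnostics after one PMC iteration: effective sample size, fraction
    of surviving mixture components, and normalised perplexity, which tends
    to 1 as the Kullback-Leibler divergence between target and proposal
    tends to 0."""
    w = np.exp(log_w - np.max(log_w))
    w /= w.sum()
    n = w.size
    ess = 1.0 / np.sum(w ** 2)                 # effective sample size
    comp_weights = responsibilities.T @ w      # updated component weights
    survival = np.mean(comp_weights > survival_threshold)
    entropy = -np.sum(w * np.log(w + 1e-300))  # Shannon entropy of the weights
    perplexity = np.exp(entropy) / n           # in (0, 1], grows as KL shrinks
    return {"ESS": ess, "survival_rate": survival, "perplexity": perplexity}
```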

Another interesting feature is the comparison with MultiNest, the current implementation of nested sampling developed by Farhan Feroz. This is the second time I read a paper involving nested sampling in the past two days. While this PMC implementation does better than nested sampling on the examples processed in the paper, the MultiNest outcome remains relevant, particularly because it handles multi-modality fairly well. The authors seem to think parallelisation is an issue with nested sampling, while I do not see why: at the most naïve level, several nested samplers can be run in parallel and the outcomes pooled together.
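For what it is worth, the naïve parallelisation I have in mind could look like the sketch below, where run_sampler is a hypothetical wrapper returning the log-evidence of one independent nested-sampling run (MultiNest or anything else); the pooled estimate is simply the average of the evidence estimates, computed on the log scale.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def pooled_evidence(run_sampler, n_runs=8, seed=0):
    """Naive parallelisation of nested sampling: launch independent runs and
    pool their evidence estimates. run_sampler(seed) is assumed to return an
    estimate of log Z from one run (a hypothetical, user-supplied wrapper)."""
    seeds = range(seed, seed + n_runs)
    with ProcessPoolExecutor() as pool:
        log_z = np.array(list(pool.map(run_sampler, seeds)))
    # average on the evidence scale, via log-sum-exp for numerical stability
    pooled = np.logaddexp.reduce(log_z) - np.log(n_runs)
    return pooled, log_z.std()
```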

## ISBA regional meeting in Varanasi (day 3)

Posted in pictures, Statistics, Travel, University life on January 11, 2013 by xi'an

On the last day (for me) of the ISBA meeting in Varanasi, I attended a few talks before being kindly driven to the airport (early, too early, but with the unpredictable traffic there, it was better to err on the side of caution!). In the dynamical model session, Simon Wilson presented a way to approximate posteriors for HMMs based on Chib’s (or Bayes’!) formula, while Jonathan Stroud discussed another approach to state-space model approximation involving a move of the state parameter based on a normal approximation of its conditional given the observable, an approximation which seemed acceptable for the cloud analysis model he was processing. Nicolas Chopin then gave a quick introduction to particle MCMC, all the way to SMC². (As a stern chairman of the session, I know Nicolas felt he did not have enough time, but he did a really good job of motivating those different methods, in particular in explaining why the auxiliary variable approach makes the unbiased estimator of the likelihood a valid MCMC method.) Peter Green’s plenary talk was about an emission tomography image analysis whose statistical processing turned into a complex (Bernstein-von Mises) convergence theorem (a preliminary version of which I had seen in Bristol during Natalia Bochkina’s talk).
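Since that auxiliary-variable argument is worth spelling out, here is a minimal sketch of a pseudo-marginal Metropolis-Hastings step (my own illustration, obviously not Nicolas’ slides): as long as log_lik_hat returns the log of a non-negative estimator that is unbiased on the likelihood scale, and as long as the current estimate stays attached to the current parameter value, the θ-marginal of the chain is the exact posterior. The proposal is assumed symmetric, and log_lik_hat, log_prior and proposal are user-supplied placeholders.

```python
import numpy as np

def pseudo_marginal_mh(log_lik_hat, log_prior, proposal, theta0, n_iter=5000, rng=None):
    """Pseudo-marginal Metropolis-Hastings: the intractable likelihood is
    replaced by a noisy but unbiased (on the likelihood scale) estimator,
    e.g. a particle filter estimate.  Recycling the current estimate makes
    (theta, estimate) an auxiliary-variable chain with the right marginal."""
    rng = np.random.default_rng() if rng is None else rng
    theta, log_lik = theta0, log_lik_hat(theta0, rng)
    chain = []
    for _ in range(n_iter):
        theta_prop = proposal(theta, rng)
        log_lik_prop = log_lik_hat(theta_prop, rng)    # fresh noisy estimate
        log_alpha = (log_lik_prop + log_prior(theta_prop)
                     - log_lik - log_prior(theta))
        if np.log(rng.uniform()) < log_alpha:
            theta, log_lik = theta_prop, log_lik_prop  # keep the accepted estimate
        chain.append(theta)
    return np.array(chain)
```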

Overall, as forewarned by and expected from the program, this ISBA meeting was of the highest scientific quality. (I only wish I had had Hindu god abilities to duplicate myself and attend several parallel sessions at the same time!) Besides, much besides!, the warm attention paid to everyone by the organisers was just simply un-be-lie-vable! The cultural program was on par with the scientific program. The numerous graduate students and faculty involved in the workshop organisation had a minute knowledge of our schedules and locations, and were constantly anticipating our needs and moves. Almost to a fault, i.e. to a point that was close to embarrassing for our cultural habits. I am therefore immensely grateful [personally and as former ISBA president] to all those people who contributed to the success of this ISBA meeting, and first and foremost to Professor Satyanshu Upadhyay, who worked relentlessly towards this goal for many months! (As a conference organiser, I realise I was and am simply unable to provide this level of welcome to the participants, even for much smaller meetings… The contrast with my previous conference in Berlin could not be more extreme: for a much higher registration fee, the return was very, very limited.) I will forever (at least until my next reincarnation!) keep the memory of this meeting as a very special one, quite besides its giving me the opportunity of a first visit to India.

## twisted filters

Posted in Statistics on October 6, 2012 by xi'an

Nick Whiteley (Bristol) and Anthony Lee (Warwick) just posted an interesting paper called ‘Twisted particle filters‘ on arXiv. (Presumably unintentionally, the title sounds like Twisted Sister, even though I never listened to this [particularly] heavy kind of hard rock! Twisting is customarily used in the large deviation literature.)

The twisted particle paper studies the impact of the choice of something similar to, if subtly different from, an importance function on the approximation of the marginal density (or evidence) for HMMs. (In essence, the standard particle filter is modified for only one particle in the population.) The core of the paper is to compare those importance functions in a fixed-N, large-n setting. As in simpler importance sampling cases, there exists an optimal if impractical choice of importance function, leading to a zero-variance estimator of the evidence. Nick and Anthony produce an upper bound on the general variance as well. One of the most appealing features of the paper is that the authors manage a convergence result in n rather than N. (Although the algorithms are obviously validated in the more standard large-N sense.)
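To fix ideas about the quantity being estimated, here is a minimal sketch of a bootstrap particle filter returning the standard evidence estimate for an HMM; the twisted filters of Whiteley and Lee modify the proposal mechanism of such a filter (for a single particle) to reduce the variance of exactly this estimate. The sample_x0, propagate, and log_obs_dens functions are generic placeholders for the model at hand, not notation from the paper.

```python
import numpy as np

def bootstrap_filter_evidence(y, sample_x0, propagate, log_obs_dens,
                              n_particles=100, rng=None):
    """Minimal bootstrap particle filter for an HMM, returning the standard
    estimate of the log marginal likelihood (evidence) of the sequence y."""
    rng = np.random.default_rng() if rng is None else rng
    N = n_particles
    x = sample_x0(N, rng)                      # initial particles
    log_evidence = 0.0
    for t, yt in enumerate(y):
        if t > 0:
            x = propagate(x, rng)              # move particles under the transition kernel
        log_w = log_obs_dens(yt, x)            # observation log-densities
        m = log_w.max()
        log_evidence += m + np.log(np.mean(np.exp(log_w - m)))
        w = np.exp(log_w - m)
        w /= w.sum()
        x = x[rng.choice(N, size=N, p=w)]      # multinomial resampling
    return log_evidence
```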

The paper is quite theoretical and dense (I was about to write heavy, in connection with the above!), with half its length dedicated to proofs. It relies on operator theory, with eigenfunctions behind the optimal filter, and is not unrelated to Pierre Del Moral’s work. (It took me a while to realise that the notation ω was not an elementary event from the σ-algebra but rather the realisation of the observed sequence! And I had to keep recalling that θ was the lag operator and not a model parameter [there is no model parameter].)

## Harmonic means, again again

Posted in Books, R, Statistics, University life on January 10, 2012 by xi'an

Another arXiv posting I had had no time to comment on is Nial Friel’s and Jason Wyse’s “Estimating the model evidence: a review“. This is a review in the spirit of two of our papers, “Importance sampling methods for Bayesian discrimination between embedded models” with Jean-Michel Marin (published in the Jim Berger Festschrift, Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger, but not mentioned in the review) and “Computational methods for Bayesian model choice” with Darren Wraith (referred to in the review). Indeed, it considers a series of competing computational methods for approximating the evidence, aka the marginal likelihood.

The paper correctly points out the difficulty with the naïve harmonic mean estimator. (But it does not cover the extension to the finite-variance solutions found in “Importance sampling methods for Bayesian discrimination between embedded models” and in “Computational methods for Bayesian model choice“.) It also misses the whole collection of bridge and umbrella sampling techniques covered in, e.g., Chen, Shao and Ibrahim (2000). In their numerical evaluations of the methods, the authors use the Pima Indian diabetes dataset we also used in “Importance sampling methods for Bayesian discrimination between embedded models“. The outcome is that the Laplace approximation does extremely well in this case (because the posterior is very close to normal), with Chib’s method a very near second. The harmonic mean estimator does extremely poorly (not a surprise!) and the nested sampling approximation is not as accurate as the other (non-harmonic) methods. If we compare with our 2009 study, importance sampling based on the normal approximation (almost the truth!) did best, followed by our harmonic mean solution based on the same normal approximation. (Chib’s solution was then third, with a standard deviation ten times larger.)
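For readers who have not met them, here is a sketch of the two extremes in that comparison: the naïve harmonic mean estimator, computed from the log-likelihood values at posterior draws, and the Laplace approximation, computed from the posterior mode and the Hessian of the log (unnormalised) posterior there. Both are generic textbook versions written for illustration, not the review’s code.

```python
import numpy as np

def harmonic_mean_log_evidence(log_lik_samples):
    """Naive harmonic mean estimator of the log evidence from posterior draws:
    log m(x) ~ -log mean(1 / f(x|theta_i)).  Consistent but typically of
    infinite variance, hence its poor behaviour."""
    neg = -np.asarray(log_lik_samples, dtype=float)
    m = neg.max()
    return -(m + np.log(np.mean(np.exp(neg - m))))

def laplace_log_evidence(log_post_unnorm, theta_hat, hessian):
    """Laplace approximation: log m(x) ~ log pi(theta_hat) f(x|theta_hat)
    + (d/2) log(2 pi) - (1/2) log det(-H), with H the Hessian of the log
    unnormalised posterior at the mode; accurate when the posterior is
    close to Gaussian, as in the Pima Indian example."""
    d = len(theta_hat)
    sign, logdet = np.linalg.slogdet(-np.asarray(hessian))
    return log_post_unnorm(theta_hat) + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet
```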

## Harmonic means, again

Posted in Statistics, University life on January 3, 2012 by xi'an

Over the non-vacation and the vacation breaks of the past weeks, I skipped a lot of arXiv postings. This morning, I took a look at “Probabilities of exoplanet signals from posterior samplings” by Mikko Tuomi and Hugh R. A. Jones. This is a paper to appear in Astronomy and Astrophysics, but the main point [to me] is the study of a novel approximation to the marginal likelihood. The authors propose what looks at first like defensive sampling: given a likelihood f(x|θ) and a corresponding Markov chain (θi), the approximation is based on the following importance sampling representation

$\hat m(x) = \sum_{i=h+1}^N \dfrac{f(x|\theta_i)}{(1-\lambda) f(x|\theta_i) + \lambda f(x|\theta_{i-h})} \Big/ \sum_{i=h+1}^N \dfrac{1}{(1-\lambda) f(x|\theta_i) + \lambda f(x|\theta_{i-h})}$

This is called a truncated posterior mixture approximation and, under closer scrutiny, it is not defensive sampling. Indeed, the second part of the denominators does not depend on the parameter θi; therefore, as far as importance sampling is concerned, it is a constant (if random) term! The authors impose a bounded parameter space for this reason, but I do not see why such an approximation converges. Except when λ=0, of course, which brings us back to the original harmonic mean estimator. (Properly rejected by the authors for having a very slow convergence. Or, more accurately, generally no stronger convergence than the law of large numbers.) Furthermore, the generic importance sampling argument does not work here since, if

$g(\theta) \propto (1-\lambda) \pi(\theta|x) + \lambda \pi(\theta_{i-h}|x)$

is the importance function, the ratio

$\dfrac{\pi(\theta_i)f(x|\theta_i)}{(1-\lambda) \pi(\theta_i|x) + \lambda \pi(\theta_{i-h}|x)}$

does not simplify… I do not understand either why the authors compare Bayes factor approximations based on this technique, on the harmonic mean version, or on Chib and Jeliazkov’s (2001) solution with both DIC and AIC, since the latter are not approximations to Bayes factors. I am therefore quite surprised at the paper being accepted for publication, given that the numerical evaluation shows the impact of the coefficient λ does not vanish with the number of simulations. (Which is logical, given the bias induced by the additional term.)
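For completeness, here is a minimal sketch of the truncated posterior mixture estimator as I read it, working on the log scale and using only the log-likelihood values stored along the Markov chain; it also makes explicit that λ=0 reduces the formula to the naïve harmonic mean estimator. The function name and arguments are mine, not the authors’.

```python
import numpy as np
from scipy.special import logsumexp

def truncated_posterior_mixture_log(log_lik, lam=0.5, h=1):
    """Truncated posterior mixture estimate of log m(x), computed from the
    log-likelihood values f(x|theta_i) along the chain.  Taking lam -> 0
    recovers the (naive) harmonic mean estimator."""
    ll = np.asarray(log_lik, dtype=float)
    cur, lag = ll[h:], ll[:-h]                 # f(x|theta_i) and f(x|theta_{i-h})
    # log of (1-lam) f(x|theta_i) + lam f(x|theta_{i-h})
    log_denom = logsumexp(np.column_stack([np.log1p(-lam) + cur,
                                           np.log(lam) + lag]), axis=1)
    # log[sum_i f(x|theta_i)/denom_i] - log[sum_i 1/denom_i]
    return logsumexp(cur - log_denom) - logsumexp(-log_denom)
```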