## PMC for combinatoric spaces

Posted in Statistics, University life with tags , , , , , , , on July 28, 2014 by xi'an

I received this interesting [edited] email from Xiannian Fan at CUNY:

I am trying to use PMC to solve Bayesian network structure learning problem (which is in a combinatorial space, not continuous space).

In PMC, the proposal distributions qi,t can be very flexible, even specific to each iteration and each instance. My problem occurs due to the combinatorial space.

For importance sampling, the requirement for proposal distribution, q, is:

support (p) ⊂ support (q)             (*)

For PMC, what is the support of the proposal distribution in iteration t? is it

support (p) ⊂ U support(qi,t)    (**)

or does (*) apply to every qi,t?

For continuous problem, this is not a big issue. We can use random walk of Normal distribution to do local move satisfying (*). But for combination search, local moving only result in finite states choice, just not satisfying (*). For example for a permutation (1,3,2,4), random swap has only choose(4,2)=6 neighbor states.

Fairly interesting question about population Monte Carlo (PMC), a sequential version of importance sampling we worked on with French colleagues in the early 2000’s.  (The name population Monte Carlo comes from Iba, 2000.)  While MCMC samplers do not have to cover the whole support of p at each iteration, it is much harder for importance samplers as their core justification is to provide an unbiased estimator to for all integrals of interest. Thus, when using the PMC estimate,

1/n ∑i,t {p(xi,t)/qi,t(xi,t)}h(qi,t),  xi,t~qi,t(x)

this estimator is only unbiased when the supports of the qi,t “s are all containing the support of p. The only other cases I can think of are

1. associating the qi,t “s with a partition Si,t of the support of p and using instead

i,t {p(xi,t)/qi,t(xi,t)}h(qi,t), xi,t~qi,t(x)

2. resorting to AMIS under the assumption (**) and using instead

1/n ∑i,t {p(xi,t)/∑j,t qj,t(xi,t)}h(qi,t), xi,t~qi,t(x)

but I am open to further suggestions!

## computational methods for statistical mechanics [day #2]

Posted in Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , on June 5, 2014 by xi'an

The last “tutorial” talk at this ICMS workshop ["at the interface between mathematical statistics and molecular simulation"] was given by Tony Lelièvre on adaptive bias schemes in Langevin algorithms and on the parallel replica algorithm. This was both very interesting because of the potential for connections with my “brand” of MCMC techniques and rather frustrating as I felt the intuition behind the physical concepts like free energy and metastability was almost within my reach! The most manageable time in Tony’s talk was the illustration of the concepts through a mixture posterior example. Example that I need to (re)read further to grasp the general idea. (And maybe the book on Free Energy Computations Tony wrote with Mathias Rousset et Gabriel Stoltz.) A definitely worthwhile talk that I hope will get posted on line by ICMS. The other talks of the day were mostly of a free energy nature, some using optimised bias in the Langevin diffusion (except for Pierre Jacob who presented his non-negative unbiased estimation impossibility result).

## Luke and Pierre at big’MC

Posted in Linux, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , on May 19, 2014 by xi'an

Yesterday, Luke Bornn and Pierre Jacob gave a talk at our big’MC ‘minar. While I had seen most of the slides earlier, either at MCMski IV,  Banff, Leuven or yet again in Oxford, I really enjoyed those talks as they provided further intuition about the techniques of Wang-Landau and non-negative unbiased estimators, leading to a few seeds of potential ideas for even more potential research. For instance, I understood way better the option to calibrate the Wang-Landau algorithm on levels of the target density rather than in the original space. Which means (a) a one-dimensional partition target (just as in nested sampling); (b) taking advantage of the existing computations of the likelihood function; and (b) a somewhat automatic implementation of the Wang-Landau algorithm. I do wonder why this technique is not more popular as a default option. (Like, would it be compatible with Stan?) The impossibility theorem of Pierre about the existence of non-negative unbiased estimators never ceases to amaze me. I started wondering during the seminar whether a positive (!) version of the result could be found. Namely, whether perturbations of the exact (unbiased) Metropolis-Hastings acceptance ratio could be substituted in order to guarantee positivity. Possibly creating drifted versions of the target…

One request in connection with this post: please connect the Institut Henri Poincaré to the eduroam wireless network! The place is dedicated to visiting mathematicians and theoretical physicists, it should have been the first one [in Paris] to get connected to eduroam. The cost cannot be that horrendous so I wonder what the reason is. Preventing guests from connecting to the Internet towards better concentration? avoiding “parasites” taking advantage of the network? ensuring seminar attendees are following the talks? (The irony is that Institut Henri Poincaré has a local wireless available for free, except that it most often does not work with my current machine. And hence wastes much more of my time as I attempt to connect over and over again while there.) Just in connection with IHP, a video of Persi giving a talk there about Poincaré, two years ago:

## speeding up MCMC

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , on May 13, 2014 by xi'an

Just before I left for Iceland, Matias Quiroz, Mattias Villani and Robert Kohn arXived a paper entitled “speeding up MCMC by efficient data subsampling”. Somewhat connected with the earlier papers by Koattikara et al., and Bardenet et al., both discussed on the ‘Og, the idea is to replace the log-likelihood by an unbiased subsampled version and to correct for the resulting bias of the exponentiation of this (Horwitz-Thompson or Hansen-Hurwitz) estimator. They ground their approach within the (currently cruising!) pseudo-marginal paradigm, even though their likelihood estimates are not completely unbiased. Since the optimal weights in the sampling step are proportional to the log-likelihood terms, they need to build a surrogate of the true likelihood, using either a Gaussian process or a spline approximation. This is all in all a very interesting contribution to the on-going debate about increasing MCMC speed when dealing with large datasets and ungainly likelihood functions. The proposed solution however has a major drawback in that the entire dataset must be stored at all times to ensure unbiasedness. For instance, the paper considers a bivariate probit model with a sample of 500,000 observations. Which must be available at all times.  Further, unless I am confused, the subsampling step requires computing the surrogate likelihood for all observations, before running the subsampling step, another costly requirement.

## MCqMC 2014 [day #1]

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , on April 9, 2014 by xi'an

As I have been kindly invited to give a talk at MCqMC 2014, here am I. in Leuven, Belgium, for this conference I have never attended before. (I was also invited for MCqMC 2012 in Sydney The talk topics and the attendees’ “sociology” are quite similar to those of the IMACS meeting in Annecy last summer. Namely, rather little on MCMC, particle filters, and other tools familiar in Bayesian computational statistics, but a lot on diffusions and stochastic differential equations and of course quasi-Monte Carlo methods. I thus find myself at a boundary of the conference range and a wee bit lost by some talks, which even titles make little sense to me.

For instance, I have trouble to connect with multi-level Monte Carlo within my own referential. My understanding of the method is one of a control variate version of tempering, namely of using a sequence of approximations to the true target and using rougher approximations as control variates for the finer approximations. But I cannot find on the Web a statistical application of the method outside of diffusions and SDEs, i.e. outside of continuous time processes… Maybe using a particle filter from one approximation to the next, down in terms of roughness, could help.

“Several years ago, Giles (2008) introduced an intriguing multi-level idea to deal with such biased settings that can dramatically improve the rate of convergence and can even, in some settings, achieve the canonical “square root” convergence rate associated with unbiased Monte Carlo.” Rhee and Glynn, 2012

Those were my thoughts before lunchtime. today (namely April 7, 2014). And then, after lunch, Peter Glynn gave his plenary talk that just answered those questions of mine’s!!! Essentially, he showed that formula Pierre Jacob also used in his Bernoulli factory paper to transform a converging-biased-into-an-unbiased estimator, based on a telescopic series representation and a random truncation… This approach is described in a paper with Chang-han Rhee, arXived a few years ago. The talk also covered more recent work (presumably related with Chang-han Rhee’s thesis) extending the above to Markov chains. As explained to me later by Pierre Jacob [of Statisfaction fame!], a regular chain does not converge fast enough to compensate for the explosive behaviour of the correction factor, which is why Rhee and Glynn used instead a backward chain, linking to the exact or perfect samplers of the 1990’s (which origin can be related to a 1992 paper of Asmussen, Glynn and Thorisson). This was certainly the most riveting talk I attended in the past years in that it brought a direct answer to a question I was starting to investigate. And more. I was also wondering how connected it was with our “exact” representation of the stationary distribution (in an Annals of Probability paper with Jim Hobert).   Since we use a stopping rule based on renewal and a geometric waiting time, a somewhat empirical version of the inverse probability found in Peter’s talk. This talk also led me to re-consider a recent discussion we had in my CREST office with Andrew about using square root(ed) importance weights, since one of Peter’s slides exhibited those square roots as optimal. Paradoxically, Peter started the talk by down-playing it, stating there was a single idea therein and a single important slide, making it a perfect after-lunch talk: I wish I had actually had thrice more time to examine each slide! (In the afternoon session, Éric Moulines also gave a thought-provoking talk on particle islands and double bootstrap, a research project I will comment in more detail the day it gets arXived.)

## Advances in scalable Bayesian computation [day #1]

Posted in Books, Mountains, pictures, R, Statistics, University life with tags , , , , , , , , , on March 4, 2014 by xi'an

This was the first day of our workshop Advances in Scalable Bayesian Computation and it sounded like the “main” theme was probabilistic programming, in tune with my book review posted this morning. Indeed, both Vikash Mansinghka and Frank Wood gave talks about this concept, Vikash detailing the specifics of a new programming language called Venture and Frank focussing on his state-space version of the above called Anglican. This is a version of the language Church, developed to handle probabilistic models and inference (hence the joke about Anglican, “a Church of England Venture’! But they could have also added that Frank Wood was also the name of a former archbishop of Melbourne..!) I alas had an involuntary doze during Vikash’s talk, which made it harder for me to assess the fundamentals of those ventures, of how they extended beyond a “mere” new software (and of why I would invest in learning a Lisp-based language!).

The other talks of Day #1 were of a more “classical” nature with Pierre Jacob explaining why non-negative unbiased estimators were impossible to provide in general, a paper I posted about a little while ago, and including an objective Bayes example that I found quite interesting. Then Sumeet Singh (no video) presented a joint work with Nicolas Chopin on the uniform ergodicity of the particle Gibbs sampler, a paper that I should have commented here (except that it appeared just prior to The Accident!), with a nice coupling proof. And Maria Lomeli gave us an introduction to the highly general Poisson-Kingman mixture models as random measures, which encompasses all of the previously studied non-parametric random measures, with an MCMC implementation that included a latent variable representation for the alpha-stable process behind the scene, representation that could be (and maybe is) also useful in parametric analyses of alpha-stable processes.

We also had an open discussion in the afternoon that ended up being quite exciting, with a few of us voicing out some problems or questions about existing methods and others making suggestions or contradictions. We are still a wee bit short of considering a collective paper on MCMC under constraints with coherent cross-validated variational Bayes and loss-based pseudo priors, with applications to basketball data” to appear by the end of the week!

Add to this two visits to the Sally Borden Recreation Centre for morning swimming and evening climbing, and it is no wonder I woke up a bit late this morning! Looking forward Day #2!

## the alive particle filter

Posted in Books, Statistics, University life with tags , , , , , , , , on February 14, 2014 by xi'an

As mentioned earlier on the ‘Og, this is a paper written by Ajay Jasra, Anthony Lee, Christopher Yau, and Xiaole Zhang that I missed when it got arXived (as I was also missing my thumb at the time…) The setting is a particle filtering one with a growing product of spaces and constraints on the moves between spaces. The motivating example is one of an ABC algorithm for an HMM where at each time, the simulated (pseudo-)observation is forced to remain within a given distance of the true observation. The (one?) problem with this implementation is that the particle filter may well die out by making only proposals that stand out of the above ball. Based on an idea of François Le Gland and Nadia Oudjane, the authors define the alive filter by imposing a fixed number of moves onto the subset, running a negative binomial number of proposals. By removing the very last accepted particle, they show that the negative binomial experiment allows for an unbiased estimator of the normalising constant(s). Most of the paper is dedicated to the theoretical study of this scheme, with results summarised as (p.2)

1. Time uniform Lp bounds for the particle filter estimates
2. A central limit theorem (CLT) for suitably normalized and centered particle filter estimates
3. An unbiased property of the particle filter estimates
4. The relative variance of the particle filter estimates, assuming N = O(n), is shown to grow linearly in n.

The assumptions necessary to reach those impressive properties are fairly heavy (or “exceptionally strong” in the words of the authors, p.5): the original and constrained transition kernels are both uniformly ergodic, with equivalent coverage of the constrained subsets for all possible values of the particle at the previous step. I also find the proposed implementation of the ABC filtering inadequate for approximating the posterior on the parameters of the (HMM) model. Expecting every realisation of the simulated times series to be close to the corresponding observed value is too hard a constraint. The above results are scaled in terms of the number N of accepted particles but it may well be that the number of generated particles and hence the overall computing time is much larger. In the examples completing the paper, the comparison is run with an earlier ABC sampler based on the very same stepwise proximity to the observed series so the impact of this choice is difficult to assess and, furthermore, the choice of the tolerances ε is difficult to calibrate: is 3, 6, or 12 a small or large value for ε? A last question that I heard from several sources during talks on that topic is why an ABC approach would be required in HMM settings where SMC applies. Given that the ABC reproduces a simulation on the pair latent variables x parameters, there does not seem to be a gain there…