Sampling latent states for high-dimensional non-linear state space models with the embedded HMM method

Posted in Books, pictures, Statistics, University life on March 17, 2016 by xi'an

Previously, I posted a comment on a paper by Alex Shestopaloff and Radford Neal, after my visit to Toronto two years ago, using a particular version of ensemble Monte Carlo. A new paper by the same authors was recently arXived, as a refinement of the embedded HMM paper of Neal (2003), in that the authors propose a new and more efficient way to generate from the (artificial) embedded hidden Markov sampler that is central to their technique of propagating a set of pool states. The method exploits both forward and backward representations of HMMs in an alternating manner and propagates the pool states from one observation time to the next. The paper also exploits latent Gaussian structures to make autoregressive proposals, as well as flip proposals from x to -x [which seem to only make sense when 0 is a central value for the target, i.e. when the observables y only depend on |x|]. All those modifications bring the proposal quite close to (backward) particle Gibbs, the difference being in using Metropolis rather than importance steps, and in an improvement brought by the embedded HMM approach, even though it is always delicate to generalise those comparisons when some amount of calibration is required by both algorithms under comparison. (Especially delicate when it is rather remote from my area of expertise!) Anyway, I am still intrigued [in a positive way] by the embedded HMM idea as it remains mysterious that a finite length HMM simulation can improve the convergence performance that much. And I wonder at a potential connection with an earlier paper of Anthony Lee and Krys Latuszynski using a random number of auxiliary variables. Presumably a wrong impression from a superficial memory…

standard distributions

Posted in Books, Kids, Statistics on February 5, 2016 by xi'an

Joram Soch managed to get a short note arXived about the Normal cdf Φ by exhibiting an analytical version, nothing less!!! By which he means a power series representation of that cdf. This is an analytical [if known] function in the complex calculus sense but I wonder at the point of the (re)derivation. (I do realise that something’s wrong on the Internet is not breaking news!)
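To make the notion of a power series for Φ concrete, here is a minimal sketch in Python. It uses the textbook Maclaurin expansion Φ(x) = 1/2 + (2π)^(-1/2) Σ (-1)ⁿ x^(2n+1) / (n! 2ⁿ (2n+1)), which is not necessarily the series derived in the note; the function name and the truncation at 60 terms are my own choices for illustration.

```python
import math

def normal_cdf_series(x, terms=60):
    """Truncated Maclaurin series of the standard Normal cdf (illustrative sketch)."""
    total = 0.0
    term = x  # holds (-1)^n x^(2n+1) / (n! 2^n), starting at n = 0
    for n in range(terms):
        total += term / (2 * n + 1)
        term *= -x * x / (2.0 * (n + 1))  # move from index n to n + 1
    return 0.5 + total / math.sqrt(2.0 * math.pi)

def normal_cdf_erf(x):
    """Reference value through the error function of the standard library."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

The truncated series agrees with the erf-based value for moderate |x|, while the alternating terms make it numerically fragile far in the tails, which is one reason such a series is of limited practical use.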

Somewhat tangentially, this reminds me of a paper I read recently where the Geometric Geo(p) distribution was represented as the sum of two independent variates, namely a Bernoulli B(p/(1+p)) variate and twice a Geometric G(p²) variate. A formula that can be iterated arbitrarily many times, meaning that a Geometric variate is an infinite sum of [powers of two] weighted Bernoulli variates. I like this representation very much (although it may well have been known for quite a while). However I fail to see how to take advantage of it for simulation purposes. Unless the number of terms in the sum can be determined first. And even then it would be less efficient than simulating a single Geometric…
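The decomposition can be checked numerically with a short Python sketch. It assumes the parameterisation P(X=k) = (1-p)pᵏ on k = 0, 1, 2, …, under which the parity of X is Bernoulli with success probability p/(1+p) and the integer part of X/2 is Geometric with parameter p²; the function names and the truncation at 30 binary levels are hypothetical choices of mine.

```python
import random

def geometric_pmf(p, k):
    """P(X = k) = (1 - p) * p**k on k = 0, 1, 2, ... (assumed parameterisation)."""
    return (1.0 - p) * p ** k

def split_pmf(p, k):
    """P(B + 2G = k) for independent B ~ Bernoulli(p/(1+p)) and G ~ Geo(p^2)."""
    b, g = k % 2, k // 2  # the only pair (b, g) with b + 2g = k
    prob_b = p / (1.0 + p) if b == 1 else 1.0 / (1.0 + p)
    return prob_b * geometric_pmf(p * p, g)

def sample_by_bits(p, levels=30, rng=random):
    """Truncated iteration: sum over j of 2**j * Bernoulli(q_j / (1 + q_j)), q_j = p**(2**j)."""
    total, q = 0, p
    for j in range(levels):
        if rng.random() < q / (1.0 + q):
            total += 2 ** j
        q = q * q  # squaring p at each level of the iteration
    return total
```

The sampler spends one uniform per binary level, so it indeed looks less efficient than inverting the Geometric cdf with a single uniform.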

importance sampling with infinite variance

Posted in pictures, R, Statistics, University life on November 13, 2015 by xi'an

“In this article it is shown that in a fairly general setting, a sample of size approximately exp(D(μ|ν)) is necessary and sufficient for accurate estimation by importance sampling.”

Sourav Chatterjee and Persi Diaconis arXived yesterday an exciting paper where they study the proper sample size in an importance sampling setting with no variance. That’s right, with no variance. They give as a starting toy example the use of an Exp(1) proposal for an Exp(1/2) target, where the importance ratio exp(x/2)/2 has no ξ order moment (for ξ≥2). So the infinity in the variance is somehow borderline in this example, which may explain why the estimator could be considered to “work”. However, I disagree with the statement “that a sample size a few thousand suffices” for the estimator of the mean to be close to the true value, that is, 2. For instance, the picture I drew above is the superposition of 250 sequences of importance sampling estimators across 10⁵ iterations: several sequences show huge jumps, even for a large number of iterations, which are characteristic of infinite variance estimates. Thus, while the expected distance to the true value can be closely evaluated via the Kullback-Leibler divergence between the target and the proposal (which by the way is infinite when using a Normal as proposal and a Cauchy as target), there are realisations of the simulation path that can remain far from the true value and this for an arbitrary number of simulations. (I even wonder if, for a given simulation path, waiting long enough should not lead to those unbounded jumps.) The first result is frequentist, while the second is conditional, i.e., can occur for the single path we have just simulated… As I taught in class this very morning, I thus remain wary about using an infinite variance estimator. (And not only in connection with the harmonic mean quagmire. As shown below by the more extreme case of simulating an Exp(1) proposal for an Exp(1/10) target, where the mean is completely outside the range of estimates.) Wary, then, even though I find the enclosed result about the existence of a cut-off sample size associated with this L¹ measure quite astounding. 
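For readers who want to reproduce this kind of experiment, here is a minimal Python sketch of the toy example, with an Exp(1) proposal and an Exp(λ) target (λ=1/2 or 1/10). The function names, the seed, and the use of the standard library generator are my own assumptions, and the closed form of the Kullback-Leibler divergence, log λ + (1-λ)/λ, follows from a direct computation rather than from the paper.

```python
import math
import random

def importance_weight(x, target_rate=0.5):
    """Density ratio of the Exp(target_rate) target over the Exp(1) proposal at x."""
    return target_rate * math.exp((1.0 - target_rate) * x)

def running_is_estimates(n, target_rate=0.5, seed=0):
    """Sequence of importance sampling estimates of the target mean along one path."""
    rng = random.Random(seed)
    total, estimates = 0.0, []
    for i in range(1, n + 1):
        x = rng.expovariate(1.0)  # draw from the Exp(1) proposal
        total += x * importance_weight(x, target_rate)
        estimates.append(total / i)
    return estimates

def kl_target_vs_proposal(target_rate):
    """D(target | proposal) for an Exp(target_rate) target and an Exp(1) proposal."""
    return math.log(target_rate) + (1.0 - target_rate) / target_rate
```

Calling `running_is_estimates(10**5, seed=s)` over a range of seeds `s` and plotting the resulting paths is one way to look for the occasional large jumps discussed above, while `math.exp(kl_target_vs_proposal(0.1))` gives the order of magnitude of the cut-off sample size in the more extreme case.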

a third way of probability?

Posted in Books, Mountains, Statistics, Travel, University life on September 5, 2015 by xi'an

Because the title intrigued me (who would dream of claiming connection with Tony Blair’s “new” Labour move to centre-right?!), I downloaded William Briggs’ paper The Third Way of Probability & Statistics from arXiv and read it while secluded away, with no connection to the outside world, at Longmire, Mount Rainier National Park. Early morning at Paradise Inn. The subtitle of the document is “Beyond Testing and Estimation To Importance, Relevance, and Skill”. Actually, Longmire may have been the only place where I would read through the entire paper and its 14 pages, as the document somewhat sounds like a practical (?) joke. And almost made me wonder whether Mr Briggs was a pseudonym… And where the filter behind arXiv publishing principles was that day.

The notion behind Briggs’ third way is that parameters do not exist and that only conditional probability exists. Not exactly a novel perspective then. The first five pages go on repeating this principle in various ways, without ever embarking upon the implementation of the idea, at best referring to a future book in search of a friendly publisher… The remainder of the paper proceeds to analyse a college GPA dataset without ever explaining how the predictive distribution was constructed. The only discussion is about devising a tool to compare predictors, which is chosen as the continuous ranked probability score of Gneiting and Raftery (2007). Looking at those scores seems to encompass this third way advocated by the author, then, which sounds to me to be an awfully short lane into statistics. With no foray whatsoever into probability.

astronomical evidence

Posted in pictures, Statistics, University life on July 24, 2015 by xi'an

As I have a huge arXiv backlog and an even higher non-arXiv backlog, I cannot be certain I will find time to comment on those three recent and quite exciting postings connecting ABC with astro- and cosmo-statistics [thanks to Ewan for pointing out those to me!]:

arXiv frenzy

Posted in R, Statistics, University life on June 23, 2015 by xi'an

In the past few days, there have been so many arXiv postings of interest (presumably the NIPS submission effect!) that I cannot hope to cover them in the coming weeks! Hopefully, some will still come out on the ‘Og in the near future:

• Scalable Approximations of Marginal Posteriors in Variable Selection by Willem van den Boom, Galen Reeves, David B. Dunson
• The MCMC split sampler: A block Gibbs sampling scheme for latent Gaussian models by Óli Páll Geirsson, Birgir Hrafnkelsson, Daniel Simpson, Helgi Sigurðarson [also deserves a special mention for gathering only ***son authors!]
• Bayesian Nonparametric Modeling of Higher Order Markov Chains by Abhra Sarkar, David B. Dunson
• Convergence of Sequential Quasi-Monte Carlo Smoothing Algorithms by Mathieu Gerber, Nicolas Chopin
• Robust Bayesian inference via coarsening by Jeffrey W. Miller, David B. Dunson
• Expectation Particle Belief Propagation by Thibaut Lienart, Yee Whye Teh, Arnaud Doucet
• arXiv:1506.05860: Variational Gaussian Copula Inference by Shaobo Han, Xuejun Liao, David B. Dunson, Lawrence Carin
• arXiv:1506.05855: The Frequentist Information Criterion (FIC): The unification of information-based and frequentist inference by Colin H. LaMont, Paul A. Wiggins
• arXiv:1506.05757: Bayesian Inference for the Multivariate Extended-Skew Normal Distribution by Mathieu Gerber, Florian Pelgrin
• arXiv:1506.05741: Accelerated dimension-independent adaptive Metropolis by Yuxin Chen, David Keyes, Kody J.H. Law, Hatem Ltaief
• arXiv:1506.05269: Bayesian Survival Model based on Moment Characterization by Julyan Arbel, Antonio Lijoi, Bernardo Nipoti
• arXiv:1506.04778: Fast sampling with Gaussian scale-mixture priors in high-dimensional regression by Anirban Bhattacharya, Antik Chakraborty, Bani K. Mallick
• arXiv:1506.04416: Bayesian Dark Knowledge by Anoop Korattikara, Vivek Rathod, Kevin Murphy, Max Welling [a special mention for this title!]
• arXiv:1506.03693: Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference by Edward Meeds, Max Welling
• arXiv:1506.03074: Variational consensus Monte Carlo by Maxim Rabinovich, Elaine Angelino, Michael I. Jordan
• arXiv:1506.02564: Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families by Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, Arthur Gretton [comments coming soon!]

non-reversible MCMC

Posted in Books, Statistics, University life on May 21, 2015 by xi'an

While visiting Dauphine, Natesh Pillai and Aaron Smith pointed out this interesting paper of Joris Bierkens (Warwick) that had escaped my arXiv watch/monitoring. The paper is about turning Metropolis-Hastings algorithms into non-reversible versions, towards improving mixing.

In a discrete setting, a way to produce a non-reversible move is to mix the proposal kernel Q with its time-reversed version Q’ and use an acceptance probability of the form

$\dfrac{\epsilon\pi(y)Q(y,x)+(1-\epsilon)\pi(x)Q(x,y)}{\pi(x)Q(x,y)}$

where ε is any weight. This construction is generalised in the paper to any vorticity (skew-symmetric with zero sum rows) matrix Γ, with the acceptance probability

$\dfrac{\epsilon\Gamma(x,y)+\pi(y)Q(y,x)}{\pi(x)Q(x,y)}$

where ε is small enough to ensure all numerator values are non-negative. This is a rather annoying assumption in that, except for the special case derived from the time-reversed kernel, it has to be checked over all pairs (x,y). (I first thought it also implied the normalising constant of π but everything can be set in terms of the unnormalised version of π, Γ or ε included.) The paper establishes that the new acceptance probability preserves π as its stationary distribution. An alternative construction is to change the proposal from Q to H such that H(x,y)=Q(x,y)+εΓ(x,y)/π(x). Which seems more pertinent as not changing the proposal cannot improve the mixing behaviour of the chain that much. Still, the move to the non-reversible versions has the noticeable plus of decreasing the asymptotic variance of the Monte Carlo estimate for any integrable function. Any. (Those results are found in the physics literature of the 2000’s.)
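As a sanity check of the discrete construction, the following Python sketch builds the transition matrix on a toy four-state cycle and measures how far πP is from π. Everything in it is an assumption of mine rather than taken from the paper: the target π, the symmetric nearest-neighbour proposal Q, the ±1 vorticity matrix Γ, the value of ε, and the capping of the ratio at one in the usual Metropolis-Hastings fashion.

```python
def nonreversible_mh_kernel(pi, Q, Gamma, eps):
    """Transition matrix using min(1, (eps*Gamma(x,y) + pi(y)Q(y,x)) / (pi(x)Q(x,y)))."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x == y or Q[x][y] == 0.0:
                continue
            numerator = eps * Gamma[x][y] + pi[y] * Q[y][x]
            if numerator < 0.0:  # the condition to be checked over all pairs (x, y)
                raise ValueError("eps is too large: negative numerator")
            P[x][y] = Q[x][y] * min(1.0, numerator / (pi[x] * Q[x][y]))
        P[x][x] = 1.0 - sum(P[x])  # rejected proposals stay at x
    return P

def stationary_gap(pi, P):
    """Largest absolute entry of pi P - pi."""
    n = len(pi)
    return max(abs(sum(pi[x] * P[x][y] for x in range(n)) - pi[y]) for y in range(n))

# Hypothetical toy setting: four states on a cycle
n = 4
pi = [0.1, 0.2, 0.3, 0.4]
Q = [[0.5 if (y - x) % n in (1, n - 1) else 0.0 for y in range(n)] for x in range(n)]
# Skew-symmetric Gamma with zero row sums: +1 clockwise, -1 anticlockwise
Gamma = [[1.0 if (y - x) % n == 1 else -1.0 if (x - y) % n == 1 else 0.0
          for y in range(n)] for x in range(n)]
eps = 0.04
P = nonreversible_mh_kernel(pi, Q, Gamma, eps)
```

In this toy case `stationary_gap(pi, P)` is zero up to rounding, while π(0)P(0,1) and π(1)P(1,0) differ by exactly εΓ(0,1), so detailed balance fails and the chain is genuinely non-reversible.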

The extension to the continuous case is a wee bit more delicate. One needs to find an anti-symmetric vortex function g with zero integral [equivalent to the row sums being zero] such that g(x,y)+π(y)q(y,x)>0 and with the same support as π(x)q(x,y) so that the acceptance probability of {g(x,y)+π(y)q(y,x)}/π(x)q(x,y) leads to π being the stationary distribution. Once again g(x,y)=ε(π(y)q(y,x)-π(x)q(x,y)) is a natural candidate but it is unclear to me why it should work. As the paper only contains one illustration for the discretised Ornstein-Uhlenbeck model, with the above choice of g for a small enough ε (a point I fail to understand since any ε<1 should provide a positive g(x,y)+π(y)q(y,x)), it is also unclear to me that this modification (i) is widely applicable and (ii) is relevant for genuine MCMC settings.