Felipe Medina-Aguayo, Antony Lee and Gareth Roberts (all at Warwick University) have recently published—even though the paper was accepted a year ago—a paper in Statistics and Computing about a variant to the pseudo-marginal Metropolis-Hastings algorithm. The modification is to simulate an estimate of the likelihood or posterior at the current value of the Markov chain at every iteration, rather than reproducing the current estimate. The reason for this refreshment of the weight estimate is to prevent stickiness in the chain, when a random weight leads to a very high value of the posterior. Unfortunately, this change leads to a Markov chain with the wrong stationary distribution. When this stationary exists! The paper actually produces examples of transient noisy chains, even in simple cases such as a geometric target distribution. And even when taking the average of a large number of weights. But the paper also contains sufficient conditions, like negative weight moments or uniform ergodicity of the proposal, for the noisy chain to be geometrically ergodic. Even though the applicability of those conditions to complex targets is not always obvious.
Archive for particle filter
Pierre Del Moral, Ajay Jasra, Kody Law, and Yan Zhou just arXived a paper entitled Sequential Monte Carlo samplers for normalizing constants. Which obviously attracted my interest! The context is one of a sequential Monte Carlo problem, with an associated sequence of targets and of attached normalising constants. While the quantity of interest only relates to the final distribution in the sequence, using Mike Giles‘ multilevel Monte Carlo approach allows for a more accurate estimation and recycling all the past particles, thanks to the telescoping formula. And the sequential representation also allows for an unbiased estimator, as is well known in the sequential Monte Carlo literature. The paper derives accurate bounds on both the variances of two normalisation constant estimators and the costs of producing such estimators (assuming there is an index typo in Corollary 3.1, where L-2 should be L-1). The improvement when compared with traditional SMC is clear on the example contained in the paper. As I read the paper rather quickly and without much attention to the notations, I may have missed the point, but I did not see any conclusion on the choice of the particle population size at each iteration of the SMC. After asking Ajay about it, he pointed out that this size can be derived as
(with notations taken from the paper).
Simon Barthelmé gave his mini-course on EP, with loads of details on the implementation of the method. Focussing on the EP-ABC and MCMC-EP versions today. Leaving open the difficulty of assessing to which limit EP is converging. But mentioning the potential for asynchronous EP (on which I would like to hear more). Ironically using several times a logistic regression example, if not on the Pima Indians benchmark! He also talked about approximate EP solutions that relate to consensus MCMC. With a connection to Mark Beaumont’s talk at NIPS [at the time as mine!] on the comparison with ABC. While we saw several talks on EP during this week, I am still agnostic about the potential of the approach. It certainly produces a fast proxy to the true posterior and hence can be exploited ad nauseam in inference methods based on pseudo-models like indirect inference. In conjunction with other quick and dirty approximations when available. As in ABC, it would be most useful to know how far from the (ideal) posterior distribution does the approximation stands. Machine learning approaches presumably allow for an evaluation of the predictive performances, but less so for the modelling accuracy, even with new sampling steps. [But I know nothing, I know!]
Dennis Prangle presented some on-going research on high dimension [data] ABC. Raising the question of what is the true meaning of dimension in ABC algorithms. Or of sample size. Because the inference relies on the event d(s(y),s(y’))≤ξ or on the likelihood l(θ|x). Both one-dimensional. Mentioning Iain Murray’s talk at NIPS [that I also missed]. Re-expressing as well the perspective that ABC can be seen as a missing or estimated normalising constant problem as in Bornn et al. (2015) I discussed earlier. The central idea is to use SMC to simulate a particle cloud evolving as the target tolerance ξ decreases. Which supposes a latent variable structure lurking in the background.
Judith Rousseau gave her talk on non-parametric mixtures and the possibility to learn parametrically about the component weights. Starting with a rather “magic” result by Allman et al. (2009) that three repeated observations per individual, all terms in a mixture are identifiable. Maybe related to that simpler fact that mixtures of Bernoullis are not identifiable while mixtures of Binomial are identifiable, even when n=2. As “shown” in this plot made for X validated. Actually truly related because Allman et al. (2009) prove identifiability through a finite dimensional model. (I am surprised I missed this most interesting paper!) With the side condition that a mixture of p components made of r Bernoulli products is identifiable when p ≥ 2[log² r] +1, when log² is base 2-logarithm. And [x] the upper rounding. I also find most relevant this distinction between the weights and the remainder of the mixture as weights behave quite differently, hardly parameters in a sense.
“Today, a week or two spent reading Jaynes’ book can be a life-changing experience.” (p.8)
I received this book by Peter Grindrod, Mathematical underpinnings of Analytics (theory and applications), from Oxford University Press, quite a while ago. (Not that long ago since the book got published in 2015.) As a book for review for CHANCE. And let it sit on my desk and in my travel bag for the same while as it was unclear to me that it was connected with Statistics and CHANCE. What is [are?!] analytics?! I did not find much of a definition of analytics when I at last opened the book, and even less mentions of statistics or machine-learning, but Wikipedia told me the following:
“Analytics is a multidimensional discipline. There is extensive use of mathematics and statistics, the use of descriptive techniques and predictive models to gain valuable knowledge from data—data analysis. The insights from data are used to recommend action or to guide decision making rooted in business context. Thus, analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology.”
Barring the absurdity of speaking of a “multidimensional discipline” [and even worse of linking with the mathematical notion of dimension!], this tells me analytics is a mix of data analysis and decision making. Hence relying on (some) statistics. Fine.
“Perhaps in ten years, time, the mathematics of behavioural analytics will be common place: every mathematics department will be doing some of it.”(p.10)
First, and to start with some positive words (!), a book that quotes both Friedrich Nietzsche and Patti Smith cannot get everything wrong! (Of course, including a most likely apocryphal quote from the now late Yogi Berra does not partake from this category!) Second, from a general perspective, I feel the book meanders its way through chapters towards a higher level of statistical consciousness, from graphs to clustering, to hidden Markov models, without precisely mentioning statistics or statistical model, while insisting very much upon Bayesian procedures and Bayesian thinking. Overall, I can relate to most items mentioned in Peter Grindrod’s book, but mostly by first reconstructing the notions behind. While I personally appreciate the distanced and often ironic tone of the book, reflecting upon the author’s experience in retail modelling, I am thus wondering at which audience Mathematical underpinnings of Analytics aims, for a practitioner would have a hard time jumping the gap between the concepts exposed therein and one’s practice, while a theoretician would require more formal and deeper entries on the topics broached by the book. I just doubt this entry will be enough to lead maths departments to adopt behavioural analytics as part of their curriculum… Continue reading
In the few past days, there has been so many arXiv postings of interest—presumably the NIPS submission effect!—that I cannot hope to cover them in the coming weeks! Hopefully, some will still come out on the ‘Og in a near future:
- arXiv:1506.06629: Scalable Approximations of Marginal Posteriors in Variable Selection by Willem van den Boom, Galen Reeves, David B. Dunson
- arXiv:1506.06285: The MCMC split sampler: A block Gibbs sampling scheme for latent Gaussian models by Óli Páll Geirsson, Birgir Hrafnkelsson, Daniel Simpson, Helgi Sigurðarson [also deserves a special mention for gathering only ***son authors!]
- arXiv:1506.06268: Bayesian Nonparametric Modeling of Higher Order Markov Chains by Abhra Sarkar, David B. Dunson
- arXiv:1506.06117: Convergence of Sequential Quasi-Monte Carlo Smoothing Algorithms by Mathieu Gerber, Nicolas Chopin
- arXiv:1506.06101: Robust Bayesian inference via coarsening by Jeffrey W. Miller, David B. Dunson
- arXiv:1506.05934: Expectation Particle Belief Propagation by Thibaut Lienart, Yee Whye Teh, Arnaud Doucet
- arXiv:1506.05860: Variational Gaussian Copula Inference by Shaobo Han, Xuejun Liao, David B. Dunson, Lawrence Carin
- arXiv:1506.05855: The Frequentist Information Criterion (FIC): The unification of information-based and frequentist inference by Colin H. LaMont, Paul A. Wiggins
- arXiv:1506.05757: Bayesian Inference for the Multivariate Extended-Skew Normal Distribution by Mathieu Gerber, Florian Pelgrin
- arXiv:1506.05741: Accelerated dimension-independent adaptive Metropolis by Yuxin Chen, David Keyes, Kody J.H. Law, Hatem Ltaief
- arXiv:1506.05269: Bayesian Survival Model based on Moment Characterization by Julyan Arbel, Antonio Lijoi, Bernardo Nipoti
- arXiv:1506.04778: Fast sampling with Gaussian scale-mixture priors in high-dimensional regression by Anirban Bhattacharya, Antik Chakraborty, Bani K. Mallick
- arXiv:1506.04416: Bayesian Dark Knowledge by Anoop Korattikara, Vivek Rathod, Kevin Murphy, Max Welling [a special mention for this title!]
- arXiv:1506.03693: Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference by Edward Meeds, Max Welling
- arXiv:1506.03074: Variational consensus Monte Carlo by Maxim Rabinovich, Elaine Angelino, Michael I. Jordan
- arXiv:1506.02564: Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families by Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, Arthur Gretton [comments coming soon!]
When in Warwick last October, I met Simo Särkkä, who told me he had published an IMS monograph on Bayesian filtering and smoothing the year before. I thought it would be an appropriate book to review for CHANCE and tried to get a copy from Oxford University Press, unsuccessfully. I thus bought my own book that I received two weeks ago and took the opportunity of my Czech vacations to read it… [A warning pre-empting accusations of self-plagiarism: this is a preliminary draft for a review to appear in CHANCE under my true name!]
“From the Bayesian estimation point of view both the states and the static parameters are unknown (random) parameters of the system.” (p.20)
Bayesian filtering and smoothing is an introduction to the topic that essentially starts from ground zero. Chapter 1 motivates the use of filtering and smoothing through examples and highlights the naturally Bayesian approach to the problem(s). Two graphs illustrate the difference between filtering and smoothing by plotting for the same series of observations the successive confidence bands. The performances are obviously poorer with filtering but the fact that those intervals are point-wise rather than joint, i.e., that the graphs do not provide a confidence band. (The exercise section of that chapter is superfluous in that it suggests re-reading Kalman’s original paper and rephrases the Monty Hall paradox in a story unconnected with filtering!) Chapter 2 gives an introduction to Bayesian statistics in general, with a few pages on Bayesian computational methods. A first remark is that the above quote is both correct and mildly confusing in that the parameters can be consistently estimated, while the latent states cannot. A second remark is that justifying the MAP as associated with the 0-1 loss is incorrect in continuous settings. The third chapter deals with the batch updating of the posterior distribution, i.e., that the posterior at time t is the prior at time t+1. With applications to state-space systems including the Kalman filter. The fourth to sixth chapters concentrate on this Kalman filter and its extension, and I find it somewhat unsatisfactory in that the collection of such filters is overwhelming for a neophyte. And no assessment of the estimation error when the model is misspecified appears at this stage. And, as usual, I find the unscented Kalman filter hard to fathom! The same feeling applies to the smoothing chapters, from Chapter 8 to Chapter 10. Which mimic the earlier ones. Continue reading
Last evening, I read a nice paper with the above title by Drovandi, Pettitt and McCutchan, from QUT, Brisbane. Low count refers to observation with a small number of integer values. The idea is to mix ABC with the unbiased estimators of the likelihood proposed by Andrieu and Roberts (2009) and with particle MCMC… And even with a RJMCMC version. The special feature that makes the proposal work is that the low count features allows for a simulation of pseudo-observations (and auxiliary variables) that may sometimes authorise an exact constraint (that the simulated observation equals the true observation). And which otherwise borrows from Jasra et al. (2013) “alive particle” trick that turns a negative binomial draw into an unbiased estimation of the ABC target… The current paper helped me realise how powerful this trick is. (The original paper was arXived at a time I was off, so I completely missed it…) The examples studied in the paper may sound a wee bit formal, but they could lead to a better understanding of the method since alternatives could be available (?). Note that all those examples are not ABC per se in that the tolerance is always equal to zero.
The paper also includes reversible jump implementations. While it is interesting to see that ABC (in the authors’ sense) can be mixed with RJMCMC, it is delicate to get a feeling about the precision of the results, without a benchmark to compare to. I am also wondering about less costly alternatives like empirical likelihood and other ABC alternatives. Since Chris is visiting Warwick at the moment, I am sure we can discuss this issue next week there.