## on estimating constants…

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , on July 21, 2015 by xi'an

While I discussed on the ‘Og in the past the difference I saw between estimating an unknown parameter from a distribution and evaluating a normalising constant, evaluating such constants and hence handling [properly] doubly intractable models is obviously of the utmost importance! For this reason, Nial Friel, Helen Ogden and myself have put together a CRiSM workshop on the topic (with the tongue-in-cheek title of Estimating constants!), to be held at the University of Warwick next April 20-22.

The CRiSM workshop will focus on computational methods for approximating challenging normalising constants found in Monte Carlo, likelihood and Bayesian models. Such methods may be used in a wide range of problems: to compute intractable likelihoods, to find the evidence in Bayesian model selection, and to compute the partition function in Physics. The meeting will bring together different communities working on these related problems, some of which have developed original if little advertised solutions. It will also highlight the novel challenges associated with large data and highly complex models. Besides a dozen invited talks, the schedule will highlight two afternoon poster sessions with speed (2-5mn) oral presentations called ‘Elevator’ talks.

While 2016 is going to be quite busy with all kinds of meetings (MCMSkv, ISBA 2016, the CIRM Statistics month, AISTATS 2016, …), this should be an exciting two-day workshop, given the on-going activity in this area, and I thus suggest interested readers to mark the dates in their diary. I will obviously keep you posted about registration and accommodation when those entries are available.

## MCMskv, Lenzerheide, 4-7 Jan., 2016 [news #1]

Posted in Kids, Mountains, pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , on July 20, 2015 by xi'an

The BayesComp MCMski V [or MCMskv for short] has now its official website, once again maintained by Merrill Lietchy from Drexel University, Philadelphia, and registration is even open! The call for contributed sessions is now over, while the call for posters remains open until the very end. The novelty from the previous post is that there will be a “Breaking news” [in-between the Late news sessions at JSM and the crash poster talks at machine-learning conferences] session to highlight major advances among poster submissions. And that there will be an opening talk by Steve [the Bayesian] Scott on the 4th, about the frightening prospect of MCMC death!, followed by a round-table and a welcome reception, sponsored by the Swiss Supercomputing Centre. Hence the change in dates. Which still allows for arrivals in Zürich on the January 4th [be with you].

## Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , , , on July 15, 2015 by xi'an

“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.

As a first entry, Nicolas and James point out that the MAP can be derived by standard a Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their gold standard. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.

In the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be  cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.

## generating from a failure rate function [X’ed]

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , on July 4, 2015 by xi'an

While I now try to abstain from participating to the Cross Validated forum, as it proves too much of a time-consuming activity with little added value (in the sense that answers are much too often treated as disposable napkins by users who cannot be bothered to open a textbook and who usually do not exhibit any long-term impact of the provided answer, while clogging the forum with so many questions that the individual entries seem to get so little traffic, when compared say with the stackoverflow forum, to the point of making the analogy with disposable wipes more appropriate!), I came across a truly interesting question the other night. Truly interesting for me in that I had never considered the issue before.

The question is essentially wondering at how to simulate from a distribution defined by its failure rate function, which is connected with the density f of the distribution by

$\eta(t)=\frac{f(t)}{\int_t^\infty f(x)\,\text{d}x}=-\frac{\text{d}}{\text{d}t}\,\log \int_t^\infty f(x)\,\text{d}x$

From a purely probabilistic perspective, defining the distribution through f or through η is equivalent, as shown by the relation

$F(t)=1-\exp\left\{-\int_0^t\eta(x)\,\text{d}x\right\}$

but, from a simulation point of view, it may provide a different entry. Indeed, all that is needed is the ability to solve (in X) the equation

$\int_0^X\eta(x)\,\text{d}x=-\log(U)$

when U is a Uniform (0,1) variable. Which may help in that it does not require a derivation of f. Obviously, this also begs the question as to why would a distribution be defined by its failure rate function.

Posted in Books, Statistics, University life with tags , , , , , , , , , , on July 3, 2015 by xi'an

Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltán Szabó, and Arthur Gretton arXived paper about Kamiltonian MCMC generated comments from Michael Betancourt, Dan Simpson and myself, which themselves induced the following reply by Heiko, detailed enough to deserve a post of its own.

We certainly agree that the naive approach of using a non-parametric kernel density estimator on the chain history (as in [Christian’s book, Example 8.8]) as a *proposal* fails spectacularly on simple examples: the probability of proposing in unexplored regions is extremely small, independent of the current position of the MCMC trajectory. This is not what we do though. Instead, we use the gradient of a density estimator, and not the density itself, for our HMC proposal. Just like KAMH, KMC lite in fact falls back to Random Walk Metropolis in previously unexplored regions and therefore inherits geometric ergodicity properties. This in particular includes the ability to explore previously “unseen” regions, even if adaptation has stopped. I implemented a simple illustration and comparison here.

ABC example.
The main point of the ABC example, is that our method does not suffer from the additional bias from Gaussian synthetic likelihoods when being confronted with skewed models. But there is also a computational efficiency aspect. The scheme by Meeds et al. relies on finite differences and requires $2D$ simulations from the likelihood *every time* the gradient is evaluated (i.e. every leapfrog iteration) and H-ABC discards this valuable information subsequently. In contrast, KMC accumulates gradient information from simulations: it only requires to simulate from the likelihood *once* in the accept/reject step after the leapfrog integration (where gradients are available in closed form). The density is only updated then, and not during the leapfrog integration. Similar work on speeding up HMC via energy surrogates can be applied in the tall data scenario.

Approximating HMC when gradients aren’t available is in general a difficult problem. One approach (like surrogate models) may work well in some scenarios while a different approach (i.e. Monte Carlo) may work better in others, and the ABC example showcases such a case. We very much doubt that one size will fit all — but rather claim that it is of interest to find and document these scenarios.
Michael raised the concern that intractable gradients in the Pseudo-Marginal case can be avoided by running an MCMC chain on the joint space (e.g. $(f,\theta)$ for the GP classifier). To us, however, the situation is not that clear. In many cases, the correlations between variables can cause convergence problems (see e.g. here) for the MCMC and have to be addressed by de-correlation schemes (as here), or e.g. by incorporating geometric information, which also needs fixes as Michaels’s very own one. Which is the method of choice with a particular statistical problem at hand? Which method gives the smallest estimation error (if that is the goal?) for a given problem? Estimation error per time? A thorough comparison of these different classes of algorithms in terms of performance related to problem class would help here. Most papers (including ours) only show experiments favouring their own method.

GP estimator quality.
Finally, to address Michael’s point on the consistency of the GP estimator of the density gradient: this is discussed In the original paper on the infinite dimensional exponential family. As Michael points out, higher dimensional problems are unavoidably harder, however the specific details are rather involved. First, in terms of theory: both the well-specified case (when the natural parameter is in the RKHS, Section 4), and the ill-specified case (the natural parameter is in a “reasonable”, larger class of functions, Section 5), the estimate is consistent. Consistency is obtained in various metrics, including the L² error on gradients. The rates depend on how smooth the natural parameter is (and indeed a poor choice of hyper-parameter will mean slower convergence). The key point, in regards to Michael’s question, is that the smoothness requirement becomes more restrictive as the dimension increases: see Section 4.2, “range space assumption”.
Second, in terms of practice: we have found in experiments that the infinite dimensional exponential family does perform considerably better than a kernel density estimator when the dimension increases (Section 6). In other words, our density estimator can take advantage of smoothness properties of the “true” target density to get good convergence rates. As a practical strategy for hyper-parameter choice, we cross-validate, which works well empirically despite being distasteful to Bayesians. Experiments in the KMC paper also indicate that we can scale these estimators up to dimensions in the 100s on Laptop computers (unlike most other gradient estimation techniques in HMC, e.g. the ones in your HMC & sub-sampling note, or the finite differences in Meeds et al).

## Bayesian computation: a summary of the current state, and samples backwards and forwards

Posted in Books, Statistics, University life with tags , , , , , , , , on June 25, 2015 by xi'an

“The Statistics and Computing journal gratefully acknowledges the contributions for this special issue, celebrating 25 years of publication. In the past 25 years, the journal has published innovative, distinguished research by leading scholars and professionals. Papers have been read by thousands of researchers world-wide, demonstrating the global importance of this field. The Statistics and Computing journal looks forward to many more years of exciting research as the field continues to expand.” Mark Girolami, Editor in Chief for The Statistics and Computing journal

Our joint [Peter Green, Krzysztof Łatuszyński, Marcelo Pereyra, and myself] review [open access!] on the important features of Bayesian computation has already appeared in the special 25th anniversary issue of Statistics & Computing! Along with the following papers

which means very good company, indeed! And happy B’day to Statistics & Computing!

## arXiv frenzy

Posted in R, Statistics, University life with tags , , , , , , on June 23, 2015 by xi'an

In the few past days, there has been so many arXiv postings of interest—presumably the NIPS submission effect!—that I cannot hope to cover them in the coming weeks! Hopefully, some will still come out on the ‘Og in a near future:

• Scalable Approximations of Marginal Posteriors in Variable Selection by Willem van den Boom, Galen Reeves, David B. Dunson
• The MCMC split sampler: A block Gibbs sampling scheme for latent Gaussian models by Óli Páll Geirsson, Birgir Hrafnkelsson, Daniel Simpson, Helgi Sigurðarson [also deserves a special mention for gathering only ***son authors!]
• Bayesian Nonparametric Modeling of Higher Order Markov Chains by Abhra Sarkar, David B. Dunson
• Convergence of Sequential Quasi-Monte Carlo Smoothing Algorithms by Mathieu Gerber, Nicolas Chopin
• Robust Bayesian inference via coarsening by Jeffrey W. Miller, David B. Dunson
• Expectation Particle Belief Propagation by Thibaut Lienart, Yee Whye Teh, Arnaud Doucet
• arXiv:1506.05860: Variational Gaussian Copula Inference by Shaobo Han, Xuejun Liao, David B. Dunson, Lawrence Carin
• arXiv:1506.05855: The Frequentist Information Criterion (FIC): The unification of information-based and frequentist inference by Colin H. LaMont, Paul A. Wiggins
• arXiv:1506.05757: Bayesian Inference for the Multivariate Extended-Skew Normal Distribution by Mathieu Gerber, Florian Pelgrin
• arXiv:1506.05741: Accelerated dimension-independent adaptive Metropolis by Yuxin Chen, David Keyes, Kody J.H. Law, Hatem Ltaief
• arXiv:1506.05269: Bayesian Survival Model based on Moment Characterization by Julyan Arbel, Antonio Lijoi, Bernardo Nipoti
• arXiv:1506.04778: Fast sampling with Gaussian scale-mixture priors in high-dimensional regression by Anirban Bhattacharya, Antik Chakraborty, Bani K. Mallick
• arXiv:1506.04416: Bayesian Dark Knowledge by Anoop Korattikara, Vivek Rathod, Kevin Murphy, Max Welling [a special mention for this title!]
• arXiv:1506.03693: Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference by Edward Meeds, Max Welling
• arXiv:1506.03074: Variational consensus Monte Carlo by Maxim Rabinovich, Elaine Angelino, Michael I. Jordan
• arXiv:1506.02564: Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families by Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, Arthur Gretton [comments coming soon!]