## HMC sampling in Bayesian empirical likelihood computation

Posted in Statistics on March 31, 2017 by xi'an

While working on the Series B’log the other day I noticed this paper by Chaudhuri et al. on Hamiltonian Monte Carlo and empirical likelihood: how exciting!!! Here is the abstract of the paper:

We consider Bayesian empirical likelihood estimation and develop an efficient Hamiltonian Monte Carlo method for sampling from the posterior distribution of the parameters of interest. The method proposed uses hitherto unknown properties of the gradient of the underlying log-empirical-likelihood function. We use results from convex analysis to show that these properties hold under minimal assumptions on the parameter space, prior density and the functions used in the estimating equations determining the empirical likelihood. Our method employs a finite number of estimating equations and observations but produces valid semi-parametric inference for a large class of statistical models including mixed effects models, generalized linear models and hierarchical Bayes models. We overcome major challenges posed by complex, non-convex boundaries of the support routinely observed for empirical likelihood which prevent efficient implementation of traditional Markov chain Monte Carlo methods like random-walk Metropolis–Hastings sampling etc. with or without parallel tempering. A simulation study confirms that our method converges quickly and draws samples from the posterior support efficiently. We further illustrate its utility through an analysis of a discrete data set in small area estimation.

[The comment is reposted from Series B’log, where I wrote it first.]

It is of particular interest for me [disclaimer: I was not involved in the review of this paper!] as we worked on ABC thru empirical likelihood, which is about the reverse of the current paper in terms of motivation: when faced with a complex model, we substitute an empirical likelihood version for the real thing, run simulations from the prior distribution and use the empirical likelihood as a proxy. With possible intricacies when the data is not iid (an issue we also met with Wasserstein distances.) In this paper the authors instead consider working on an empirical likelihood as their starting point and derive an HMC algorithm to do so. The idea is striking in that, by nature, an empirical likelihood is not a very smooth object and hence does not seem open to producing gradients and Hessians. As illustrated by Figure 1 in the paper, which is so spiky at places that one may wonder at the representativity of such graphs.

I have always had a persistent worry about the ultimate validity of treating the empirical likelihood as a genuine likelihood, from the fact that it is the result of an optimisation problem to the issue that the approximate empirical distribution has a finite (data-dependent) support, hence is completely orthogonal to the true distribution. And to the one that the likelihood function is zero outside the convex hull of the defining equations… (For one thing, this empirical likelihood is always bounded by one but this may be irrelevant after all!)

The computational difficulty in handling the empirical likelihood starts with its support. Eliminating values of the parameter for which this empirical likelihood is zero amounts to checking whether zero belongs to the above convex hull. A hard (NP hard?) problem. (Although I do not understand why the authors dismiss the token observations of Owen and others. The argument that Bayesian analysis does more than maximising a likelihood seems to confuse the empirical likelihood as a product of a maximisation step with the empirical likelihood as a function of the parameter that can be used as any other function.)
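To make the support issue concrete, here is a minimal sketch of the empirical likelihood for the simplest possible case, a scalar mean with the single estimating equation g(x, θ) = x − θ. This toy case and the function name `log_empirical_likelihood` are my own assumptions, not taken from the paper. In one dimension the convex hull check reduces to asking whether θ lies strictly between the smallest and largest observations; in higher dimensions the same membership question can be written as a linear-programming feasibility problem (non-negative weights summing to one that average the g's to zero), which I mention only as a general fact rather than as what the authors do.

```python
import math

def log_empirical_likelihood(theta, xs, n_bisect=200):
    """Log of prod(n * w_i) under the single constraint sum_i w_i * (x_i - theta) = 0."""
    g = [x - theta for x in xs]
    # Support check: zero must lie strictly inside the convex hull of the g_i,
    # which in one dimension is the open interval (min(g), max(g)).
    if min(g) >= 0.0 or max(g) <= 0.0:
        return -math.inf
    # Optimal weights are w_i = 1 / (n * (1 + lam * g_i)), where lam solves
    # sum_i g_i / (1 + lam * g_i) = 0 while keeping every 1 + lam * g_i positive.
    lo, hi = -1.0 / max(g), -1.0 / min(g)
    lam = 0.5 * (lo + hi)
    for _ in range(n_bisect):
        score = sum(gi / (1.0 + lam * gi) for gi in g)
        if score > 0.0:      # the score decreases in lam, so the root is to the right
            lo = lam
        else:
            hi = lam
        new_lam = 0.5 * (lo + hi)
        if new_lam == lo or new_lam == hi:   # interval exhausted at float precision
            break
        lam = new_lam
    return -sum(math.log(1.0 + lam * gi) for gi in g)

xs = [1.2, 0.4, 2.5, 3.1, 1.9]           # hypothetical data
at_mean = log_empirical_likelihood(sum(xs) / len(xs), xs)
inside = log_empirical_likelihood(1.0, xs)
outside = log_empirical_likelihood(5.0, xs)
```

The value at the sample mean is (numerically) zero on the log scale, which is the bound by one mentioned above, and the function returns minus infinity as soon as θ leaves the range of the data. The sketch is not hardened against values of θ extremely close to the boundary.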

In the simple regression example (pp.297-299), I find the choice of the moment constraints puzzling, in that they address the mean of the white noise (zero) and the covariance with the regressors (zero too). Puzzling because my definition of the regression model is conditional on the regressors and hence does not imply anything on their distribution. In a sense this is another model. But I also note that the approach focuses on the distribution of the reconstituted white noises, as we did in the PNAS paper. (The three examples processed in the paper are all simple and could be processed by regular MCMC, thus making the preliminary step of calling for an empirical likelihood somewhat artificial unless I missed the motivation. The paper also does not seem to discuss the impact of the choice of the moment constraints or the computing constraints involved by a function that is itself the result of a maximisation problem.)

A significant part of the paper is dedicated to the optimisation problem and the exclusion of the points on the boundary. Which sounds like a non-problem in continuous settings. However, this appears to be of importance for running an HMC as it cannot evade the support (without token observations). In principle, HMC should not leave this support since the gradient diverges at the boundary, but in practice the leapfrog approximation may lead the path outside. I would have (naïvely?) suggested to reject moves when this happens and start again but the authors consider that proper choices of the calibration factors of HMC can avoid this problem. Which seems to induce a practical issue by turning the algorithm into an adaptive version.
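The naïve rejection idea can be sketched as follows. This is not the authors' algorithm: the target is a stand-in, a Beta(2, 2) kernel on (0, 1) chosen only because its log-density also goes to minus infinity at the boundary, and the step size and number of leapfrog steps are arbitrary assumptions. Whenever a leapfrog position falls outside the support, the whole proposal is rejected and the chain stays where it is. Since the reversed trajectory visits the same positions, this early exit should preserve detailed balance.

```python
import math
import random

def log_target(x):
    """Stand-in log-density on (0, 1): a Beta(2, 2) kernel, -inf outside the support."""
    if x <= 0.0 or x >= 1.0:
        return -math.inf
    return math.log(x) + math.log(1.0 - x)

def grad_log_target(x):
    """Gradient of log_target inside (0, 1); it diverges at both end points."""
    return 1.0 / x - 1.0 / (1.0 - x)

def hmc_step(x, step_size, n_leapfrog, rng):
    """One HMC transition; returns (new_state, accepted)."""
    p0 = rng.gauss(0.0, 1.0)
    x_new = x
    p = p0 + 0.5 * step_size * grad_log_target(x_new)    # initial half step for momentum
    for i in range(n_leapfrog):
        x_new += step_size * p                            # full step for position
        if not 0.0 < x_new < 1.0:
            return x, False                               # left the support: reject outright
        scale = 1.0 if i < n_leapfrog - 1 else 0.5        # last momentum update is a half step
        p += scale * step_size * grad_log_target(x_new)
    log_ratio = (log_target(x_new) - 0.5 * p * p) - (log_target(x) - 0.5 * p0 * p0)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return x_new, True
    return x, False

def run_chain(n_iter, x0=0.5, step_size=0.1, n_leapfrog=10, seed=1):
    rng = random.Random(seed)
    x, samples, n_accept = x0, [], 0
    for _ in range(n_iter):
        x, accepted = hmc_step(x, step_size, n_leapfrog, rng)
        samples.append(x)
        n_accept += accepted
    return samples, n_accept / n_iter

samples, acceptance_rate = run_chain(5000)
```

Every stored value stays inside (0, 1) by construction. The price is wasted gradient evaluations on trajectories that escape, which is presumably why tuning the calibration factors looks attractive.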

As a last point, I would have enjoyed seeing a comparison of the performances against our (A)BCel version, which would have been straightforward to implement in the simple examples handled by the paper. (This could be a neat undergraduate project for next year!)

## Statistical rethinking [book review]

Posted in Books, Kids, R, Statistics, University life on April 6, 2016 by xi'an

Statistical Rethinking: A Bayesian Course with Examples in R and Stan is a new book by Richard McElreath that CRC Press sent me for review in CHANCE. While the book was already discussed on Andrew’s blog three months ago, and [rightly so!] enthusiastically recommended by Rasmus Bååth on Amazon, here are the reasons why I am quite impressed by Statistical Rethinking!

“Make no mistake: you will wreck Prague eventually.” (p.10)

While the book has a lot in common with Bayesian Data Analysis, from being in the same CRC series to adopting a pragmatic and weakly informative approach to Bayesian analysis, to supporting the use of Stan, it also nicely develops its own ecosystem and idiosyncrasies, with a noticeable Jaynesian bent. To start with, I like the highly personal style with clear attempts to make the concepts memorable for students by resorting to external concepts. The best example is the call to the myth of the golem in the first chapter, which McElreath uses as a warning for the use of statistical models (which almost are anagrams to golems!). Golems and models [and robots, another concept invented in Prague!] are man-made devices that strive to accomplish the goal set to them without heeding the consequences of their actions. This first chapter of Statistical Rethinking is setting the ground for the rest of the book and gets quite philosophical (albeit in a readable way!) as a result. In particular, there is a most coherent call against hypothesis testing, which by itself justifies the title of the book.

## capture-recapture homeless deaths

Posted in Statistics, Travel, University life on August 28, 2014 by xi'an

In the newspaper I grabbed in the corridor to my plane today (flying to Bristol to attend the SuSTaIn image processing workshop on “High-dimensional Stochastic Simulation and Optimisation in Image Processing” where I was kindly invited and most readily accepted the invitation), I found a two-page entry on estimating the number of homeless deaths using capture-recapture. Besides the sheer concern about the very high mortality rate among homeless persons (expected lifetime, 48 years; around 7000 deaths in France between 2008 and 2010) and the dreadful realisation that there is an increasing number of kids dying in the streets, I was obviously interested in this use of capture-recapture methods as I had briefly interacted with researchers from INED working on estimating the number of (living) homeless persons about 15 years ago. Glancing at the original paper once I had landed, I alas found no methodological innovation in the approach, which was based on the simplest maximum likelihood estimate. I wonder whether or not more advanced models and [Bayesian] methods of inference could [or should] be used on such data. Like introducing covariates in the process. For instance, when conditioning the probability of (cross-)detection on the cause of death.
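For readers unfamiliar with the method, the simplest two-list version can be sketched in a few lines. I am assuming here that the estimate in question is of the Lincoln-Petersen type (the post does not say which estimator the paper uses), and the counts below are made up for illustration. With n1 deaths recorded by a first source, n2 by a second, and m found in both, the population size is estimated by n1·n2/m; the Chapman variant is a common small-sample correction.

```python
def lincoln_petersen(n1, n2, m):
    """Two-list capture-recapture estimate of the total population size."""
    if m <= 0:
        raise ValueError("the two lists must share at least one individual")
    return n1 * n2 / m

def chapman(n1, n2, m):
    """Chapman's bias-corrected variant, defined even when m is zero."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# Hypothetical counts, not taken from the newspaper article or the paper.
n_first, n_second, n_both = 200, 150, 30
estimate = lincoln_petersen(n_first, n_second, n_both)
corrected = chapman(n_first, n_second, n_both)
```

Both versions assume that the two sources detect deaths independently and with the same probability for everyone, which is exactly the assumption that covariates such as the cause of death would relax.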

## reading classics (#4,5,6)

Posted in Books, Kids, Statistics, University life on December 9, 2013 by xi'an

This week, thanks to a lack of clear instructions (from me) to my students in the Reading Classics student seminar, four students showed up with a presentation! Since I had planned for two teaching blocks, three of them managed to fit within the three hours, while the last one nicely accepted to wait till next week to present a paper by David Cox…

The first paper discussed therein was A new look at the statistical model identification, written in 1974 by Hirotugu Akaike. And presenting the AIC criterion. My student Rozan asked to give the presentation in French as he struggled with English, but it was still a challenge for him and he ended up being too close to the paper to provide a proper perspective on why AIC is written the way it is and why it is (potentially) relevant for model selection. And why it is not such a definitive answer to the model selection problem. This is not the simplest paper in the list, to be sure, but some intuition could have been built from the linear model, rather than producing the case of an ARMA(p,q) model without much explanation. (I actually wonder why the penalty for this model is (p+q)/T, rather than (p+q+1)/T for the additional variance parameter.) Or from simulations run on the performance of AIC versus other xIC’s…
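The linear model intuition can be sketched in a few lines. The data and the helper `gaussian_aic` below are hypothetical; the only facts used are that AIC is minus twice the maximised log-likelihood plus twice the number of parameters, and that the Gaussian log-likelihood maximised over the variance depends on the fit only through the residual sum of squares. One possible reading of the variance question is that counting the variance adds the same constant to every candidate model, so the ranking is unchanged; I am not claiming this is Akaike's own justification.

```python
import math

def gaussian_aic(rss, n, k, count_variance=True):
    """AIC for a Gaussian model with k mean parameters and residual sum of squares rss."""
    max_loglik = -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)
    n_params = k + (1 if count_variance else 0)
    return -2.0 * max_loglik + 2.0 * n_params

# Hypothetical data, roughly linear in x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Model 1: constant mean. Model 2: simple linear regression (closed-form least squares).
rss_const = sum((y - y_bar) ** 2 for y in ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar
rss_lin = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))

gap_with_variance = gaussian_aic(rss_lin, n, 2) - gaussian_aic(rss_const, n, 1)
gap_without_variance = (gaussian_aic(rss_lin, n, 2, count_variance=False)
                        - gaussian_aic(rss_const, n, 1, count_variance=False))
```

The two gaps coincide, and both favour the regression model on these data.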

The second paper was another classic, the original GLM paper by John Nelder and his coauthor Wedderburn, published in 1972 in Series B. A slightly easier paper, in that the notion of a generalised linear model is presented therein, with mathematical properties linking the (conditional) mean of the observation with the parameters and several examples that could be discussed. Plus having the book as a backup. My student Ysé did a reasonable job in presenting the concepts, but she would have benefited from this extra week to properly include the computations she ran in R around the glm() function… (The definition of the deviance was somewhat deficient, although this led to a small discussion during the class as to how the analysis of deviance was extending the then flourishing analysis of variance.) In the generic definition of the generalised linear models, I was also reminded of the generality of the nuisance parameter modelling, which made the part of interest appear as an exponential shift on the original (nuisance) density.
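Since the deviance caused some confusion, here is a minimal sketch of the usual definition, twice the gap between the saturated log-likelihood (each fitted mean equal to its observation) and the fitted log-likelihood, written out for two families. The function names and the counts are mine, for illustration only. The Gaussian case with unit variance is just the residual sum of squares, which is one way to see the link with the analysis of variance.

```python
import math

def poisson_deviance(ys, mus):
    """2 * sum[y * log(y / mu) - (y - mu)], with the convention 0 * log(0) = 0."""
    total = 0.0
    for y, mu in zip(ys, mus):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total

def gaussian_deviance(ys, mus):
    """Deviance of a unit-variance Gaussian model: the residual sum of squares."""
    return sum((y - mu) ** 2 for y, mu in zip(ys, mus))

counts = [2, 0, 5, 3]                                # hypothetical Poisson counts
null_fit = [sum(counts) / len(counts)] * len(counts) # intercept-only fitted means
null_deviance = poisson_deviance(counts, null_fit)
```

A saturated fit gives a deviance of zero, and any other fit a positive value, in the same way that residual sums of squares are compared across nested linear models.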

The third paper, presented by Bong, was yet another classic, namely the FDR paper, Controlling the false discovery rate, of Benjamini and Hochberg in Series B (which was recently promoted to the should-have-been-a-Read-Paper category by the RSS Research Committee and discussed at the Annual RSS Conference in Edinburgh four years ago, as well as published in Series B). This 2010 discussion would actually have been a good start to discuss the paper in class, but Bong was not aware of it and mentioned earlier papers extending the 1995 classic. She gave a decent presentation of the problem and of the solution of Benjamini and Hochberg but I wonder how much of the novelty of the concept the class grasped. (I presume everyone was getting tired by then as I was the only one asking questions.) The slides somewhat made it look too much like a simulation experiment… (Unsurprisingly, the presentation did not include any Bayesian perspective on the approach, even though such perspectives are quite natural and emerged very quickly once the paper was published. I remember for instance the Valencia 7 meeting in Tenerife where Larry Wasserman discussed the Bayesian-frequentist agreement in multiple testing.)
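For the record, the Benjamini and Hochberg step-up procedure itself is short enough to sketch: sort the m p-values, find the largest rank k such that the k-th smallest p-value is at most k·q/m, and reject the k hypotheses with the smallest p-values. The function name and the p-values below are illustrative assumptions.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Indices of the hypotheses rejected by the BH step-up rule at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k = rank                      # keep the largest rank passing its threshold
    return sorted(order[:k])

# Hypothetical p-values.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, q=0.05)

# Step-up behaviour: only the largest p-value meets its own threshold,
# yet all four hypotheses end up rejected.
step_up = benjamini_hochberg([0.03, 0.04, 0.045, 0.049], q=0.05)
```

The second call is the part that tends to surprise a class: the rule looks at the largest passing rank, not at each p-value against its own threshold.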

## probit posterior mean

Posted in Statistics, University life on March 9, 2012 by xi'an

In a recent arXiv report, Yuzo Maruyama shows that the posterior expectation of a probit parameter has an almost closed form (under a flat prior), namely

$\mathbb{E}[\beta|X,y] = (X^TX)^{-1} X^T\{2\text{diag}(y)-I_n\}\omega(X,y)$

where ω involves the integration of two quadratic forms over the n-dimensional unit sphere. Although this does not help directly with the MCMC derivation of the full posterior, this is an interesting lemma which shows a close proximity with the standard least squares estimate in linear regression.

## understanding computational Bayesian statistics: a reply from Bill Bolstad

Posted in Books, R, Statistics, University life on October 24, 2011 by xi'an

Bill Bolstad wrote a reply to my review of his book Understanding computational Bayesian statistics last week and here it is, unedited except for the first paragraph where he thanks me for the opportunity to respond, “so readers will see that the book has some good features beyond having a “nice cover”.” (!) I simply processed the Word document into an html output and put a Read More bar in the middle as it is fairly detailed. (As indicated at the beginning of my review, I am obviously biased on the topic: thus, I will not comment on the reply, lest we get into an infinite regress!)

The target audience for this book are upper division undergraduate students and first year graduate students in statistics whose prior statistical education has been mostly frequentist based. Many will have knowledge of Bayesian statistics at an introductory level similar to that in my first book, but some will have no previous Bayesian statistics course. Being self-contained, it will also be suitable for statistical practitioners without a background in Bayesian statistics.

The book aims to show that:

1. Bayesian statistics makes different assumptions from frequentist statistics, and these differences lead to the advantages of the Bayesian approach.
2. Finding the proportional posterior is easy, however finding the exact posterior distribution is difficult in practice, even numerically, especially for models with many parameters.
3. Inferences can be based on a (random) sample from the posterior.
4. There are methods for drawing samples from the incompletely known posterior.
5. Direct reshaping methods become inefficient for models with large number of parameters.
6. We can find a Markov chain that has the long-run distribution with the same shape as the posterior. A draw from this chain after it has run a long time can be considered a random draw from the posterior
7. We have many choices in setting up a Markov chain Monte Carlo. The book shows the things that should be considered, and how problems can be detected from sample output from the chain.
8. An independent Metropolis-Hastings chain with a suitable heavy-tailed candidate distribution will perform well, particularly for regression type models. The book shows all the details needed to set up such a chain.
9. The Gibbs sampling algorithm is especially well suited for hierarchical models.
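Point 8 of the list can be illustrated with a minimal sketch of an independent Metropolis-Hastings chain. This is not code from the book or its website: the standard normal target, the Cauchy candidate with scale 2, and all function names are assumptions picked for a toy example. The heavy tails of the candidate keep the ratio of target to candidate bounded, which is what makes this kind of chain behave well.

```python
import math
import random

def log_target(x):
    """Unnormalised log-density of the toy target, a standard normal."""
    return -0.5 * x * x

def log_candidate(x, scale):
    """Unnormalised log-density of a Cauchy candidate with the given scale."""
    return -math.log(1.0 + (x / scale) ** 2)

def independence_mh(n_iter, scale=2.0, seed=0):
    rng = random.Random(seed)
    x, chain, n_accept = 0.0, [], 0
    for _ in range(n_iter):
        # Cauchy draw by inversion; it does not depend on the current state.
        y = scale * math.tan(math.pi * (rng.random() - 0.5))
        # Ratio of importance weights target/candidate at the proposal and the current state.
        log_ratio = (log_target(y) - log_candidate(y, scale)) \
            - (log_target(x) - log_candidate(x, scale))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y
            n_accept += 1
        chain.append(x)
    return chain, n_accept / n_iter

chain, acceptance_rate = independence_mh(20000)
chain_mean = sum(chain) / len(chain)
chain_var = sum((v - chain_mean) ** 2 for v in chain) / len(chain)
```

Swapping the roles, with a normal candidate for a heavier-tailed target, would make the weight ratio unbounded and the chain can then get stuck in the tails for long stretches.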

I am satisfied that the book has achieved the goals that I set out above. The title “Understanding Computational Bayesian Statistics” explains what this book is about. I want the reader (who has background in frequentist statistics) to understand how computational Bayesian statistics can be applied to models he/she is familiar with. I keep an up-to-date errata on the book website. The website also contains the computer software used in the book. This includes Minitab macros and R-functions. These were used because they had good data analysis capabilities that could be used in conjunction with the simulations. The website also contains Fortran executables that are much faster for models containing more parameters, and WinBUGS code for the examples in the book.

## recent arXiv postings

Posted in Statistics, University life on October 17, 2011 by xi'an

Three interesting recent arXiv postings and not enough time to read them all and in the ‘Og bind them! (Of course, comments from readers welcome!)

Formulating a statistical inverse problem as one of inference in a Bayesian model has great appeal, notably for what this brings in terms of coherence, the interpretability of regularisation penalties, the integration of all uncertainties, and the principled way in which the set-up can be elaborated to encompass broader features of the context, such as measurement error, indirect observation, etc. The Bayesian formulation comes close to the way that most scientists intuitively regard the inferential task, and in principle allows the free use of subject knowledge in probabilistic model building. However, in some problems where the solution is not unique, for example in ill-posed inverse problems, it is important to understand the relationship between the chosen Bayesian model and the resulting solution. Taking emission tomography as a canonical example for study, we present results about consistency of the posterior distribution of the reconstruction, and a general method to study convergence of posterior distributions. To study efficiency of Bayesian inference for ill-posed linear inverse problems with constraint, we prove a version of the Bernstein-von Mises theorem for nonregular Bayesian models.

(Certainly unlikely to please the member of the audience in Zürich who questioned my Bayesian credentials for considering “true” models and consistency….)

Recently, Andrieu, Doucet and Holenstein (2010) introduced a general framework for using particle filters (PFs) to construct proposal kernels for Markov chain Monte Carlo (MCMC) methods. This framework, termed Particle Markov chain Monte Carlo (PMCMC), was shown to provide powerful methods for joint Bayesian state and parameter inference in nonlinear/non-Gaussian state-space models. However, the mixing of the resulting MCMC kernels can be quite sensitive, both to the number of particles used in the underlying PF and to the number of observations in the data. In this paper we suggest alternatives to the three PMCMC methods introduced in Andrieu et al. (2010), which are much more robust to a low number of particles as well as a large number of observations. We consider some challenging inference problems and show in a simulation study that, for problems where existing PMCMC methods require around 1000 particles, the proposed methods provide satisfactory results with as few as 5 particles.

(I have not read the paper in enough depth to be critical, however “hard” figures like 5, or 10³, are always suspicious in that they cannot carry to the general case…)
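To see why the number of particles matters at all, here is a minimal sketch of a plain bootstrap particle filter returning a log-likelihood estimate, the quantity that particle MCMC plugs into its acceptance ratio. It is not one of the methods proposed in the paper: the AR(1)-plus-noise model, its parameter values, and every function name are assumptions chosen for a toy experiment, and the filter resamples multinomially at every step.

```python
import math
import random

def normal_logpdf(x, mean, sd):
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2

def simulate(n_obs, phi, sigma_x, sigma_y, rng):
    """Toy model: stationary AR(1) state observed with Gaussian noise."""
    x = rng.gauss(0.0, sigma_x / math.sqrt(1.0 - phi * phi))
    ys = []
    for _ in range(n_obs):
        ys.append(x + rng.gauss(0.0, sigma_y))
        x = phi * x + rng.gauss(0.0, sigma_x)
    return ys

def bootstrap_loglik(ys, phi, sigma_x, sigma_y, n_particles, rng):
    """Bootstrap particle filter estimate of the log-likelihood of ys."""
    sd0 = sigma_x / math.sqrt(1.0 - phi * phi)
    particles = [rng.gauss(0.0, sd0) for _ in range(n_particles)]
    loglik = 0.0
    for t, y in enumerate(ys):
        if t > 0:   # propagate through the state equation
            particles = [phi * x + rng.gauss(0.0, sigma_x) for x in particles]
        logw = [normal_logpdf(y, x, sigma_y) for x in particles]
        top = max(logw)                                   # stabilise the exponentials
        weights = [math.exp(lw - top) for lw in logw]
        loglik += top + math.log(sum(weights) / n_particles)
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return loglik

def spread(ys, n_particles, n_rep=20):
    """Standard deviation of the estimate over independent filter runs."""
    runs = [bootstrap_loglik(ys, 0.9, 1.0, 1.0, n_particles, random.Random(100 + r))
            for r in range(n_rep)]
    mean = sum(runs) / n_rep
    return math.sqrt(sum((v - mean) ** 2 for v in runs) / n_rep)

ys = simulate(50, 0.9, 1.0, 1.0, random.Random(0))
sd_5, sd_500 = spread(ys, 5), spread(ys, 500)
```

With this plain filter the estimate is much noisier with 5 particles than with 500, and the noise grows with the number of observations, so any claim about a sufficient number of particles has to be tied to a particular model and data length.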

In this paper we present an algorithm for rapid Bayesian analysis that combines the benefits of nested sampling and artificial neural networks. The blind accelerated multimodal Bayesian inference (BAMBI) algorithm implements the MultiNest package for nested sampling as well as the training of an artificial neural network (NN) to learn the likelihood function. In the case of computationally expensive likelihoods, this allows the substitution of a much more rapid approximation in order to increase significantly the speed of the analysis. We begin by demonstrating, with a few toy examples, the ability of a NN to learn complicated likelihood surfaces. BAMBI’s ability to decrease running time for Bayesian inference is then demonstrated in the context of estimating cosmological parameters from WMAP and other observations. We show that valuable speed increases are achieved in addition to obtaining NNs trained on the likelihood functions for the different model and data combinations. These NNs can then be used for an even faster follow-up analysis using the same likelihood and different priors. This is a fully general algorithm that can be applied, without any pre-processing, to other problems with computationally expensive likelihood functions.

(This is primarily an astronomy paper that uses a sample produced by the nested sampling algorithm MultiNest to build a neural network instead of the model likelihood. The algorithm thus requires the likelihood to be available at some stage.)