## Archive for empirical likelihood

## at the Isaac Newton Institute [talks]

Posted in Statistics with tags ABC algorithm, dynamic model, empirical likelihood, INI, Isaac Newton Institute, non-i.i.d. data, summary statistics, Wasserstein distance on July 7, 2017 by xi'an

**H**ere are the slides I edited this week [from previous talks by Pierre and Epstein] for the INI Workshop on scalable inference, in connection with our recently completed and submitted paper on ABC with Wasserstein distances:

## BimPressioNs [BNP11]

Posted in Books, pictures, Statistics, Travel, University life, Wines with tags École Normale Supérieure, Bayesian nonparametrics, BNP11, empirical likelihood, French cheese, Hamiltonian, IHP, Noel Cressie, NPR, Paris on June 29, 2017 by xi'an

**W**hile my participation in BNP 11 has so far been more at the janitor level [although not gaining George Casella’s reputation on NPR!] than at the scientific one, since we had decided in favour of the least expensive and unstaffed option for coffee breaks, to keep the registration fees at a minimum [although I would have gladly gone all the way to removing all coffee breaks, if only because such breaks produce much garbage!], I had fairly good chats at the second poster session, in particular around empirical likelihoods and HMC for discrete parameters, the first one based on the general Cressie-Read formulation and the second around the recently arXived paper of Nishimura et al., which I wanted to read. Plus many other good chats full stop, around terrific cheese platters!

This morning, the coffee breaks were much more under control and I managed to enjoy [and chair] the entire session on empirical likelihood, with absolutely fantastic talks from Nils Hjort and Art Owen (the third speaker having gone AWOL, possibly a direct consequence of Trump’s travel ban).

## fast ε-free ABC

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags ABC, ABC in Edinburgh, Arthur's Seat, arXiv, Edinburgh, empirical likelihood, Gaussian mixture, neural network, non-parametrics, Scotland on June 8, 2017 by xi'an

**L**ast Fall, George Papamakarios and Iain Murray from Edinburgh arXived an ABC paper on fast ε-free inference on simulation models with Bayesian conditional density estimation, a paper that I missed. The idea there is to approximate the posterior density by maximising the likelihood associated with a parameterised family of distributions on θ, conditional on the associated x. The data being then the ABC reference table. The family chosen there is a mixture of K Gaussian components, whose parameters are then estimated by a (Bayesian) neural network using x as input and θ as output. The parameter values are simulated from an adaptive proposal that aims at approximating the posterior better and better. As in population Monte Carlo, actually. Except for the neural network part, for which I fail to understand why it makes a significant improvement when compared with EM solutions. The overall difficulty with this approach is that I do not see a way out of the curse of dimensionality: when the dimension of θ increases, the approximation to the posterior distribution of θ does deteriorate, even in the best of cases, as with any other non-parametric resolution. It would have been of (further) interest to see a comparison with a most rudimentary approach, namely the one we proposed based on empirical likelihoods.

## HMC sampling in Bayesian empirical likelihood computation

Posted in Statistics with tags ABC, ABCel, approximate inference, Bayesian inference, empirical likelihood, generalised linear models, Hamiltonian Monte Carlo, leapfrog generator on March 31, 2017 by xi'an

While working on the *Series B’log* the other day I noticed this paper by Chaudhuri et al. on Hamiltonian Monte Carlo and empirical likelihood: how exciting!!! Here is the abstract of the paper:

We consider Bayesian empirical likelihood estimation and develop an efficient Hamiltonian Monte Carlo method for sampling from the posterior distribution of the parameters of interest. The method proposed uses hitherto unknown properties of the gradient of the underlying log-empirical-likelihood function. We use results from convex analysis to show that these properties hold under minimal assumptions on the parameter space, prior density and the functions used in the estimating equations determining the empirical likelihood. Our method employs a finite number of estimating equations and observations but produces valid semi-parametric inference for a large class of statistical models including mixed effects models, generalized linear models and hierarchical Bayes models. We overcome major challenges posed by complex, non-convex boundaries of the support routinely observed for empirical likelihood which prevent efficient implementation of traditional Markov chain Monte Carlo methods like random-walk Metropolis–Hastings sampling etc. with or without parallel tempering. A simulation study confirms that our method converges quickly and draws samples from the posterior support efficiently. We further illustrate its utility through an analysis of a discrete data set in small area estimation.

*[The comment is reposted from Series B’log, where I wrote it first.]*

It is of particular interest for me *[disclaimer: I was not involved in the review of this paper!]* as we worked on ABC through empirical likelihood, which is about the reverse of the current paper in terms of motivation: when faced with a complex model, we substitute an empirical likelihood version for the real thing, run simulations from the prior distribution and use the empirical likelihood as a proxy. With possible intricacies when the data is not iid (an issue we also met with Wasserstein distances). In this paper the authors instead consider working on an empirical likelihood as their starting point and derive an HMC algorithm to do so. The idea is striking in that, by nature, an empirical likelihood is not a very smooth object and hence does not seem open to producing gradients and Hessians. As illustrated by Figure 1 in the paper. Which is so spiky at places that one may wonder at the representativity of such graphs.

I have always had a persistent worry about the ultimate validity of treating the empirical likelihood as a genuine likelihood, from the fact that it is the result of an optimisation problem to the issue that the approximate empirical distribution has a finite (data-dependent) support, hence is completely orthogonal to the true distribution. And to the one that the likelihood function is zero outside the convex hull of the defining equations… (For one thing, this empirical likelihood is always bounded by one but this may be irrelevant after all!)

The computational difficulty in handling the empirical likelihood starts with its support. Eliminating values of the parameter for which this empirical likelihood is zero amounts to checking whether zero belongs to the above convex hull. A hard (NP hard?) problem. (Although I do not understand why the authors dismiss the token observations of Owen and others. The argument that Bayesian analysis does more than maximising a likelihood seems to confuse the empirical likelihood as a product of a maximisation step with the empirical likelihood as a function of the parameter that can be used as any other function.)
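To make the scheme concrete, here is a minimal Python sketch of the "simulate from the prior, weight by the empirical likelihood" idea, in the simplest possible setting of a single mean constraint E[X − θ] = 0. This is a toy illustration under stated assumptions, not the code behind the PNAS paper: the function names `log_empirical_likelihood` and `bcel_sample` and the prior sampler passed as an argument are hypothetical. In this scalar case the support check is trivial, since zero belongs to the convex hull of the xᵢ − θ exactly when θ lies strictly between the smallest and largest observations, and the Lagrange multiplier can be found by bisection.

```python
import math
import random


def log_empirical_likelihood(theta, xs):
    """Log empirical likelihood of theta under the constraint E[X - theta] = 0."""
    z = [x - theta for x in xs]
    n = len(z)
    lo_z, hi_z = min(z), max(z)
    # Support check: zero must lie strictly inside the convex hull of the z_i.
    if not (lo_z < 0.0 < hi_z):
        return -math.inf
    # The multiplier lam must keep every 1 + lam * z_i positive.
    lam_lo, lam_hi = -1.0 / hi_z, -1.0 / lo_z
    eps = 1e-10 * (lam_hi - lam_lo)
    a, b = lam_lo + eps, lam_hi - eps

    def score(lam):
        # Decreasing in lam on (lam_lo, lam_hi); its root gives the optimal weights.
        return sum(zi / (1.0 + lam * zi) for zi in z)

    for _ in range(200):  # plain bisection on the monotone score
        mid = 0.5 * (a + b)
        if score(mid) > 0.0:
            a = mid
        else:
            b = mid
    lam = 0.5 * (a + b)
    # Optimal weights are p_i = 1 / (n * (1 + lam * z_i)).
    return -sum(math.log(n * (1.0 + lam * zi)) for zi in z)


def bcel_sample(xs, prior_sampler, n_draws, rng):
    """Draw theta from the prior and weight each draw by its empirical likelihood."""
    thetas = [prior_sampler(rng) for _ in range(n_draws)]
    logw = [log_empirical_likelihood(t, xs) for t in thetas]
    top = max(logw)
    if top == -math.inf:
        raise ValueError("no prior draw falls inside the support")
    w = [math.exp(lw - top) for lw in logw]
    total = sum(w)
    return thetas, [wi / total for wi in w]
```

Calling `bcel_sample(xs, lambda r: r.uniform(-1.0, 2.0), 500, random.Random(0))` returns the prior draws along with normalised weights, with weight zero for every draw outside the support. With vector-valued constraints neither the support check nor the inner optimisation is this simple, which is where the difficulties discussed above start.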

In the simple regression example (pp.297-299), I find the choice of the moment constraints puzzling, in that they address the mean of the white noise (zero) and the covariance with the regressors (zero too). Puzzling because my definition of the regression model is conditional on the regressors and hence does not imply anything on their distribution. In a sense this is another model. But I also note that the approach focuses on the distribution of the reconstituted white noises, as we did in the PNAS paper. (The three examples processed in the paper are all simple and could be processed by regular MCMC, thus making the preliminary step of calling for an empirical likelihood somewhat artificial unless I missed the motivation. The paper also does not seem to discuss the impact of the choice of the moment constraints or the computing constraints involved by a function that is itself the result of a maximisation problem.)

A significant part of the paper is dedicated to the optimisation problem and the exclusion of the points on the boundary. Which sounds like a non-problem in continuous settings. However, this appears to be of importance for running an HMC as it cannot evade the support (without token observations). In principle, HMC should not leave this support since the gradient diverges at the boundary, but in practice the leapfrog approximation may lead the path outside. I would have (naïvely?) suggested to reject moves when this happens and start again but the authors consider that proper choices of the calibration factors of HMC can avoid this problem. Which seems to induce a practical issue by turning the algorithm into an adaptive version.
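The naïve fix can be sketched in a few lines of Python. This is only an illustration of "reject as soon as the leapfrog path exits the support", not the authors' algorithm: the target is a stand-in Beta(3, 2) density on (0, 1), chosen because its log-gradient also diverges at the boundary, and the names `log_target`, `grad_log_target` and `hmc_step` as well as the step size and number of leapfrog steps are assumptions of mine.

```python
import math
import random


def log_target(theta):
    """Unnormalised log density of a Beta(3, 2), standing in for the posterior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return 2.0 * math.log(theta) + math.log(1.0 - theta)


def grad_log_target(theta):
    """Gradient of log_target; it diverges at both ends of the support."""
    return 2.0 / theta - 1.0 / (1.0 - theta)


def hmc_step(theta, step, n_leapfrog, rng):
    """One HMC transition that rejects as soon as the path leaves (0, 1)."""
    p0 = rng.gauss(0.0, 1.0)
    q, p = theta, p0
    p += 0.5 * step * grad_log_target(q)       # initial half step on momentum
    for i in range(n_leapfrog):
        q += step * p                          # full step on position
        if not 0.0 < q < 1.0:
            return theta, False                # outside the support: reject
        if i < n_leapfrog - 1:
            p += step * grad_log_target(q)     # full step on momentum
    p += 0.5 * step * grad_log_target(q)       # final half step on momentum
    log_accept = (log_target(q) - 0.5 * p * p) - (log_target(theta) - 0.5 * p0 * p0)
    if rng.random() < math.exp(min(0.0, log_accept)):
        return q, True
    return theta, False
```

Since the target is zero outside the support, rejecting such paths is the same as computing a Metropolis ratio of zero, so the chain keeps the right stationary distribution. The cost is wasted gradient evaluations whenever the step size is too large near the boundary, which may be why calibration matters so much in the paper.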

As a last point, I would have enjoyed seeing a comparison of the performances against our (A)BCel version, which would have been straightforward to implement in the simple examples handled by the paper. (This could be a neat undergraduate project for next year!)

## Bayesian empirical likelihood

Posted in Books, pictures, Statistics with tags Bayes factor, candidate approximation, Chib's approximation, Chib-Jeliazkov representation, empirical likelihood, exponential tilting, LAN on July 21, 2016 by xi'an

**S**id Chib, Minchul Shin, and Anna Simoni (CREST) recently arXived a paper entitled “Bayesian Empirical Likelihood Estimation and Comparison of Moment Condition Models”. That Sid mentioned to me in Sardinia. The core notion is related to earlier Bayesian forays into empirical likelihood pseudo-models, like Lazar (2005) or our PNAS paper with Kerrie Mengersen and Pierre Pudlo. Namely to build a pseudo-likelihood using empirical likelihood principles and to derive the posterior associated with this pseudo-likelihood. Some novel aspects are the introduction of tolerance (nuisance) extra-parameters when some constraints do not hold, a maximum entropy (or exponentially tilted) representation of the empirical likelihood function, and a Chib-Jeliazkov representation of the marginal likelihood. The authors obtain a Bernstein-von Mises theorem under correct specification. Meaning convergence. And another one under misspecification.

While the above Bernstein-von Mises theory is somewhat expected (if worth deriving) in the light of frequentist consistency results, the paper also considers a novel and exciting aspect, namely to compare models (or rather moment restrictions) by Bayes factors derived from empirical likelihoods. A grand (encompassing) model is obtained by considering all moment restrictions at once, which at first sounds *more* restricted, except that the extra-parameters are there to monitor constraints that actually hold. It is unclear from my cursory read of the paper whether priors on those extra-parameters can be automatically derived from a single prior. And how much they impact the value of the Bayes factor. The consistency results found in the paper do not seem to depend on the form of priors adopted for each model (for all three cases of both correctly, one correctly and none correctly specified models). Except maybe for some local asymptotic normality (LAN). Interestingly (?), the authors consider the Poisson versus Negative Binomial test we used in our testing by mixture paper. This paper is thus bringing a better view of the theoretical properties of a pseudo-Bayesian approach based on moment conditions and empirical likelihood approximations. Without a clear vision of the implementation details, from the parameterisation of the constraints (which could be tested the same way) to the construction of the prior(s) to the handling of MCMC difficulties in realistic models.

## approximate Bayesian inference

Posted in Books, pictures, Statistics, Travel, University life with tags ABC in London, alive particle filter, Bayesian Analysis, Dirichlet prior, empirical likelihood, expectation-propagation, integer time-series, pMCMC, pseudo-marginal MCMC on March 23, 2016 by xi'an

**M**aybe it is just a coincidence, but both most recent issues of Bayesian Analysis have an article featuring approximate Bayesian inference. One is by Daniel Graham and co-authors on Approximate Bayesian Inference for Doubly Robust Estimation, while the other one is by Chris Drovandi and co-authors from QUT on Exact and Approximate Bayesian Inference for Low Integer-Valued Time Series Models with Intractable Likelihoods. The first paper has little connection with ABC. Even though it (a) uses a lot of three-letter acronyms [which does not help with speed reading] and (b) relies on moment based and propensity score models. Instead, it relies on Bayesian bootstrap, which suddenly seems to me to be rather connected with empirical likelihood! Except the weights are estimated via a Dirichlet prior instead of being optimised. The approximation lies in using the bootstrap to derive a posterior predictive. I did not spot any assessment or control of the approximation effect in the paper.
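The connection with empirical likelihood is easy to see in code: both put weights on the observed data points, but the Bayesian bootstrap draws them from a flat Dirichlet distribution where the empirical likelihood optimises them under constraints. Below is a minimal Python sketch of the generic Bayesian bootstrap for a mean, not of the doubly robust procedure in the paper; the function names `dirichlet_weights` and `bayesian_bootstrap_mean` are hypothetical.

```python
import random


def dirichlet_weights(n, rng):
    """One draw from a flat Dirichlet(1, ..., 1) via normalised Gamma(1, 1) variates."""
    g = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    total = sum(g)
    return [gi / total for gi in g]


def bayesian_bootstrap_mean(xs, n_draws, rng):
    """Draws of the mean functional, one per random reweighting of the data."""
    draws = []
    for _ in range(n_draws):
        w = dirichlet_weights(len(xs), rng)
        draws.append(sum(wi * xi for wi, xi in zip(w, xs)))
    return draws
```

Each element of `bayesian_bootstrap_mean(xs, 4000, random.Random(2))` is a weighted average of the observations, so the draws never leave the range of the data, the same finite-support feature that worries me with empirical likelihoods.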

“Note that we are always using the full data so avoiding the need to choose a summary statistic” (p.326)

The second paper connects pMCMC with ABC. Plus pseudo-marginals on the side! And even simplified reversible jump MCMC!!! I am far from certain I got every point of the paper, though, especially the notion of dimension reduction associated with *this* version of reversible jump MCMC. It may mean that latent variables are integrated out in approximate (marginalised) likelihoods [as explicated in Andrieu and Roberts (2009)].

“The difference with the common ABC approach is that we match on observations one-at-a-time” (p.328)

The model that the authors study is an integer-valued time series, like the INAR(p) model. Its integer support allows for a non-zero probability of exact matching between simulated and observed data. One-at-a-time as indicated in the above quote. And integer valued tolerances like ε=1 otherwise. In the case auxiliary variables are necessary, the authors resort to the alive particle filter of Jasra et al. (2013), whose main point is to produce an unbiased estimate of the (possibly approximate) likelihood, to be exploited by pseudo-marginal techniques. However, unbiasedness sounds less compelling when moving to approximate methods, as illustrated by the subsequent suggestion to use a more stable estimate of the log-likelihood. In fact, when the tolerance ε is positive, the pMCMC acceptance probability looks quite close to an ABC-MCMC probability when relying on several pseudo-data simulations. Which is unbiased for the “right” approximate target. A fact that may actually hold for *all* ABC algorithms. One quite interesting aspect of the paper is its reflection about the advantage of pseudo-marginal techniques for RJMCMC algorithms since they allow for trans-dimension moves to be simplified, as they consider marginals on the space of interest. Up to this day, I had not realised Andrieu and Roberts (2009) had a section on this aspect… I am still unclear about the derivation of the posterior probabilities of the models under comparison, unless it is a byproduct of the RJMCMC algorithm. A last point is that, for some of the Markov models used in the paper, the pseudo observations can be produced as a random one-time move away from the current true observation, which makes life much easier for ABC and explains why exact simulations can sometimes be produced.

(*A side note:* the authors mention on p.326 that EP is only applicable when the posterior is from an exponential family, while my understanding is that it uses an exponential family to approximate the true posterior.)
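The one-at-a-time matching is simple to sketch for a fully observed INAR(1) model, X_t = α∘X_{t−1} + ε_t with binomial thinning and Poisson innovations. The Python code below is a simplified illustration under my own assumptions, not the authors' method: it uses a fixed number of simulations per transition instead of the alive particle filter, and the names `poisson`, `inar1_step` and `matching_likelihood` are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Poisson(lam) draw by Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def inar1_step(x_prev, alpha, lam, rng):
    """One INAR(1) transition: binomial thinning of x_prev plus a Poisson innovation."""
    survivors = sum(1 for _ in range(x_prev) if rng.random() < alpha)
    return survivors + poisson(lam, rng)


def matching_likelihood(xs, alpha, lam, n_sims, rng, tol=0):
    """Product over t of the frequency of simulated moves within tol of x_t."""
    estimate = 1.0
    for x_prev, x_next in zip(xs, xs[1:]):
        # Every pseudo observation is a one-step move away from the true x_{t-1}.
        hits = sum(
            1
            for _ in range(n_sims)
            if abs(inar1_step(x_prev, alpha, lam, rng) - x_next) <= tol
        )
        estimate *= hits / n_sims
    return estimate
```

With `tol=0` each factor is an unbiased estimate of a transition probability and the factors are independent, so the product is unbiased for the exact likelihood (conditional on the first observation); with `tol=1` it is unbiased for the ε=1 approximation instead. The drawback of fixing `n_sims` is that the estimate is exactly zero as soon as one transition finds no match.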

## bootstrap(ed) likelihood for ABC

Posted in pictures, Statistics with tags ABCel, BCbl, bootstrap, bootstrap likelihood, empirical likelihood, Ising model, Madrid, summary statistics, Universidad Carlos III de Madrid on November 6, 2015 by xi'an

**T**his recently arXived paper by Weixuan Zhu, Juan Miguel Marín, and Fabrizio Leisen proposes an alternative to our empirical likelihood ABC paper of 2013, or BCel. Besides the mostly personal appeal for me to report on a Juan Miguel Marín working [in Madrid] on ABC topics, along with my friend Jean-Michel Marin!, this paper is another entry on ABC that connects with yet another statistical perspective, namely bootstrap. The proposal, called BCbl, is based on a reference paper by Davison, Hinkley and Worton (1992) which defines a *bootstrap likelihood*, a notion that relies on a double-bootstrap step to produce a non-parametric estimate of the distribution of a given estimator of the parameter θ. This estimate includes a smooth curve-fitting algorithm step, for which little description is available from the current paper. The bootstrap non-parametric substitute then plays the role of the actual likelihood, with no correction for the substitution just as in our BCel. Both approaches are convergent, with Monte Carlo simulations exhibiting similar or even identical convergence speeds although [unsurprisingly!] no deep theory is available on the comparative advantage.

An important issue from my perspective is that, while the empirical likelihood approach relies on a choice of identifying constraints that strongly impact the numerical value of the likelihood approximation, the bootstrap version starts directly from a subjectively chosen estimator of θ, which may also impact the numerical value of the likelihood approximation. In some ABC settings, finding a primary estimator of θ may be a real issue or a computational burden. Except when using a preliminary ABC step as in semi-automatic ABC. This would be an interesting crash-test for the BCbl proposal! (This would not necessarily increase the computational cost by a large amount.) In addition, I am not sure the method easily extends to larger collections of summary statistics as those used in ABC, in particular because it necessarily relies on non-parametric estimates, only operating in small enough dimensions where smooth curve-fitting algorithms can be used. Critically, the paper only processes examples with a few parameters.

The comparisons between BCel and BCbl that are produced in the paper show some gain towards BCbl. Obviously, it depends on the respective calibrations of the non-parametric methods and of regular ABC, as well as on the available computing time. I find the population genetic example somewhat puzzling: The paper refers to our composite likelihood to set the moment equations. Since this is a pseudo-likelihood, I wonder how the authors do select their parameter estimates in the double-bootstrap experiment. And for the Ising model, it is not straightforward to conceive of a bootstrap algorithm on an Ising model: (a) how does one subsample pixels and (b) what are the validity guarantees for the estimation procedure.