## easy-to-use empirical likelihood ABC

Posted in Statistics, University life with tags , , , , , , , on October 23, 2018 by xi'an

A newly arXived paper from a group of researchers at NUS I wish we had discussed when I was there last month. As we wrote this empirical ABCe paper in PNAS with Kerrie Mengersen and Pierre Pudlo in 2012. Plus the SAME paper with Arnaud Doucet and Simon Godsill ten years earlier, which the authors prefer to call data cloning in continuation of the more recent Lele et al. (2007). They could actually have used my original denomination of prior feedback (1992? I remember presenting the idea at Camp Casella in Cornell that summer) as well! Actually, I am not certain invoking prior feedback is quite necessary since this is a form of simulated method of moments as well.

Now, did we really assume that some moments of the distribution were analytically available, although the likelihood was not?! Even before going through the paper, it dawned on me that these theoretical moments could have been simulated instead, since the model is a generative one: for a given parameter value, a direct Monte Carlo approximation to the exact moment can be produced and can serve as a constraint for the empirical likelihood definition. I am surprised and aggrieved that we would not think of this empirical likelihood version of a method of moments. Which is central to the current paper. In the sense that, were the parameter exact, the differences between the moments based on the actual data x⁰ and the moments based on m replicas of the simulated data x¹,x²,… have mean zero, meaning the moment constraint is immediately available. Meaning an empirical likelihood is easily constructed, replacing the actual likelihood in an MCMC scheme, albeit at a rather high computing cost. Congratulations to the authors for uncovering this possibility that we missed!

“The summary statistics in this example were judiciously chosen.”

One point in the paper on which I disagree with the authors is the argument that MCMC sampling based on an empirical likelihood can be seen as an implementation of the pseudo-marginal Metropolis-Hastings method. The major difference in my opinion is that there is no unbiasedness here (and no generic result that indicates convergence to the exact posterior as the number of simulations grows to infinity). The other point unclear to me is about the selection of summaries [or moments] for implementing the method, which seems to be based on their performances in the subsequent estimation, performances that are hard to assess properly in intractable likelihood cases. In the last example of stereological extremes (not covered in our paper), for instance, the output is compared with the parallel synthetic likelihood result.

## interdependent Gibbs samplers

Posted in Books, Statistics, University life with tags , , , , , , on April 27, 2018 by xi'an

Mark Kozdoba and Shie Mannor just arXived a paper on an approach to accelerate a Gibbs sampler. With title “interdependent Gibbs samplers“. In fact, it presents rather strong similarities with our SAME algorithm. More of the same, as Adam Johanssen (Warwick) entitled one of his papers! The paper indeed suggests multiplying replicas of latent variables (e.g., an hidden path for an HMM) in an artificial model. And as in our 2002 paper, with Arnaud Doucet and Simon Godsill, the focus here is on maximum likelihood estimation (of the genuine parameters, not of the latent variables). Along with argument that the resulting pseudo-posterior is akin to a posterior with a powered likelihood. And a link with the EM algorithm. And an HMM application.

“The generative model consist of simply sampling the parameters ,  and then sampling m independent copies of the paths”

If anything this proposal is less appealing than SAME because it aims directly at the powered likelihood, rather than utilising an annealed sequence of powers that allows for a primary exploration of the whole parameter space before entering the trapping vicinity of a mode. Which makes me fail to catch the argument from the authors that this improves Gibbs sampling, as a more acute mode has on the opposite the dangerous feature of preventing visits to other modes. Hence the relevance to resort to some form of annealing.

As already mused upon in earlier posts, I find it most amazing that this technique has been re-discovered so many times, both in statistics and in adjacent fields. The idea of powering the likelihood with independent copies of the latent variables is obviously natural (since a version pops up every other year, always under a different name), but earlier versions should eventually saturate the market!

## approximate maximum likelihood estimation using data-cloning ABC

Posted in Books, Statistics, University life with tags , , , , , , , , on June 2, 2015 by xi'an

“By accepting of having obtained a poor approximation to the posterior, except for the location of its main mode, we switch to maximum likelihood estimation.”

Presumably the first paper ever quoting from the ‘Og! Indeed, Umberto Picchini arXived a paper about a technique merging ABC with prior feedback (rechristened data cloning by S. Lele), where a maximum likelihood estimate is produced by an ABC-MCMC algorithm. For state-space models. This relates to an earlier paper by Fabio Rubio and Adam Johansen (Warwick), who also suggested using ABC to approximate the maximum likelihood estimate. Here, the idea is to use an increasing number of replicates of the latent variables, as in our SAME algorithm, to spike the posterior around the maximum of the (observed) likelihood. An ABC version of this posterior returns a mean value as an approximate maximum likelihood estimate.

“This is a so-called “likelihood-free” approach [Sisson and Fan, 2011], meaning that knowledge of the complete expression for the likelihood function is not required.”

The above remark is sort of inappropriate in that it applies to a non-ABC setting where the latent variables are simulated from the exact marginal distributions, that is, unconditional on the data, and hence their density cancels in the Metropolis-Hastings ratio. This pre-dates ABC by a few years, since this was an early version of particle filter.

“In this work we are explicitly avoiding the most typical usage of ABC, where the posterior is conditional on summary statistics of data S(y), rather than y.”

Another point I find rather negative in that, for state-space models, using the entire time-series as a “summary statistic” is unlikely to produce a good approximation.

The discussion on the respective choices of the ABC tolerance δ and on the prior feedback number of copies K is quite interesting, in that Umberto Picchini suggests setting δ first before increasing the number of copies. However, since the posterior gets more and more peaked as K increases, the consequences on the acceptance rate of the related ABC algorithm are unclear. Another interesting feature is that the underlying MCMC proposal on the parameter θ is an independent proposal, tuned during the warm-up stage of the algorithm. Since the tuning is repeated at each temperature, there are some loose ends as to whether or not it is a genuine Markov chain method. The same question arises when considering that additional past replicas need to be simulated when K increases. (Although they can be considered as virtual components of a vector made of an infinite number of replicas, to be used when needed.)

The simulation study involves a regular regression with 101 observations, a stochastic Gompertz model studied by Sophie Donnet, Jean-Louis Foulley, and Adeline Samson in 2010. With 12 points. And a simple Markov model. Again with 12 points. While the ABC-DC solutions are close enough to the true MLEs whenever available, a comparison with the cheaper ABC Bayes estimates would have been of interest as well.

## hierarchical models are not Bayesian models

Posted in Books, Kids, Statistics, University life with tags , , , , , , , on February 18, 2015 by xi'an

When preparing my OxWaSP projects a few weeks ago, I came perchance on a set of slides, entitled “Hierarchical models are not Bayesian“, written by Brian Dennis (University of Idaho), where the author argues against Bayesian inference in hierarchical models in ecology, much in relation with the previously discussed paper of Subhash Lele. The argument is the same, namely a possibly major impact of the prior modelling on the resulting inference, in particular when some parameters are hardly identifiable, the more when the model is complex and when there are many parameters. And that “data cloning” being available since 2007, frequentist methods have “caught up” with Bayesian computational abilities.

Let me remind the reader that “data cloning” means constructing a sequence of Bayes estimators corresponding to the data being duplicated (or cloned) once, twice, &tc., until the point estimator stabilises. Since this corresponds to using increasing powers of the likelihood, the posteriors concentrate more and more around the maximum likelihood estimator. And even recover the Hessian matrix. This technique is actually older than 2007 since I proposed it in the early 1990’s under the name of prior feedback, with earlier occurrences in the literature like D’Epifanio (1989) and even the discussion of Aitkin (1991). A more efficient version of this approach is the SAME algorithm we developed in 2002 with Arnaud Doucet and Simon Godsill where the power of the likelihood is increased during iterations in a simulated annealing version (with a preliminary version found in Duflo, 1996).

I completely agree with the author that a hierarchical model does not have to be Bayesian: when the random parameters in the model are analysed as sources of additional variations, as for instance in animal breeding or ecology, and integrated out, the resulting model can be analysed by any statistical method. Even though one may wonder at the motivations for selecting this particular randomness structure in the model. And at an increasing blurring between what is prior modelling and what is sampling modelling as the number of levels in the hierarchy goes up. This rather amusing set of slides somewhat misses a few points, in particular the ability of data cloning to overcome identifiability and multimodality issues. Indeed, as with all simulated annealing techniques, there is a practical difficulty in avoiding the fatal attraction of a local mode using MCMC techniques. There are thus high chances data cloning ends up in the “wrong” mode. Moreover, when the likelihood is multimodal, it is a general issue to decide which of the modes is most relevant for inference. In which sense is the MLE more objective than a Bayes estimate, then? Further, the impact of a prior on some aspects of the posterior distribution can be tested by re-running a Bayesian analysis with different priors, including empirical Bayes versions or, why not?!, data cloning, in order to understand where and why huge discrepancies occur. This is part of model building, in the end.

## Is non-informative Bayesian analysis dangerous for wildlife???

Posted in Books, pictures, Statistics, University life with tags , , on February 12, 2015 by xi'an

Subhash Lele recently arXived a short paper entitled “Is non-informative Bayesian analysis appropriate for wildlife management: survival of San Joaquin Kit fox and declines in amphibian populations”. (Lele has been mentioned several times on this blog in connection with his data-cloning approach that mostly clones our own SAME algorithm.)

“The most commonly used non-informative priors are either the uniform priors or the priors with very large variances spreading the probability mass almost uniformly over the entire parameter space.”

The main goal of the paper is to warn, even better “to disabuse the ecologists of the notion that there is no difference between non-informative Bayesian inference and likelihood-based inference and that the philosophical underpinnings of statistical inference are irrelevant to practice.” The argument advanced by Lele is simply that two different parametrisations should lead to two compatible priors and that, if they do not not, this exhibits an unacceptable impact of the prior modelling on the resulting inference, while likelihood-based inference [obviously] does not depend on parametrisation.

The first example in the paper is a dynamic linear model of a fox population series when using a uniform U(0,1) prior on a parameter b against a Ga(100,100) prior on -a/b. (The normal prior a is the same on both.) I do not find the opposition between the two posteriors in the least surprising as the modelling starts by assuming different supports on the parameter b. And both are highly “informative” in that there is no intrinsic constraint on b that could justify the (0,1) support, as illustrated by the second choice when b is unconstrained, varying on (-15,15) or (-0.0015,0.0015) depending on how the Ga(100,100) prior is parametrised.

The second model is even simpler as it involves one Bernoulli probability p for the observations, plus a second Bernoulli driving replicates when the first Bernoulli variate is one, i.e.,

$Y_i\sim \mathfrak{B}(p)\qquad O_{ij}|Y_i=1\sim \mathfrak{B}(q)$

and the paper opposes a uniform prior on p,q to a normal N(0,10^3) prior on the logit transforms of p and q. [With an obvious typo at the top of page 10.] As shown on the above graph, the two priors on p are immensely different, so should lead to different posteriors in a weakly informative setting as a Bernoulli experiment. Even with a few hundred individuals. A somewhat funny aspect of this study is that Lele opposes the uniform prior to the Jeffreys Be(.5,.5) prior as being “nowhere close to looking like what one would consider a non-informative prior”, without noticing that the logit parametrisation normal prior leads to an even more peaked prior…

“Even when Jeffreys prior can be computed, it will be difficult to sell this prior as an objective prior to the jurors or the senators on the committee. The construction of Jeffreys and other objective priors for multi-parameter models poses substantial mathematical difficulties.”

I find it rather surprising that a paper can be dedicated to the comparison of two arbitrary prior distributions on two fairly simplistic models towards the global conclusion that “non-informative priors neither ‘let the data speak’ nor do they correspond (even roughly) to likelihood analysis.” In this regard, the earlier critical analysis of Seaman et al., to which my PhD student Kaniav Kamary and I replied, had a broader scope.

## ABC for design

Posted in Statistics with tags , , , , , , , on August 30, 2013 by xi'an

I wrote a comment on this arXived paper on simulation based design that starts from Müller (1999) and gets an ABC perspective a while ago on my iPad when travelling to Montpellier and then forgot to download it…

Hainy, [Wener] Müller, and Wagner recently arXived a paper called “Likelihood-free Simulation-based Optimal Design“, paper which relies on ABC to construct optimal designs . Remember that [Peter] Müller (1999) uses a natural simulated annealing that is quite similar to our MAP [SAME] algorithm with Arnaud Doucet and Simon Godsill, relying on multiple versions of the data set to get to the maximum. The paper also builds upon our 2006 JASA paper with my then PhD student Billy Amzal, Eric Parent, and Frederic Bois, paper that took advantage of the then emerging particle methods to improve upon a static horizon target. While our method is sequential in that it pursues a moving target, it does not rely on the generic methodology developed by del Moral et al. (2006), where a backward kernel brings more stability to the moves. The paper also implements a version of our population Monte Carlo ABC algorithm (Beaumont et al., 2009), as a first step before an MCMC simulation. Overall, the paper sounds more like a review than like a strongly directive entry into ABC based design in that it remains quite generic. Not that I have specific suggestions, mind!, but I fear a realistic implementation (as opposed to the linear model used in the paper) would require a certain amount of calibration. There are missing references of recent papers using ABC for design, including some by Michael Stumpf I think.

I did not know about the Kuck et al. reference… Which is reproducing our 2006 approach within the del Moral framework. It uses a continuous temperature scale that I find artificial and not that useful, again a maybe superficial comment as I didn’t get very much into the paper … Just that integer powers lead to multiples of the sample and have a nice algorithmic counterpart.

## optimisation via slice sampling

Posted in Statistics with tags , , , , on December 20, 2012 by xi'an

This morning, over breakfast; I read the paper recently arXived by John Birge et Nick Polson. I was intrigued by the combination of optimisation and of slice sampling, but got a wee disappointed by the paper, in that it proposes a simple form of simulated annealing, minimising f(x) by targeting a small collection of energy functions exp{-τf(x)}. Indeed, the slice sampler is used to explore each of those targets, i.e. for a fixed temperature τ. For the four functions considered in the paper, a slice sampler can indeed be implemented, but this feature could be seen as a marginalia, given that a random walk Metropolis-Hastings algorithm could be used as a proposal mechanism in other cases. The other intriguing fact is that annealing is not used in the traditional way of increasing the coefficient τ along iterations (as in our SAME algorithm), for which convergence issues are much more intricate, but instead stays at the level where a whole (Markov) sample is used for each temperature, the outcomes being then compared in terms of localisation (and maybe for starting at the next temperature value). I did not see any discussion about the selection of the successive temperatures, which usually is a delicate issue in realistic settings, nor of the stopping rule(s) used to decide the maximum has been reached.