Archive for data cloning

easy-to-use empirical likelihood ABC

Posted in Statistics, University life with tags , , , , , , , on October 23, 2018 by xi'an

A newly arXived paper from a group of researchers at NUS I wish we had discussed when I was there last month. As we wrote this empirical ABCe paper in PNAS with Kerrie Mengersen and Pierre Pudlo in 2012. Plus the SAME paper with Arnaud Doucet and Simon Godsill ten years earlier, which the authors prefer to call data cloning in continuation of the more recent Lele et al. (2007). They could actually have used my original denomination of prior feedback (1992? I remember presenting the idea at Camp Casella in Cornell that summer) as well! Actually, I am not certain invoking prior feedback is quite necessary since this is a form of simulated method of moments as well.

Now, did we really assume that some moments of the distribution were analytically available, although the likelihood was not?! Even before going through the paper, it dawned on me that these theoretical moments could have been simulated instead, since the model is a generative one: for a given parameter value, a direct Monte Carlo approximation to the exact moment can be produced and can serve as a constraint for the empirical likelihood definition. I am surprised and aggrieved that we would not think of this empirical likelihood version of a method of moments. Which is central to the current paper. In the sense that, were the parameter exact, the differences between the moments based on the actual data x⁰ and the moments based on m replicas of the simulated data x¹,x²,… have mean zero, meaning the moment constraint is immediately available. Meaning an empirical likelihood is easily constructed, replacing the actual likelihood in an MCMC scheme, albeit at a rather high computing cost. Congratulations to the authors for uncovering this possibility that we missed!

“The summary statistics in this example were judiciously chosen.”

One point in the paper on which I disagree with the authors is the argument that MCMC sampling based on an empirical likelihood can be seen as an implementation of the pseudo-marginal Metropolis-Hastings method. The major difference in my opinion is that there is no unbiasedness here (and no generic result that indicates convergence to the exact posterior as the number of simulations grows to infinity). The other point unclear to me is about the selection of summaries [or moments] for implementing the method, which seems to be based on their performances in the subsequent estimation, performances that are hard to assess properly in intractable likelihood cases. In the last example of stereological extremes (not covered in our paper), for instance, the output is compared with the parallel synthetic likelihood result.

approximate maximum likelihood estimation using data-cloning ABC

Posted in Books, Statistics, University life with tags , , , , , , , , on June 2, 2015 by xi'an

“By accepting of having obtained a poor approximation to the posterior, except for the location of its main mode, we switch to maximum likelihood estimation.”

Presumably the first paper ever quoting from the ‘Og! Indeed, Umberto Picchini arXived a paper about a technique merging ABC with prior feedback (rechristened data cloning by S. Lele), where a maximum likelihood estimate is produced by an ABC-MCMC algorithm. For state-space models. This relates to an earlier paper by Fabio Rubio and Adam Johansen (Warwick), who also suggested using ABC to approximate the maximum likelihood estimate. Here, the idea is to use an increasing number of replicates of the latent variables, as in our SAME algorithm, to spike the posterior around the maximum of the (observed) likelihood. An ABC version of this posterior returns a mean value as an approximate maximum likelihood estimate.

“This is a so-called “likelihood-free” approach [Sisson and Fan, 2011], meaning that knowledge of the complete expression for the likelihood function is not required.”

The above remark is sort of inappropriate in that it applies to a non-ABC setting where the latent variables are simulated from the exact marginal distributions, that is, unconditional on the data, and hence their density cancels in the Metropolis-Hastings ratio. This pre-dates ABC by a few years, since this was an early version of particle filter.

“In this work we are explicitly avoiding the most typical usage of ABC, where the posterior is conditional on summary statistics of data S(y), rather than y.”

Another point I find rather negative in that, for state-space models, using the entire time-series as a “summary statistic” is unlikely to produce a good approximation.

The discussion on the respective choices of the ABC tolerance δ and on the prior feedback number of copies K is quite interesting, in that Umberto Picchini suggests setting δ first before increasing the number of copies. However, since the posterior gets more and more peaked as K increases, the consequences on the acceptance rate of the related ABC algorithm are unclear. Another interesting feature is that the underlying MCMC proposal on the parameter θ is an independent proposal, tuned during the warm-up stage of the algorithm. Since the tuning is repeated at each temperature, there are some loose ends as to whether or not it is a genuine Markov chain method. The same question arises when considering that additional past replicas need to be simulated when K increases. (Although they can be considered as virtual components of a vector made of an infinite number of replicas, to be used when needed.)

The simulation study involves a regular regression with 101 observations, a stochastic Gompertz model studied by Sophie Donnet, Jean-Louis Foulley, and Adeline Samson in 2010. With 12 points. And a simple Markov model. Again with 12 points. While the ABC-DC solutions are close enough to the true MLEs whenever available, a comparison with the cheaper ABC Bayes estimates would have been of interest as well.

hierarchical models are not Bayesian models

Posted in Books, Kids, Statistics, University life with tags , , , , , , , on February 18, 2015 by xi'an

When preparing my OxWaSP projects a few weeks ago, I came perchance on a set of slides, entitled “Hierarchical models are not Bayesian“, written by Brian Dennis (University of Idaho), where the author argues against Bayesian inference in hierarchical models in ecology, much in relation with the previously discussed paper of Subhash Lele. The argument is the same, namely a possibly major impact of the prior modelling on the resulting inference, in particular when some parameters are hardly identifiable, the more when the model is complex and when there are many parameters. And that “data cloning” being available since 2007, frequentist methods have “caught up” with Bayesian computational abilities.

Let me remind the reader that “data cloning” means constructing a sequence of Bayes estimators corresponding to the data being duplicated (or cloned) once, twice, &tc., until the point estimator stabilises. Since this corresponds to using increasing powers of the likelihood, the posteriors concentrate more and more around the maximum likelihood estimator. And even recover the Hessian matrix. This technique is actually older than 2007 since I proposed it in the early 1990’s under the name of prior feedback, with earlier occurrences in the literature like D’Epifanio (1989) and even the discussion of Aitkin (1991). A more efficient version of this approach is the SAME algorithm we developed in 2002 with Arnaud Doucet and Simon Godsill where the power of the likelihood is increased during iterations in a simulated annealing version (with a preliminary version found in Duflo, 1996).

I completely agree with the author that a hierarchical model does not have to be Bayesian: when the random parameters in the model are analysed as sources of additional variations, as for instance in animal breeding or ecology, and integrated out, the resulting model can be analysed by any statistical method. Even though one may wonder at the motivations for selecting this particular randomness structure in the model. And at an increasing blurring between what is prior modelling and what is sampling modelling as the number of levels in the hierarchy goes up. This rather amusing set of slides somewhat misses a few points, in particular the ability of data cloning to overcome identifiability and multimodality issues. Indeed, as with all simulated annealing techniques, there is a practical difficulty in avoiding the fatal attraction of a local mode using MCMC techniques. There are thus high chances data cloning ends up in the “wrong” mode. Moreover, when the likelihood is multimodal, it is a general issue to decide which of the modes is most relevant for inference. In which sense is the MLE more objective than a Bayes estimate, then? Further, the impact of a prior on some aspects of the posterior distribution can be tested by re-running a Bayesian analysis with different priors, including empirical Bayes versions or, why not?!, data cloning, in order to understand where and why huge discrepancies occur. This is part of model building, in the end.

Is non-informative Bayesian analysis dangerous for wildlife???

Posted in Books, pictures, Statistics, University life with tags , , on February 12, 2015 by xi'an

Subhash Lele recently arXived a short paper entitled “Is non-informative Bayesian analysis appropriate for wildlife management: survival of San Joaquin Kit fox and declines in amphibian populations”. (Lele has been mentioned several times on this blog in connection with his data-cloning approach that mostly clones our own SAME algorithm.)

“The most commonly used non-informative priors are either the uniform priors or the priors with very large variances spreading the probability mass almost uniformly over the entire parameter space.”

The main goal of the paper is to warn, even better “to disabuse the ecologists of the notion that there is no difference between non-informative Bayesian inference and likelihood-based inference and that the philosophical underpinnings of statistical inference are irrelevant to practice.” The argument advanced by Lele is simply that two different parametrisations should lead to two compatible priors and that, if they do not not, this exhibits an unacceptable impact of the prior modelling on the resulting inference, while likelihood-based inference [obviously] does not depend on parametrisation.

The first example in the paper is a dynamic linear model of a fox population series when using a uniform U(0,1) prior on a parameter b against a Ga(100,100) prior on -a/b. (The normal prior a is the same on both.) I do not find the opposition between the two posteriors in the least surprising as the modelling starts by assuming different supports on the parameter b. And both are highly “informative” in that there is no intrinsic constraint on b that could justify the (0,1) support, as illustrated by the second choice when b is unconstrained, varying on (-15,15) or (-0.0015,0.0015) depending on how the Ga(100,100) prior is parametrised.

leleThe second model is even simpler as it involves one Bernoulli probability p for the observations, plus a second Bernoulli driving replicates when the first Bernoulli variate is one, i.e.,

Y_i\sim \mathfrak{B}(p)\qquad O_{ij}|Y_i=1\sim \mathfrak{B}(q)

and the paper opposes a uniform prior on p,q to a normal N(0,10^3) prior on the logit transforms of p and q. [With an obvious typo at the top of page 10.] As shown on the above graph, the two priors on p are immensely different, so should lead to different posteriors in a weakly informative setting as a Bernoulli experiment. Even with a few hundred individuals. A somewhat funny aspect of this study is that Lele opposes the uniform prior to the Jeffreys Be(.5,.5) prior as being “nowhere close to looking like what one would consider a non-informative prior”, without noticing that the logit parametrisation normal prior leads to an even more peaked prior…

“Even when Jeffreys prior can be computed, it will be difficult to sell this prior as an objective prior to the jurors or the senators on the committee. The construction of Jeffreys and other objective priors for multi-parameter models poses substantial mathematical difficulties.”

I find it rather surprising that a paper can be dedicated to the comparison of two arbitrary prior distributions on two fairly simplistic models towards the global conclusion that “non-informative priors neither ‘let the data speak’ nor do they correspond (even roughly) to likelihood analysis.” In this regard, the earlier critical analysis of Seaman et al., to which my PhD student Kaniav Kamary and I replied, had a broader scope.

SAME but different

Posted in Statistics, University life with tags , , , , , , , , , on October 27, 2014 by xi'an

MumbaiAfter several clones of our SAME algorithm appeared in the literature, it is rather fun to see another paper acknowledging the connection. SAME but different was arXived today by Zhao, Jiang and Canny. The point of this short paper is to show that the parallel implementation of SAME leads to efficient performances compared with existing standards. Since the duplicated latent variables are independent [given θ] they can be simulated in parallel. They further assume independence between the components of those latent variables. And finite support. As in document analysis. So they can sample the replicated latent variables all at once. Parallelism is thus used solely for the components of the latent variable(s). SAME is normally associated with an annealing schedule but the authors could not detect an improvement over a fixed and large number of replications. They reported gains comparable to state-of-the-art variational Bayes on two large datasets. Quite fun to see SAME getting a new life thanks to computer scientists!