approximate maximum likelihood estimation using data-cloning ABC
“By accepting of having obtained a poor approximation to the posterior, except for the location of its main mode, we switch to maximum likelihood estimation.”
Presumably the first paper ever quoting from the ‘Og! Indeed, Umberto Picchini arXived a paper about a technique merging ABC with prior feedback (rechristened data cloning by S. Lele), where a maximum likelihood estimate is produced by an ABC-MCMC algorithm. For state-space models. This relates to an earlier paper by Fabio Rubio and Adam Johansen (Warwick), who also suggested using ABC to approximate the maximum likelihood estimate. Here, the idea is to use an increasing number of replicates of the latent variables, as in our SAME algorithm, to spike the posterior around the maximum of the (observed) likelihood. An ABC version of this posterior returns a mean value as an approximate maximum likelihood estimate.
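As a toy illustration of the spiking effect described above (not an example taken from the paper), consider the cloned posterior π_K(θ|y) ∝ π(θ) L(θ|y)^K in a conjugate setting where it is available in closed form. The Beta-binomial model, the prior hyperparameters a and b, and the function name below are all assumptions chosen for convenience.

```python
def cloned_beta_posterior(y, n, K, a=2.0, b=5.0):
    """Mean and variance of the K-clone posterior for a binomial model.

    Assumed toy setting: y successes out of n trials, Beta(a, b) prior.
    Raising the likelihood to the power K keeps conjugacy, so the cloned
    posterior is Beta(a + K*y, b + K*(n - y)).
    """
    alpha = a + K * y
    beta = b + K * (n - y)
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, var


y, n = 7, 20
mle = y / n  # binomial MLE of the success probability

results = {K: cloned_beta_posterior(y, n, K) for K in (1, 10, 100, 1000)}
for K, (mean, var) in results.items():
    # The mean drifts from the prior-influenced value towards the MLE,
    # while the variance shrinks roughly like 1/K.
    print(f"K={K:5d}  mean={mean:.4f}  K*var={K * var:.5f}  (MLE={mle:.4f})")
```

In this toy case the posterior mean approaches the MLE whatever the prior, and K times the posterior variance settles near the usual asymptotic variance of the MLE, which is how the data-cloning literature typically recovers standard errors.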
“This is a so-called “likelihood-free” approach [Sisson and Fan, 2011], meaning that knowledge of the complete expression for the likelihood function is not required.”
The above remark is sort of inappropriate in that it applies to a non-ABC setting where the latent variables are simulated from their exact marginal distributions, that is, unconditional on the data, and hence their density cancels in the Metropolis-Hastings ratio. This pre-dates ABC by a few years, since this was an early version of the particle filter.
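The cancellation mentioned here can be sketched in generic notation. The symbols are assumptions for illustration and not necessarily the paper's: x denotes the latent variables, π(θ) the prior, and q the proposal on θ.

```latex
\begin{align*}
% Joint target over the parameter and the latent variables
\pi(\theta, x \mid y) &\propto \pi(\theta)\, p(x \mid \theta)\, p(y \mid x, \theta) \\
% Proposal: move theta with q, then simulate x' from the model, ignoring y
(\theta', x') &\sim q(\theta' \mid \theta)\, p(x' \mid \theta') \\
% Metropolis-Hastings ratio: the latent densities cancel
\frac{\pi(\theta')\, p(x' \mid \theta')\, p(y \mid x', \theta')}
     {\pi(\theta)\, p(x \mid \theta)\, p(y \mid x, \theta)}
\times
\frac{q(\theta \mid \theta')\, p(x \mid \theta)}
     {q(\theta' \mid \theta)\, p(x' \mid \theta')}
&=
\frac{\pi(\theta')\, p(y \mid x', \theta')\, q(\theta \mid \theta')}
     {\pi(\theta)\, p(y \mid x, \theta)\, q(\theta' \mid \theta)}
\end{align*}
```

Only the observation density p(y|x,θ) has to be evaluated; the latent process merely has to be simulated. The resulting chain still targets the exact posterior, with no tolerance and no pseudo-data involved, which is why the "likelihood-free" label fits ABC proper better than this construction.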
“In this work we are explicitly avoiding the most typical usage of ABC, where the posterior is conditional on summary statistics of data S(y), rather than y.”
Another point I find rather negative is that, for state-space models, using the entire time-series as a “summary statistic” is unlikely to produce a good approximation.
The discussion on the respective choices of the ABC tolerance δ and on the prior feedback number of copies K is quite interesting, in that Umberto Picchini suggests setting δ first before increasing the number of copies. However, since the posterior gets more and more peaked as K increases, the consequences on the acceptance rate of the related ABC algorithm are unclear. Another interesting feature is that the underlying MCMC proposal on the parameter θ is an independent proposal, tuned during the warm-up stage of the algorithm. Since the tuning is repeated at each temperature, there are some loose ends as to whether or not it is a genuine Markov chain method. The same question arises when considering that additional past replicas need to be simulated when K increases. (Although they can be considered as virtual components of a vector made of an infinite number of replicas, to be used when needed.)
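To make the acceptance-rate concern concrete, here is a minimal toy sketch of an ABC-MCMC sampler with K clones. It is not the paper's algorithm: it assumes a N(θ,1) model, a flat prior, a uniform ABC kernel on the sample mean (a summary, which the paper deliberately avoids), and a random-walk proposal instead of the tuned independence proposal. All names and settings are hypothetical.

```python
import random


def simulate_mean(theta, n, rng):
    """Sample mean of one pseudo-dataset of n draws from N(theta, 1)."""
    return sum(rng.gauss(theta, 1.0) for _ in range(n)) / n


def abc_dc_mcmc(y_obs, K, delta, n_iter=3000, step=0.5, seed=0):
    """Toy ABC-MCMC with K clones; returns the chain and acceptance rate."""
    rng = random.Random(seed)
    n = len(y_obs)
    s_obs = sum(y_obs) / n
    theta = s_obs  # start at the observed summary
    chain, accepted = [], 0
    for _ in range(n_iter):
        prop = theta + step * rng.gauss(0.0, 1.0)  # symmetric random walk
        # Each of the K clones gets its own pseudo-dataset and must fall
        # within delta of the observed summary (uniform kernel).
        hit = all(abs(simulate_mean(prop, n, rng) - s_obs) <= delta
                  for _ in range(K))
        # Flat prior + symmetric proposal: the MH ratio equals 1, so the
        # move is accepted exactly when all K clones hit.
        if hit:
            theta = prop
            accepted += 1
        chain.append(theta)
    return chain, accepted / n_iter


def std(values):
    """Population standard deviation of a list of floats."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5


data_rng = random.Random(1)
y_obs = [data_rng.gauss(2.0, 1.0) for _ in range(20)]
y_bar = sum(y_obs) / len(y_obs)  # MLE of theta in this toy model

chain1, acc1 = abc_dc_mcmc(y_obs, K=1, delta=0.3)
chain5, acc5 = abc_dc_mcmc(y_obs, K=5, delta=0.3)
print(f"K=1: acceptance={acc1:.3f}  chain sd={std(chain1):.3f}")
print(f"K=5: acceptance={acc5:.3f}  chain sd={std(chain5):.3f}")
print(f"MLE={y_bar:.3f}  K=5 chain mean={sum(chain5) / len(chain5):.3f}")
```

In this toy setting, with δ held fixed, the chain concentrates around the MLE as K grows while the acceptance rate drops, since all K clones must fall within the tolerance simultaneously. Whether the same trade-off holds for the kernel and proposal actually used in the paper is a separate question.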
The simulation study involves a regular regression with 101 observations, a stochastic Gompertz model studied by Sophie Donnet, Jean-Louis Foulley, and Adeline Samson in 2010. With 12 points. And a simple Markov model. Again with 12 points. While the ABC-DC solutions are close enough to the true MLEs whenever available, a comparison with the cheaper ABC Bayes estimates would have been of interest as well.
August 23, 2016 at 10:16 pm
A major revision has just been accepted in Computational Statistics and Data Analysis (and is now available on arXiv). Your comment on not using summaries has been taken on board, and we now introduce summary statistics in all examples. Also, we added a regression regularization step (à la Beaumont) to construct a more informative independence sampler. An example using the g-and-k distributions has also been added, in place of the perhaps not-too-interesting polynomial regression study.
Comparisons with ABC-MCMC are also reported and we give some insight (see the final Summary section) on how to exploit our data-cloning ABC when model simulations are expensive. The g-and-k example also shows how Beaumont’s regularization allows a rapid increase in the number of clones. Interestingly, when cloning is “cheap” (e.g. by using carefully vectorized code) this translates into a reduced number of (ABC)MCMC iterations, which are overall less expensive than a regular ABC-MCMC.
August 24, 2016 at 11:56 pm
Thank you for the update, Umberto, and for referring to our 1993, 1998 and 2002 papers as predating data cloning! And for accounting for my earlier comments.
June 2, 2015 at 10:52 pm
Thanks Xi’An for this post.
Yes of course when I say that “we are explicitly avoiding the most typical usage of ABC, where the posterior is conditional on summary statistics” this shouldn’t be read as if I am actually claiming that I managed to get around the usage of summary statistics. It’s just that I didn’t want to explicitly consider that additional layer of approximation in the exposition.
But since in some (most) cases usage of summaries is a necessary approach, I guess in a next revision I should emphasize this fact and perhaps point readers to, say, the semi-automatic approach by Fearnhead & Prangle, if not actually using it!
Yes the tuning is repeated at each temperature, in a way that does not necessarily preserve Markovianity *over the whole simulation* but… when tuning is performed (and this happens only in those iterations where the temperature is modified, i.e. when the number of clones is increased) then we are targeting the posterior corresponding to that specific temperature. So, can’t we just consider the iterations when the tuning is performed (iterations s_1,…,s_q in my notation) as the start of corresponding chains? So the chain produced between iterations (s_j : s_{j+1}) targets the posterior corresponding to that temperature, and so it is consistent with that specific target. When the temperature is increased, the corresponding chain will target the posterior for that specific (modified) temperature and so on…
So I see it as just a possibly very long burn-in to try to approach the final distribution of interest, which is the one with the smallest threshold (delta) and the largest number of clones (K), and once we get there then K and delta are held constant. At this point the long “burn-in” has ended and there will be no further adaptation. It is only this last part (the one without further adaptation) which is used to produce inferential results, hence it should not cause any problem.
I must admit that I don’t quite get the comment on the sentence “This is a so-called “likelihood-free” approach [Sisson and Fan, 2011], meaning that knowledge of the complete expression for the likelihood function is not required.” Could you please clarify further?
Thanks again for the insightful post!