Feedback on data cloning

Following some discussions I had last week at Banff about data cloning, I re-read the 2007 “Data cloning” paper published in Ecology Letters by Lele, Dennis, and Lutscher. Once again, I see a strong similarity with our 2002 Statistics and Computing SAME algorithm, as well as with the subsequent (and equally similar) “A multiple-imputation Metropolis version of the EM algorithm”, published in Biometrika by Gaetan and Yao in 2003. (Biometrika is also where Arnaud and I had earlier and unsuccessfully submitted an unpublished technical report on the convergence of the SAME algorithm.) The SAME algorithm is further described in detail in Chapter 13 of the 2005 book Inference in Hidden Markov Models.

What I find most surprising about the data cloning paper is that the fundamental feature of simulated annealing, which consists of progressively increasing the power k (i.e., decreasing the temperature) in the pseudo-posterior

\pi_k(\theta|y) \propto \pi(\theta) \ell(\theta|y)^k,

is abandoned by the authors, who prefer to work with large powers k from the start. While using a very large power does lead to a Bayes estimate that is close to the MLE, freezing the power once and for all seems to be the “wrong idea”, in that it removes the dynamic feature of a simulated annealing random walk, which first explores the whole space and then progressively focuses on the highest modes, achieving convergence when the cooling is slow enough (as established in our 2001 unpublished technical report). In other words, if k is “large enough”, the Metropolis algorithm will face difficulties in exploring the parameter space, and hence in discovering the global modes, while, if k is “too small”, there is no certainty that the algorithm will identify the right mode. A practical implementation thus requires an increasing sequence of k’s, which is very demanding in computing time, especially when k is large, and which somewhat undercuts the appeal of the method. While Lele, Dennis and Lutscher argue that the posterior distribution \pi_k(\theta|y) converges to a normal with variance related to the Fisher information, Gaetan and Yao (more accurately) show that the limiting distribution is a sum of point masses at the global modes of the likelihood. (I also stress that the asymptotic variance provided by data cloning in regular models has no direct connection with the intrinsic error of the MLE for the sample under study, because it is asymptotic. And that the suggestion to run the Metropolis–Hastings algorithm with a proposal from the prior is certain to slow down the exploration.)
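As an illustration of the annealing dynamics discussed above (and not of the data-cloning implementation itself), here is a minimal sketch of a random-walk Metropolis sampler targeting \pi_k(\theta|y) \propto \pi(\theta)\ell(\theta|y)^k with a slowly increasing power; the bimodal toy likelihood, the linear schedule, and all tuning constants are made up for the purpose of the illustration.

```python
import math
import random

# Toy bimodal log-likelihood: a global mode near theta = 3 and a
# smaller local mode near theta = -2 (relative heights 1 and 0.1).
def log_lik(theta):
    return math.log(math.exp(-0.5 * (theta - 3.0) ** 2)
                    + 0.1 * math.exp(-0.5 * (theta + 2.0) ** 2))

def log_prior(theta):
    return -0.5 * (theta / 10.0) ** 2  # vague N(0, 10^2) prior

def annealed_metropolis(n_iter=20000, k_max=50.0, seed=1):
    """Random-walk Metropolis on pi(theta) * lik(theta)^k, with the
    power k increasing linearly from 1 to k_max (slow cooling)."""
    rng = random.Random(seed)
    theta = 0.0
    for t in range(n_iter):
        k = 1.0 + (k_max - 1.0) * t / n_iter   # current power / inverse temperature
        prop = theta + rng.gauss(0.0, 1.0)     # random-walk proposal
        log_alpha = (log_prior(prop) + k * log_lik(prop)
                     - log_prior(theta) - k * log_lik(theta))
        if math.log(rng.random()) < log_alpha:
            theta = prop
    return theta

print(annealed_metropolis())  # ends close to the global mode at 3
```

At k close to 1, the chain moves freely between both modes; as k grows, the powered target concentrates on the global mode, so the final state sits near 3. Freezing a large k from the first iteration would instead trap the chain near whichever mode it started from.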
Coincidentally, another “clone” of the SAME algorithm, entitled “MCMC maximum likelihood for latent state models” and written by Jacquier, Johannes, and Polson, appeared the same year (2007) in the Journal of Econometrics; it similarly abstains from a dynamic simulated-annealing increase in the power, instead running a full MCMC algorithm for each of several (increasing) values of k.

4 Responses to “Feedback on data cloning”

  1. […] the marginal structure to design an automated simulated annealing algorithm. (This algorithm was later re-invented under other names.) Birge and Polson have a similar version involving slice […]

  2. […] the algorithm. My first idea of a tempering mechanism in a likelihood-free setting was to replicate our SAME algorithm (Doucet, Godsill, and Robert, 2004), by creating Tj copies of the [pseudo-]observations to mimic […]

  3. Umberto Says:

    Dear Prof. Robert,

    it is indeed surprising not to let k vary freely through the MCMC simulations, as this is what we would expect in a simulated annealing procedure. However, I have a doubt about such an approach (i.e., updating k dynamically). If k is allowed to grow, say linearly, through the MCMC algorithm, then the “likelihood” for the cloned data in the denominator of the acceptance probability in Lele et al. would be based on k clones, whereas the likelihood for the proposal in the numerator would be computed using, say, k+1 cloned data points (or more). Wouldn’t this represent an unfair comparison between the proposal and the last chain value? This way the likelihood for the proposal would be based on a larger amount of “data”.
    I would appreciate your thoughts on this.

    Thank you very much for running such an interesting blog (and for your books of course!).

    • Umberto: thanks for the comments and the thanks!
      My feeling about the validity of a slowly varying k is that (a) you can correct for changing the power along the iterations, as Radford Neal did for instance in his tempered transitions (1996) and annealed importance sampling (2001) papers, and (b) the method does not aim at correctly simulating from the powered distribution(s), but rather at reaching the proper collection of global maxima. Hence simulating incorrectly from the powered target does not matter much in the end.
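      To make point (a) concrete, here is a sketch of the acceptance probability under a schedule k_t that increases with the iteration t: both the current value and the proposal are evaluated under the same power k_t,

      \alpha_t = \min\left(1, \dfrac{\pi(\theta')\,\ell(\theta'|y)^{k_t}\,q(\theta|\theta')}{\pi(\theta)\,\ell(\theta|y)^{k_t}\,q(\theta'|\theta)}\right),

      so there is no mismatch in the number of clones between numerator and denominator; the target simply changes from one iteration to the next, which makes the chain inhomogeneous and the simulation from each \pi_{k_t} only approximate.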
