## An alternative proof of DA convergence

I came across a curio today while looking at recent postings on arXiv, namely a different proof of the convergence of the Data Augmentation algorithm, more than twenty years after it was proposed by Martin Tanner and Wing Hung Wong in a 1987 JASA paper… Convergence under the positivity condition is of course a direct consequence of the ergodic theorem, as shown for instance by Tierney (1994), but the note by Yaming Yu uses instead the Kullback divergence

$$D(p^{(t)}\,\|\,\pi)=\int p^{(t)}(x)\,\log\frac{p^{(t)}(x)}{\pi(x)}\,\mathrm{d}x$$

and shows, as Liu, Wong and Kong do for the variance (Biometrika, 1994), that this divergence is monotonically decreasing in *t*. The proof is interesting in that only functional (i.e., non-ergodic) arguments are used, even though I am a wee bit surprised at the **IEEE Transactions on Information Theory** publishing this type of arcane mathematics… Note that the above divergence is the “wrong” one in that it measures the divergence from $p^{(t)}$, not from $\pi$. The convergence thus involves a sequence of divergences rather than a single one. (Of course, this has no consequence on the corollary that the total variation distance goes to zero.)
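The monotone decrease is easy to check numerically. Here is a toy sketch (my own, not from Yu's paper, using an arbitrary ergodic three-state kernel) showing that $D(p^{(t)}\,\|\,\pi)$ is non-increasing in $t$:

```python
# Toy check (not from the paper): for an ergodic finite-state chain with
# stationary distribution pi, the divergence D(p^(t) || pi) decreases in t.
import numpy as np

def kl(p, q):
    """Kullback divergence D(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical 3-state transition kernel (rows sum to one); any ergodic
# kernel would do for the illustration.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

p = np.array([0.9, 0.05, 0.05])   # initial distribution p^(0)
divs = []
for t in range(20):
    divs.append(kl(p, pi))
    p = p @ P                      # p^(t+1) = p^(t) P

monotone = all(d1 >= d2 - 1e-12 for d1, d2 in zip(divs, divs[1:]))
print(monotone)                    # the sequence of divergences is non-increasing
```

The same experiment run with the divergences in the opposite order, $D(\pi\,\|\,p^{(t)})$, also decreases on this toy chain, which is consistent with the discussion in the comments below.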

September 17, 2009 at 3:55 pm

I believe the proof of Lemma 3.1 relies on the fact that and .

For the case odd, we have (using equation (1))

which is equal to

since the factor

September 17, 2009 at 4:12 pm

Yes, I also believe this is the reason why this equality works!

September 17, 2009 at 12:45 pm

I had a further look at the equality of Lemma 3.1, following The Dude’s comments, and I am also unable to prove it:

involves an integral against instead of … I have the same trouble with equation (3). But looking further at the specific shape of the ‘s, I think there is a simplification of the conditionals in the ratio within the logarithm: for one parity of ,

and

equal to . This is a very special feature of the Data Augmentation algorithm.

September 17, 2009 at 10:42 am

Well, I would rather agree with xi’an there: the form of divergence used here is not the form used in ML estimation, which goes just the opposite way (the expectation is always taken with respect to the actual, fixed distribution, not the empirical one).

Whether one would get somewhere using the divergence in the opposite direction is an open question, but looking at Cover and Thomas, Eq. (2.125) (page 36), one sees that the actual result is

$$D(\mu_n\,\|\,\mu'_n)\ \ge\ D(\mu_{n+1}\,\|\,\mu'_{n+1})$$

where $\mu_0$ and $\mu'_0$ are two different initial distributions for the chain. Hence, taking $\mu_0=\pi$, $D(\pi\,\|\,p^{(t)})$ is also decreasing, but that doesn’t mean that it goes to zero…

Looking at the paper by Yu, I had some problems with Lemma 3.1, whose “proof is simple and hence omitted”! In fact the first unnumbered expression on page 36 of Cover and Thomas gives the general expression for the difference between the two divergences at successive time instants (which incidentally involves time-reversed kernels). It is hard to imagine that it can be expressed as something that does not depend on . This is certainly not true for Markov chains in general, not even for $\pi$-reversible ones. So, if anyone sees why this is true…
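To illustrate the Cover–Thomas point numerically, here is a toy example of my own (a reducible two-block kernel, not from either paper): the divergence between two distributions evolving under the same kernel is non-increasing, yet it does not go to zero.

```python
# Toy check of the Cover-Thomas monotonicity: D(mu_n || nu_n) is
# non-increasing under a common kernel, but need not vanish when the
# chain is reducible (two classes that never communicate).
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Reducible kernel: states {0,1} and {2,3} form separate classes.
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5, 0.5]])

mu = np.array([0.6, 0.1, 0.2, 0.1])   # two different initial distributions
nu = np.array([0.2, 0.1, 0.4, 0.3])

divs = []
for n in range(30):
    divs.append(kl(mu, nu))
    mu, nu = mu @ P, nu @ P

monotone = all(a >= b - 1e-12 for a, b in zip(divs, divs[1:]))
print(monotone, divs[-1])   # non-increasing, yet the limit stays positive
```

The block masses are preserved by the kernel, so the two sequences converge to different limits and the divergence stabilises at a strictly positive value.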

September 17, 2009 at 1:44 am

Dear Xi’an,

Just two quick comments on your interesting comment regarding this paper, which is itself a comment on the data augmentation algorithm :-)

1. “divergence is monotonically decreasing in t”: I wouldn’t attribute this fact to the Yu paper. It is a general phenomenon for Markov chains.

2. “Note that the above divergence is the ‘wrong’ one”: I’d argue that the other divergence, i.e., the one from $\pi$ to $p^{(t)}$, is the “wrong” one. For one thing, I doubt the proof of the theorem would go through if the other divergence were used. Secondly, when you express the maximum likelihood principle as minimizing the divergence between the empirical distribution and the theoretical distribution, you use the divergence from the empirical to the theoretical (data to model), not the other way around. The same thing here.
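This MLE reading of the divergence direction is easy to check on a toy model (my own sketch, with a hypothetical Bernoulli sample): minimising $D(\hat p\,\|\,p_\theta)$ over $\theta$, with $\hat p$ the empirical distribution, recovers the MLE.

```python
# Toy check: for a Bernoulli model, the theta minimising the divergence
# from the empirical distribution to the model, D(p_hat || p_theta),
# is the maximum likelihood estimate, i.e. the sample mean.
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])          # hypothetical sample
p_hat = np.array([np.mean(data == 0), np.mean(data == 1)])  # empirical dist.

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

thetas = np.linspace(0.01, 0.99, 981)                     # grid over (0, 1)
divs = [kl(p_hat, np.array([1 - th, th])) for th in thetas]
theta_star = thetas[int(np.argmin(divs))]

print(theta_star, data.mean())   # the minimiser coincides with the sample mean
```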

September 17, 2009 at 6:54 am

Thanks. Yaming Yu does note the fact that the divergence is decreasing for all Markov chains; I had missed this comment. As for point 2, I completely agree this is a convenient mathematical perspective and I note the MLE/EM analogy but, as a decision theorist looking at distances as loss functions, I find less than compelling the fact that the loss is induced by the current approximation to the target (truth) rather than by the target. In other words, it seems to me that the successive distances are not commensurable, i.e. that we are not using the same metric at each step. Obviously this does not undermine the total variation result.