inefficiency of data augmentation for large samples

On Monday, James Johndrow, Aaron Smith, Natesh Pillai, and David Dunson arXived a paper on the diminishing benefits of using data augmentation for large and highly imbalanced categorical data. They reconsider the data augmentation scheme of Tanner and Wong (1987), surprisingly not mentioned, used in the first occurrences of the Gibbs sampler like Albert and Chib’s (1993) or our mixture estimation paper with Jean Diebolt (1990). The central difficulty with data augmentation is that the distribution to be simulated operates on a space that is of order O(n), even when the original distribution covers a single parameter. As illustrated by the coalescent in population genetics (and the subsequent intrusion of the ABC methodology), there are well-known cases when the completion is near to impossible and clearly inefficient (as again illustrated by the failure of importance sampling strategies on the coalescent). The paper provides spectral gaps for the logistic and probit regression completions, which are of order a power of log(n) divided by √n, when all observations are equal to one. In a somewhat related paper with Jim Hobert and Vivek Roy, we studied the spectral gap for mixtures with a small number of observations: I wonder at the existence of a similar result in this setting, when all observations stem from one component of the mixture, when all observations are one. The result in this paper is theoretically appealing, the more because the posteriors associated with such models are highly regular and very close to Gaussian (and hence not that challenging as argued by Chopin and Ridgway). And because the data augmentation algorithm is uniformly ergodic in this setting (as we established with Jean Diebolt  and later explored with Richard Tweedie). As demonstrated in the  experiment produced in the paper, when comparing with HMC and Metropolis-Hastings (same computing times?), which produce much higher effective sample sizes.

2 Responses to “inefficiency of data augmentation for large samples”

  1. jamesjohndrow Says:

    Xi’an, our thanks for the references and for your insightful comments about the paper. We will be sure to include these references when we post a new version. Radford, we agree that the intuition we give for the result is quite similar to the discussion in your slice sampling paper, and will add a mention and reference to that effect. Thanks again to you both.

    Xi’an, in response to your question about computation time per scan, we didn’t formally consider that, though qualitatively it is similar for HMC, RWMH, and Gibbs in the simple models. For the hierarchical application, HMC is quite slow, but I think a lot of this has to do with the implementation (Stan) rather than anything fundamental about the algorithm. RWMH has similar comparable computational complexity per iteration to Gibbs in this case.

  2. A related discussion is in my slice sampling paper from 2003, at

    There I discuss the multiple-auxiliary-variable algorithm of Damien, Wakefield, and Walker, explaining why it will often be very slow compared to the use of a single auxiliary variable in slice sampling. See pages 711 and 761-763.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s