**O**n Monday, James Johndrow, Aaron Smith, Natesh Pillai, and David Dunson arXived a paper on the diminishing benefits of using data augmentation for large and highly imbalanced categorical data. They reconsider the data augmentation scheme of Tanner and Wong (1987), surprisingly not mentioned, used in the first occurrences of the Gibbs sampler like Albert and Chib’s (1993) or our mixture estimation paper with Jean Diebolt (1990). The central difficulty with data augmentation is that the distribution to be simulated operates on a space that is of order O(n), even when the original distribution covers a single parameter. As illustrated by the coalescent in population genetics (and the subsequent intrusion of the ABC methodology), there are well-known cases when the completion is near to impossible and clearly inefficient (as again illustrated by the failure of importance sampling strategies on the coalescent). The paper provides spectral gaps for the logistic and probit regression completions, which are of order a power of log(n) divided by √n, when all observations are equal to one. In a somewhat related paper with Jim Hobert and Vivek Roy, we studied the spectral gap for mixtures with a small number of observations: I wonder at the existence of a similar result in this setting, when all observations stem from one component of the mixture, when all observations are one. The result in this paper is theoretically appealing, the more because the posteriors associated with such models are highly regular and very close to Gaussian (and hence not that challenging as argued by Chopin and Ridgway). And because the data augmentation algorithm is uniformly ergodic in this setting (as we established with Jean Diebolt and later explored with Richard Tweedie). As demonstrated in the experiment produced in the paper, when comparing with HMC and Metropolis-Hastings (same computing times?), which produce much higher effective sample sizes.

## Archive for probit model

## inefficiency of data augmentation for large samples

Posted in Books, pictures, Running, Statistics, Travel, University life with tags convergence of Gibbs samplers, Data augmentation, Gibbs sampling, Hamiltonian Monte Carlo, importance sampling, logit model, MCMC, Monte Carlo Statistical Methods, probit model, simulation, spectral gap on May 31, 2016 by xi'an## Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life with tags ABC, Bayes factor, benchmark, Chib's approximation, CPU, diabetes, EP-ABC, expectation-propagation, Gibbs sampling, Jim Berger, logistic regression, MCMC algorithms, Monte Carlo Statistical Methods, Newton-Raphson algorithm, Pima Indians, probit model, R on July 15, 2015 by xi'an

“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

**N**icolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.

As a first entry, Nicolas and James point out that the MAP can be derived by standard a Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their *gold standard*. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.

In the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.

## corrected MCMC samplers for multivariate probit models

Posted in Books, pictures, R, Statistics, University life with tags Bayesian modelling, Data augmentation, identifiability, Journal of Econometrics, MNP package, multivariate probit model, probit model, R, Wishart distribution on May 6, 2015 by xi'an

“Moreover, IvD point out an error in Nobile’s derivation which can alter its stationary distribution. Ironically, as we shall see, the algorithms of IvD also contain an error.”

**X**iyun Jiao and David A. van Dyk arXived a paper correcting an MCMC sampler and R package MNP for the multivariate probit model, proposed by Imai and van Dyk in 2005. [Hence the abbreviation IvD in the above quote.] Earlier versions of the Gibbs sampler for the multivariate probit model by Rob McCulloch and Peter Rossi in 1994, with a Metropolis update added by Agostino Nobile, and finally an improved version developed by Imai and van Dyk in 2005. As noted in the above quote, Jiao and van Dyk have discovered two mistakes in this latest version, jeopardizing the validity of the output.

The multivariate probit model considered here is a multinomial model where the occurrence of the k-th category is represented as the k-th component of a (multivariate) normal (correlated) vector being the largest of all components. The latent normal model being non-identifiable since invariant by either translation or scale, identifying constraints are used in the literature. This means using a covariance matrix of the form Σ/trace(Σ), where Σ is an inverse Wishart random matrix. In their 2005 implementation, relying on marginal data augmentation—which essentially means simulating the non-identifiable part repeatedly at various steps of the data augmentation algorithm—, Imai and van Dyk missed a translation term and a constraint on the simulated matrices that lead to simulations outside the rightful support, as illustrated from the above graph [snapshot from the arXived paper].

Since the IvD method is used in many subsequent papers, it is quite important that these mistakes are signalled and corrected. [Another snapshot above shows how much both algorithm differ!] Without much thinking about this, I [thus idly] wonder why an identifying prior is not taking the place of a hard identifying constraint, as it should solve the issue more nicely. In that it would create less constraints and more entropy (!) in exploring the augmented space, while theoretically providing a convergent approximation of the identifiable parts. I may (must!) however miss an obvious constraint preventing this implementation.

## probit posterior mean

Posted in Statistics, University life with tags Bayesian inference, generalised linear models, least squares, probit model on March 9, 2012 by xi'an**I**n a recent arXiv report, Yuzo Maruyma shows that the posterior expectation of a probit parameter has an almost closed form (under a flat prior), namely

where *ω* involves the integration of two quadratic forms over the *n*-dimensional unit sphere. Although this does not help directly with the MCMC derivation of the full posterior, this is an interesting lemma which shows a closed proximity with the standard least square estimate in linear regression.

## Vanilla Rao-Blackwellisation [re]revised

Posted in R, Statistics with tags control variates, MCMC, Metropolis-Hastings, Pima Indians, probit model, Rao-Blackwellisation, Wordpress on June 1, 2010 by xi'an**A**lthough the revision is quite minor, it took us two months to complete from the time I received the news in the Atlanta airport lounge… The vanilla Rao-Blackwellisation paper with Randal Douc has thus been resubmitted to the * Annals of Statistics*. And rearXived. The only significant change is the inclusion of two tables detailing computing time, like the one below

which provides different evaluations of the additional computing effort due to the use of the Rao–Blackwellisation: median and mean numbers of additional iterations, $80\%$ and $90\%$ quantiles for the additional iterations, and ratio of the average R computing times obtained over $10^5$ simulations. (Turning the above table into a formula acceptable by WordPress took me for ever, as any additional white space between the terms of the matrix is mis-interpreted!) Now, the mean time column does not look very supportive of the Rao-Blackwellisation technique, but this is due to the presence of a few outlying runs that required many iterations before hitting an acceptance probability of one. Excessive computing time can be curbed by using a pre-set number of iterations, as described in the paper…

## Vanilla Rao-Blackwellisation for revision

Posted in R, Statistics with tags control variates, MCMC, Metropolis-Hastings, Pima Indians, probit model, Rao-Blackwellisation on March 18, 2010 by xi'an**T**he vanilla Rao-Blackwellisation paper with Randal Douc that had been resubmitted to the * Annals of Statistics* is now back for a revision, with quite encouraging comments:

The paper has been reviewed by two referees both of whom comment on the clear exposition and the novelty of the results. Both referees point to the empirical results as being suggestive of a more incremental improvement in practice rather than a major advance. However the approach the authors adopt is novel and I believe may motivate further developments in this area.

**I** cannot but agree on those comments! Since we are reducing the variance of the weights, the overall effect may be difficult to spot in practical applications. In the current version of the paper, we manage 20% reduction in the variance of those weights, but obviously this does not transfer to the same reduction of the variance of the overall estimator! Our vanilla Rao-Blackwellisation does not speed up the Markov chain.

## Two more handbook chapters

Posted in Books, Statistics with tags ABC, ABC-SMC, Bayesian methods and expert elicitation, computational statistics, handbook of computational statistics, Monte Carlo methods, Pima Indian, population Monte Carlo, probit model, reversible jump on February 16, 2010 by xi'an**A**s mentioned in my earlier post, I had to write a revised edition to my chapter Bayesian Computational Methods in the * Handbook of Computational Statistics* (second edition), edited by J. Gentle, W. Härdle and Y. Mori. And in parallel I was asked for a second chapter in a handbook on risk analysis,

**, edited by Klaus Böckner. So, on Friday, I went over the first edition of this chapter of the**

*Bayesian methods and expert elicitation**and added the most recent developments I deemed important to mention, like ABC, as well as recent SMC and PMC algorithms, increasing the length by about ten pages. Simultaneously, Jean-Michel Marin completed my draft for the other handbook and I submitted both chapters, as well as arXived one and then the other.*

**Handbook of Computational Statistics****I**t is somehow interesting (on a lazy blizzardly Sunday afternoon with nothing better to do!) to look for the differences between those chapters aiming at the same description of important computational techniques for Bayesian statistics (and based on the same skeleton). The first chapter is broader and, with its 60 pages, it functions as a (very) short book on the topic. Given that the first version was written in 2003, the focus is more on latent variables with mixture models being repeatedly used as examples. Reversible jump also stands preeminently. In my opinion, it reads well and could be used as a primary entry for a short formal course on computational methods. (Even though ** Introducing Monte Carlo Methods with R** is presumably more appropriate for a short course.)

**T**he second chapter started from the skeleton of the earlier version of the first chapter with the probit model as the benchmark example. I worked on a first draft during the last vacations and then Jean-Michel took over to produce this current version, where reversible jump has been removed and ABC introduced with greater details. In particular, we used a very special version of ABC with the probit model, resorting to the distance between the expectations of the binary observables, namely

where is the MLE of based on the observations, instead of the difference between the simulated and the observed binary observables

which incorporates a useless randomness. With this choice, and when using for a .01 quantile, the difference with the true posterior on is very small, as shown by the figure (obtained for the Pima Indian dataset in R). Obviously, this stabilising trick only works in specific situations where a predictive of sorts can be computed.