Archive for convergence diagnostics

Statistical rethinking [book review]

Posted in Books, Kids, R, Statistics, University life on April 6, 2016 by xi'an

Statistical Rethinking: A Bayesian Course with Examples in R and Stan is a new book by Richard McElreath that CRC Press sent me for review in CHANCE. While the book was already discussed on Andrew’s blog three months ago, and [rightly so!] enthusiastically recommended by Rasmus Bååth on Amazon, here are the reasons why I am quite impressed by Statistical Rethinking!

“Make no mistake: you will wreck Prague eventually.” (p.10)

While the book has a lot in common with Bayesian Data Analysis, from being in the same CRC series to adopting a pragmatic and weakly informative approach to Bayesian analysis, to supporting the use of Stan, it also nicely develops its own ecosystem and idiosyncrasies, with a noticeable Jaynesian bent. To start with, I like the highly personal style, with clear attempts to make the concepts memorable for students by resorting to external concepts. The best example is the call to the myth of the golem in the first chapter, which McElreath uses as a warning about the use of statistical models (which are almost anagrams of golems!). Golems and models [and robots, another concept invented in Prague!] are man-made devices that strive to accomplish the goal set for them without heeding the consequences of their actions. This first chapter of Statistical Rethinking sets the ground for the rest of the book and gets quite philosophical (albeit in a readable way!) as a result. In particular, there is a most coherent call against hypothesis testing, which by itself justifies the title of the book. Continue reading

importance sampling with infinite variance

Posted in pictures, R, Statistics, University life on November 13, 2015 by xi'an

“In this article it is shown that in a fairly general setting, a sample of size approximately exp(D(μ|ν)) is necessary and sufficient for accurate estimation by importance sampling.”

Sourav Chatterjee and Persi Diaconis arXived yesterday an exciting paper where they study the proper sample size in an importance sampling setting with no finite variance. That’s right, with no finite variance. They give as a starting toy example the use of an Exp(1) proposal for an Exp(1/2) target, where the importance ratio exp(x/2)/2 has no moment of order ξ for ξ≥2. So the infinity in the variance is somehow borderline in this example, which may explain why the estimator could be considered to “work”. However, I disagree with the statement that “a sample size of a few thousand suffices” for the estimator of the mean to be close to the true value, that is, 2. For instance, the picture I drew above is the superposition of 250 sequences of importance sampling estimators across 10⁵ iterations: several sequences show huge jumps, even after a large number of iterations, jumps that are characteristic of infinite variance estimates. Thus, while the expected distance to the true value can be closely evaluated via the Kullback-Leibler divergence between the target and the proposal (which, by the way, is infinite when using a Normal as proposal and a Cauchy as target), there are realisations of the simulation path that remain far from the true value, and this for an arbitrarily large number of simulations. (I even wonder whether, for a given simulation path, waiting long enough should not lead to such unbounded jumps.) The first result is frequentist, while the second is conditional, i.e., can occur for the single path we have just simulated… As I taught in class this very morning, I thus remain wary about using an infinite variance estimator. (And not only in connection with the harmonic mean quagmire. As shown below by the more extreme case of an Exp(1) proposal for an Exp(1/10) target, where the mean is completely outside the range of estimates.) Wary, then, even though I find the enclosed result about the existence of a cut-off sample size associated with this L¹ measure quite astounding. Continue reading
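
Here is a minimal R sketch (my own reconstruction, not the authors’ code) of the kind of experiment behind the picture: 250 replicated paths of the running importance sampling estimate of the Exp(1/2) mean (true value 2), using an Exp(1) proposal whose importance ratio exp(x/2)/2 has infinite variance.

set.seed(1)
n_iter <- 1e5                             # simulations per path
n_path <- 250                             # replicated paths
paths <- replicate(n_path, {
  x <- rexp(n_iter, rate = 1)             # draws from the Exp(1) proposal
  w <- dexp(x, rate = 1/2) / dexp(x, rate = 1)   # importance ratio exp(x/2)/2
  cumsum(w * x) / seq_len(n_iter)         # running estimates of E[X] = 2
})
matplot(paths, type = "l", lty = 1, col = rgb(0, 0, 0, .2),
        xlab = "iterations", ylab = "importance sampling estimate")
abline(h = 2, col = "red")                # true mean of the Exp(1/2) target

The occasional huge jumps in the plotted paths, even late in the runs, are the visual signature of the infinite variance.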

consistency of ABC

Posted in pictures, Statistics, Travel, University life on August 25, 2015 by xi'an

Along with David Frazier and Gael Martin from Monash University, Melbourne, we have just completed (and arXived) a paper on the (Bayesian) consistency of ABC methods, producing sufficient conditions on the summary statistics to ensure consistency of the ABC posterior. Consistency in the sense of the posterior concentrating at the true value of the parameter when the sample size and the inverse tolerance (intolerance?!) go to infinity. The conditions are essentially that the summary statistic concentrates around its mean and that this mean identifies the parameter. They are thus weaker than the conditions found in earlier consistency results, where the authors considered convergence to the genuine posterior distribution (given the summary), as for instance in Biau et al. (2014) or Li and Fearnhead (2015). We do not require here a specific rate of decrease to zero for the tolerance ε. Still, the conditions do not hold all the time, as shown by the MA(2) example and its first two autocorrelation summaries, an example we started using in the Marin et al. (2011) survey. We further propose a consistency assessment based on the main consistency theorem, namely that the ABC-based estimates of the marginal posterior densities for the parameters, estimated from simulated data, should vary little when extra components are added to the summary statistic, and that the mean of the resulting summary statistic is indeed one-to-one. This may sound somewhat similar to the stepwise search algorithm of Joyce and Marjoram (2008), but those authors aim at obtaining a vector of summary statistics that is as informative as possible. We also examine the consistency conditions when using an auxiliary model as in indirect inference, for instance when using an AR(2) auxiliary model for estimating an MA(2) model. And ODEs.
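
For readers who want to experiment, here is a toy R sketch of my own (not the paper’s code) of ABC rejection for the MA(2) example, with the first two sample autocorrelations as summary statistics; the “true” parameter value and all tuning choices are arbitrary.

set.seed(2)
ma2_sim <- function(theta, n) {          # simulate an MA(2) series of length n
  e <- rnorm(n + 2)
  e[-(1:2)] + theta[1] * e[2:(n + 1)] + theta[2] * e[1:n]
}
summ <- function(x) acf(x, lag.max = 2, plot = FALSE)$acf[2:3]   # (rho_1, rho_2)

n <- 500
theta0 <- c(0.6, 0.2)                    # hypothetical "true" parameter
s_obs <- summ(ma2_sim(theta0, n))

N <- 2e4                                 # size of the reference table
# uniform prior over the invertibility triangle of the MA(2) model
prior <- matrix(NA, N, 2); kept <- 0
while (kept < N) {
  cand <- c(runif(1, -2, 2), runif(1, -1, 1))
  if (cand[1] + cand[2] > -1 && cand[2] - cand[1] > -1) {
    kept <- kept + 1; prior[kept, ] <- cand
  }
}
dists <- apply(prior, 1, function(th) sum((summ(ma2_sim(th, n)) - s_obs)^2))
eps <- quantile(dists, 0.005)            # keep the 0.5% closest simulations
abc_post <- prior[dists <= eps, ]
colMeans(abc_post)                       # ABC estimates of (theta_1, theta_2)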

evaluating stochastic algorithms

Posted in Books, R, Statistics, University life on February 20, 2014 by xi'an

Reinaldo sent me this email a long while ago

Could you recommend me a nice reference about measures to evaluate stochastic algorithms (in particular focus in approximating posterior distributions).

and I hope he is still reading the ‘Og, despite my lack of a prompt reply! I procrastinated and procrastinated in answering this question as I did not have a ready reply… We have indeed seen (almost suffered from!) a flow of MCMC convergence diagnostics in the 1990s. And then it dried out. Maybe because of the impossibility of being “really” sure, unless one runs one’s MCMC much longer than “necessary to reach” stationarity and convergence. The heat of the dispute between the “single chain school” of Geyer (1992, Statistical Science) and the “multiple chain school” of Gelman and Rubin (1992, Statistical Science) has long since evaporated. My feeling is that people (still) run their MCMC samplers several times and check for coherence between the outcomes, possibly using different kernels on parallel threads. At best, but rarely, they run (one form or another of) tempering to identify the modal zones of the target. And instances where non-trivial control variates are available are fairly rare. Hence, a non-sequitur reply at the MCMC level, as there is no automated tool available, in my opinion. (Even though I did not check the latest versions of BUGS.)
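
As an illustration of that multiple-chain practice, here is a minimal R sketch of my own (a toy Normal target, purely for illustration): a few random-walk Metropolis chains started from dispersed points and compared through the Gelman-Rubin shrink factor in the coda package.

library(coda)
set.seed(3)
rw_chain <- function(n, start, scale = 1) {
  x <- numeric(n); x[1] <- start
  for (t in 2:n) {
    prop <- x[t - 1] + scale * rnorm(1)   # random walk proposal
    if (log(runif(1)) < dnorm(prop, log = TRUE) - dnorm(x[t - 1], log = TRUE)) {
      x[t] <- prop
    } else {
      x[t] <- x[t - 1]
    }
  }
  x
}
# three chains with overdispersed starting values
chains <- mcmc.list(lapply(c(-10, 0, 10), function(s) mcmc(rw_chain(5e3, s))))
gelman.diag(chains)   # a shrink factor near 1 suggests, but does not prove, convergence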

As it happened, Didier Chauveau from Orléans gave a talk today at Big’MC on convergence assessment based on entropy estimation, a joint work with Pierre Vandekerkhove. He mentioned SamplerCompare, an R package that appeared in 2010. Soon to come is their own EntropyMCMC package, using parallel simulation and k-nearest neighbour estimation.
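
For the curious, here is a rough R sketch of the generic nearest-neighbour (Kozachenko-Leonenko) entropy estimator underlying this kind of approach; it is my own toy version, not the EntropyMCMC implementation.

kl_entropy <- function(x) {
  x <- as.matrix(x)
  n <- nrow(x); d <- ncol(x)
  dd <- as.matrix(dist(x)); diag(dd) <- Inf
  eps <- apply(dd, 1, min)                        # distance to nearest neighbour
  log_cd <- (d / 2) * log(pi) - lgamma(d / 2 + 1) # log volume of the unit d-ball
  digamma(n) - digamma(1) + log_cd + d * mean(log(eps))
}
kl_entropy(rnorm(1e3))   # true N(0,1) entropy is 0.5 * log(2 * pi * e), about 1.42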

If I re-interpret the question as focussed on ABC algorithms, it gets both more delicate and easier. Easier because each ABC distribution is different, so there is no reason to look at the unreachable original target. More delicate because there are several parameters to calibrate (tolerance, choice of summary, …) on top of the number of MCMC simulations. In DIYABC, the outcome is always made of the superposition of several runs, to check for stability (or the lack thereof). But this tells us nothing about the distance to the true original target. The obvious but impractical answer is to use some basic bootstrapping, impractical as it is generally much too costly.
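
To make that superposition check concrete, here is a toy R sketch of my own (a basic normal-mean example, not DIYABC output): the same ABC analysis repeated on independent reference tables, with the resulting density estimates overlaid.

set.seed(4)
y_obs <- rnorm(50, mean = 1)                 # hypothetical observed sample
abc_run <- function(N = 1e4, eps_q = 0.01) {
  theta <- rnorm(N, 0, 5)                    # draws from the N(0, 25) prior
  s_sim <- sapply(theta, function(th) mean(rnorm(50, th)))
  d <- abs(s_sim - mean(y_obs))              # distance between summaries
  theta[d <= quantile(d, eps_q)]             # accepted parameter values
}
runs <- replicate(5, abc_run(), simplify = FALSE)
plot(density(runs[[1]]), xlab = expression(theta),
     main = "ABC posterior across repeated runs")
for (r in runs[-1]) lines(density(r), col = "grey50")
# stable overlaid curves indicate that the Monte Carlo noise is under control,
# but say nothing about the distance to the true posterior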

Particle learning [rejoinder]

Posted in R, Statistics, University life on November 10, 2010 by xi'an

Following the posting on arXiv of the Statistical Science paper of Carvalho et al., and the publication by the same authors in Bayesian Analysis of Particle Learning for general mixtures, I noticed on Hedibert Lopes’ website that his rejoinder to the discussion of his Valencia 9 paper had been posted. Since the discussion involved several points made by members of the CREST statistics lab (and covered the mixture paper as much as the Valencia 9 paper), I was quite eager to read Hedie’s reply. Unsurprisingly, this rejoinder is unlikely to modify my reservations about particle learning. The following is a detailed examination of the arguments found in the rejoinder, but it requires a preliminary reading of the above papers as well as our discussion. Continue reading

Bayes vs. SAS

Posted in Books, R, Statistics on May 7, 2010 by xi'an

Glancing perchance at the back of my Amstat News, I was intrigued by the SAS advertisement

Bayesian Methods

  • Specify Bayesian analysis for ANOVA, logistic regression, Poisson regression, accelerated failure time models and Cox regression through the GENMOD, LIFEREG and PHREG procedures.
  • Analyze a wider variety of models with the MCMC procedure, a general purpose Bayesian analysis procedure.

and so decided to take a look at those items on the SAS website. (Some entries date back to 2006 so I am not claiming novelty in this post, just my reading through the manual!)

Even though I have not looked at a SAS program since 1984, when I was learning principal component and discriminant analysis by programming SAS procedures on punched cards, it seems the MCMC part is rather manageable (if you can manage SAS at all!), looking very much like a second BUGS to my bystander eyes, even to the point of including ARS algorithms! The models are defined in a BUGS manner, with priors on the side (and this includes improper priors, despite a confusing first example that mixes very large variances with vague priors for the linear model!). The basic scheme is a random walk proposal with adaptive scale or covariance matrix. (The adaptivity on the covariance matrix is slightly confusing in that, as described, it does not seem to implement the requirements of Roberts and Rosenthal for guaranteed convergence.) Gibbs sampling is not directly covered, although some examples are in essence using Gibbs samplers. Convergence is assessed via ca. 1995 methods à la Cowles and Carlin, including the rather unreliable Raftery and Lewis indicator, but the same holds for Introducing Monte Carlo Methods with R, which takes advantage of the R coda package. I have not tested (!) any of the features in the MCMC procedure but, judging from a quick skim through the 283-page manual, everything looks reasonable enough. I wonder if anyone has ever tested a SAS program against its BUGS counterpart for an efficiency comparison.
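
For readers without SAS access, here is a minimal R analogue of my own (not SAS code) of the scheme described in the manual: a random walk Metropolis sampler whose proposal scale is adapted during burn-in towards a target acceptance rate.

set.seed(5)
adapt_rw <- function(logpost, n = 1e4, burnin = 2e3, start = 0, target = 0.44) {
  x <- numeric(n); x[1] <- start
  scale <- 1; acc <- 0
  for (t in 2:n) {
    prop <- x[t - 1] + scale * rnorm(1)
    if (log(runif(1)) < logpost(prop) - logpost(x[t - 1])) {
      x[t] <- prop; acc <- acc + 1
    } else {
      x[t] <- x[t - 1]
    }
    if (t <= burnin && t %% 100 == 0) {          # adapt only during burn-in
      scale <- scale * exp(acc / 100 - target)   # nudge towards the target rate
      acc <- 0
    }
  }
  x[-(1:burnin)]
}
draws <- adapt_rw(function(x) dnorm(x, mean = 2, log = TRUE))  # toy N(2,1) target
c(mean = mean(draws), sd = sd(draws))

Freezing the adaptation after burn-in is one simple way of avoiding the ergodicity issues raised by Roberts and Rosenthal.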

The Bayesian aspects are rather traditional as well, except for the testing issue. Indeed, from what I have read, SAS does not engage in testing and remains within estimation bounds, offering only HPD regions for variable selection without producing a genuine Bayesian model choice tool. I understand the issues with handling improper priors versus computing Bayes factors, as well as some delicate computational requirements, but this is a truly important chunk missing from the package. (Of course, the package contains a DIC (deviance information criterion) capability, which may be seen as a substitute, but I have reservations about the relevance of DIC outside generalised linear models. Same difficulty with the posterior predictive.) As usual with SAS, the documentation is huge (I still remember the shelves after shelves of documentation volumes in my 1984 card-punching room!) and full of options and examples. Nothing to complain about, except maybe the list of disadvantages of using Bayesian analysis:

  • It does not tell you how to select a prior. There is no correct way to choose a prior. Bayesian inferences require skills to translate prior beliefs into a mathematically formulated prior. If you do not proceed with caution, you can generate misleading results.
  • It can produce posterior distributions that are heavily influenced by the priors. From a practical point of view, it might sometimes be difficult to convince subject matter experts who do not agree with the validity of the chosen prior.
  • It often comes with a high computational cost, especially in models with a large number of parameters.

which does not say much… Since the MCMC procedure allows for any degree of hierarchical modelling, it is always possible to check the impact of a given prior by letting its parameters go random. I find that most practitioners are happy with the formalisation of their prior beliefs into mathematical densities, rather than being adamant about a specific prior. As for computation, this is not a major issue.
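
As a small illustration of that prior-impact check, here is a toy R sketch of my own (a Normal mean with known unit variance, unrelated to SAS): the same data analysed under a fixed Normal prior and under a hierarchical version where the prior variance gets an inverse-gamma hyperprior, handled by a short Gibbs sampler.

set.seed(6)
y <- rnorm(20, mean = 1); n <- length(y)     # y_i ~ N(mu, 1)

# fixed prior mu ~ N(0, 4): the posterior is available in closed form
tau2 <- 4
post_var  <- 1 / (n + 1 / tau2)
post_mean <- post_var * sum(y)

# hierarchical prior: mu ~ N(0, tau2), tau2 ~ IG(1, 1), Gibbs on (mu, tau2)
a <- b <- 1; n_iter <- 1e4
mu <- numeric(n_iter); t2 <- 1
for (i in 1:n_iter) {
  v     <- 1 / (n + 1 / t2)
  mu[i] <- rnorm(1, v * sum(y), sqrt(v))
  t2    <- 1 / rgamma(1, a + 0.5, b + mu[i]^2 / 2)
}
c(fixed = post_mean, hierarchical = mean(mu))   # compare posterior means of mu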