Archive for International Statistical Review

Rao-Blackwellisation in the MCMC era

Posted in Books, Statistics, University life on January 6, 2021 by xi'an

A few months ago, as indicated on this blog, I was contacted by ISR editors to write a piece on Rao-Blackwellisation, towards a special issue celebrating Calyampudi Radhakrishna Rao’s 100th birthday. Gareth Roberts and I came up with this survey, now on arXiv, discussing different aspects of Monte Carlo and Markov Chain Monte Carlo that pertained to Rao-Blackwellisation, one way or another. As I discussed the topic with several friends over the Fall, it appeared that the difficulty was more in setting the boundaries than in finding connections. In a way, anything conditioning or demarginalising or resorting to auxiliary variates is a form of Rao-Blackwellisation. When re-reading the JASA Gelfand and Smith 1990 paper where I first saw the link between the Rao-Blackwell theorem and simulation, I realised my memory of it had drifted from the original, since the authors proposed there an approximation of the marginal based on replicas rather than the original Markov chain, which makes it much closer to Tanner and Wong (1987) than I thought. It is only later that the true notion took shape. [Since the current version is still a draft, any comment or suggestion would be most welcome!]
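To make the conditioning idea concrete, here is a minimal Python sketch, not taken from the survey itself. It assumes a toy setting of my own choosing: a Student's t variate written as a scale mixture, X | Y ~ N(0, 1/Y) with Y ~ Gamma(ν/2, rate ν/2), and the function h(x) = exp(-x²), for which E[h(X) | Y] = 1/√(1 + 2/Y) is available in closed form. The names `simulate`, `crude` and `rb` are purely illustrative.

```python
import math
import random
import statistics


def simulate(n, nu=5.0, seed=0):
    """Return crude and Rao-Blackwellised terms for E[exp(-X^2)], X ~ t_nu."""
    rng = random.Random(seed)
    crude, rao_blackwell = [], []
    for _ in range(n):
        # Y ~ Gamma(shape=nu/2, rate=nu/2); gammavariate takes (shape, scale)
        y = rng.gammavariate(nu / 2.0, 2.0 / nu)
        sigma2 = 1.0 / y
        # X | Y ~ N(0, 1/Y), so that X is marginally Student's t with nu df
        x = rng.gauss(0.0, math.sqrt(sigma2))
        # crude term: h(X) evaluated at the simulated X
        crude.append(math.exp(-x * x))
        # Rao-Blackwellised term: E[h(X) | Y], with X integrated out exactly
        rao_blackwell.append(1.0 / math.sqrt(1.0 + 2.0 * sigma2))
    return crude, rao_blackwell


crude, rb = simulate(10_000)
print("crude estimate:", statistics.mean(crude), "variance:", statistics.variance(crude))
print("R-B estimate:  ", statistics.mean(rb), "variance:", statistics.variance(rb))
```

Both averages estimate the same expectation, but the conditional version removes the variability due to X given Y, hence cannot have a larger variance in this i.i.d. setting. (For Markov chain output, the comparison is more delicate.)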

practical Bayesian inference [book review]

Posted in Books, Kids, R, Statistics, University life on April 26, 2018 by xi'an

[Disclaimer: I received this book by Coryn Bailer-Jones for a review in the International Statistical Review and intend to submit a revised version of this post as my review. As usual, book reviews on the ‘Og reflect my own definitely personal and highly subjective views on the topic!]

It is always a bit of a challenge to review introductory textbooks as, on the one hand, they are rarely written at the level and with the focus one would personally choose to write them. And, on the other hand, it is all too easy to find issues with the material presented and the way it is presented… So be warned and proceed cautiously! In the current case, Practical Bayesian Inference tries to embrace too much, methinks, by starting from basic probability notions (that should not be unknown to physical scientists, I believe, and skipping them would avoid introducing a flat measure as a uniform distribution over the real line!, p.20) and going all the way to running MCMC for parameter estimation, comparing models by Bayesian evidence, and covering non-parametric regression and bootstrap resampling. For instance, priors only make their appearance on page 71, with a puzzling choice of an improper prior (?) leading to an improper posterior (??), which is certainly not the smoothest entry on the topic. “Improper posteriors are a bad thing“, indeed! And using truncation to turn them into proper distributions is not a clear improvement as the truncation point will significantly impact the inference. Discussing the choice of priors from the beginning has some appeal, but it may also create confusion in the novice reader (although one never knows!). Even asking about “what is a good prior?” (p.73) is not necessarily the best (nor my recommended) approach to a proper understanding of the Bayesian paradigm. And arguing about the uniqueness of the prior (p.119) clashes with my own view of the prior being primarily a reference measure rather than an ideal summary of the available information. (The book argues at some point that there is no fixed model parameter, another and connected source of disagreement.) There is a section on assigning priors (p.113), but it only covers the case of a possibly biased coin without much realism. A feature common to many Bayesian textbooks though.
To return to the issue of improper priors (and posteriors), the book includes several warnings about the danger of hitting an undefined posterior (still called a distribution), without providing real guidance on checking for its definition. (A tough question, to be sure.)

“One big drawback of the Metropolis algorithm is that it uses a fixed step size, the magnitude of which can hardly be determined in advance…”(p.165)

When introducing computational techniques, quadratic (or Laplace) approximation of the likelihood is mingled with kernel estimators, which does not seem appropriate. Proposing to check convergence and calibrate MCMC via ACF graphs is helpful in low dimensions, but not in larger dimensions. And while warning about the dangers of forgetting the Jacobians in the Metropolis-Hastings acceptance probability when using a transform like η=ln θ is well-taken, the loose handling of changes of variables may be more confusing than helpful (p.167). Discussing and providing two R codes for the (standard) Metropolis algorithm may prove too much. Or not. But using a four-page R code for fitting a simple linear regression with a flat prior (pp.182-186) may definitely put the reader off! Even though I deem the example a proper experiment in setting a Metropolis algorithm and appreciate the detailed description around the R code itself. (I just take exception to the paragraph on running the code with two or even one observation, as the fact that “the Bayesian solution always exists” (p.188) [under a proper prior] is not necessarily convincing…)
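As an illustration of the Jacobian issue with η=ln θ, here is a short Python sketch (written for this post, not the book's R code). It assumes a toy target of my own choosing, a Gamma(3,1) density on θ, and runs a random walk Metropolis algorithm on η. Since the density of η is π(e^η)e^η, the log acceptance ratio must include the extra term η' − η. The function names and the `with_jacobian` switch are hypothetical.

```python
import math
import random
import statistics


def log_target(theta, shape=3.0):
    """Unnormalised log density of a Gamma(shape, 1) target on theta > 0."""
    return (shape - 1.0) * math.log(theta) - theta


def metropolis_log_scale(n_iter, step=1.0, with_jacobian=True, seed=1):
    """Random walk Metropolis on eta = ln(theta); returns the theta draws."""
    rng = random.Random(seed)
    eta = 0.0  # start at theta = 1
    draws = []
    for _ in range(n_iter):
        prop = eta + rng.gauss(0.0, step)
        log_ratio = log_target(math.exp(prop)) - log_target(math.exp(eta))
        if with_jacobian:
            # d theta / d eta = exp(eta), hence the additive term prop - eta
            log_ratio += prop - eta
        if rng.random() < math.exp(min(0.0, log_ratio)):
            eta = prop
        draws.append(math.exp(eta))
    return draws


mean_with = statistics.fmean(metropolis_log_scale(50_000, with_jacobian=True))
mean_without = statistics.fmean(metropolis_log_scale(50_000, with_jacobian=False))
print("with Jacobian:   ", mean_with)     # target mean is 3
print("without Jacobian:", mean_without)  # chain targets Gamma(2, 1), mean 2
```

Dropping the Jacobian does not make the chain fail in any visible way: it simply converges to π(θ)/θ, here a Gamma(2,1) distribution, which is why the mistake is easy to miss.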

“In the real world we cannot falsify a hypothesis or model any more than we “truthify” it (…) All we can do is ask which of the available models explains the data best.” (p.224)

In a similar format, the discussion on testing of hypotheses starts with a lengthy presentation of classical tests and p-values, the chapter ending up with a list of issues, most of them reasonable from my own perspective. I also concur with the concluding remarks quoted above that what matters is a comparison of (all relatively false) models. What I agree less with [as predictable from earlier posts and papers] is the (standard) notion that comparing two models with a Bayes factor follows from the no information (in order to avoid the heavily loaded non-informative) prior weights of ½ and ½. Or similarly that the evidence is uniquely calibrated. Or, again, using a truncated improper prior under one of the assumptions (with the ghost of the Jeffreys-Lindley paradox lurking nearby…). While the Savage-Dickey approximation is mentioned, the first numerical resolution of the approximation to the Bayes factor is via simulations from the priors, which may be very poor in the situation of vague and uninformative priors. And then the deadly harmonic mean makes an entry (p.242), along with nested sampling… There is also a list of issues about Bayesian model comparison, including (strong) dependence on the prior, dependence on irrelevant alternatives, lack of goodness of fit tests, and computational costs, including calls to a possibly intractable likelihood function, ABC being then mentioned as a solution (which it is not, mostly).
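To see why simulating from the prior can be a poor way of approximating the evidence, here is a small Python sketch under assumptions of my own (it is not the book's example): a single observation x ~ N(θ, 1) with a N(0, τ²) prior on θ, for which the evidence is exactly the N(0, 1+τ²) density at x. The estimator averages the likelihood over prior draws, and the names `prior_mc_evidence` and `exact_evidence` are hypothetical.

```python
import math
import random
import statistics


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def prior_mc_evidence(x, tau, n=20_000, seed=2):
    """Average the likelihood over n draws from the N(0, tau^2) prior."""
    rng = random.Random(seed)
    likes = [normal_pdf(x, rng.gauss(0.0, tau), 1.0) for _ in range(n)]
    estimate = statistics.fmean(likes)
    # estimated relative standard error of the Monte Carlo average
    rel_se = statistics.stdev(likes) / (estimate * math.sqrt(n))
    return estimate, rel_se


def exact_evidence(x, tau):
    """Closed-form evidence: marginally, x ~ N(0, 1 + tau^2)."""
    return normal_pdf(x, 0.0, 1.0 + tau ** 2)


x_obs = 1.5
for tau in (1.0, 100.0):
    est, rel_se = prior_mc_evidence(x_obs, tau)
    print(f"tau={tau}: estimate={est:.5g}, exact={exact_evidence(x_obs, tau):.5g}, "
          f"relative s.e.={rel_se:.3g}")
```

With the vaguer prior, most simulated values of θ fall where the likelihood is negligible, so the relative error grows for a fixed simulation budget, and the effect only worsens with the dimension of the parameter.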


maximum likelihood: an introduction

Posted in Books, Statistics on December 20, 2014 by xi'an

“Basic Principle 0. Do not trust any principle.” L. Le Cam (1990)

Here is the abstract of a 1990 International Statistical Review paper by Lucien Le Cam on maximum likelihood. With ISR keeping a tradition of including an abstract in French for every paper, Le Cam (most presumably) wrote his own translation [or maybe wrote the French version first], which sounds much funnier to me and so I cannot resist posting both, pardon my/his French! [I just find “Ce fait” rather unusual, as I would have rather written “Ceci fait”…]:

Maximum likelihood estimates are reported to be best under all circumstances. Yet there are numerous simple examples where they plainly misbehave. One gives some examples for problems that had not been invented for the purpose of annoying maximum likelihood fans. Another example, imitated from Bahadur, has been specially created with just such a purpose in mind. Next, we present a list of principles leading to the construction of good estimates. The main principle says that one should not believe in principles but study each problem for its own sake.

L’auteur a ouï dire que la méthode du maximum de vraisemblance est la meilleure méthode d’estimation. C’est bien vrai, et pourtant la méthode se casse le nez sur des exemples bien simples qui n’avaient pas été inventés pour le plaisir de montrer que la méthode peut être très désagréable. On en donne quelques-uns, plus un autre, imité de Bahadur et fabriqué exprès pour ennuyer les admirateurs du maximum de vraisemblance. Ce fait, on donne une savante liste de principes de construction de bons estimateurs, le principe principal étant qu’il ne faut pas croire aux principes.

The entire paper is just as witty, as in describing the mixture model as “contaminated and not fit to drink”! Or in “Everybody knows that taking logarithms is unfair”. Or, again, in “biostatisticians, being complicated people, prefer to work out not with the dose y but with its logarithm”… And a last line: “One possibility is that there are too many horse hairs in e”.

someone who might benefit from increased contacts with the statistical community

Posted in Books, Statistics on July 23, 2012 by xi'an

A (kind of automated) email I got today:

Your name has come to our attention as someone who might benefit from increased contacts with the international statistical community. Given your professional interests and your statistical background (noting your publication ‘Reading Keynes’ Treatise on Probability’ in the journal International Statistical Review, volume 79, 2011), you should consider elected membership in the International Statistical Institute (ISI).

Hmmm, thanks but no thanks, I am not certain I need to become a member of the ISI to increase my contacts with the international statistical community! (Disclaimer: This post makes fun of the anonymous emailing, not of the ISI!)

Confidence distributions

Posted in Books, Statistics, Travel, University life on June 11, 2012 by xi'an

I was asked by the International Statistical Review editor, Marc Hallin, for a discussion of the paper “Confidence distribution, the frequentist distribution estimator of a parameter: a review” by Min-ge Xie and Kesar Singh, both from Rutgers University. Although the paper is not available on-line, similar and recent reviews and articles can be found in a 2007 IMS Monograph and a 2012 JASA paper, both with Bill Strawderman, as well as in a chapter in the recent Festschrift for Bill Strawderman. The notion of confidence distribution is quite similar to the one of fiducial distribution, introduced by R.A. Fisher, and they both share in my opinion the same drawback, namely that they aim at a distribution over the parameter space without specifying (at least explicitly) a prior distribution. Furthermore, the way the confidence distribution is defined perpetuates the on-going confusion between confidence and credible intervals, in that the cdf on the parameter θ is derived via the inversion of a confidence upper bound (or, equivalently, of a p-value…). Even though this inversion properly defines a cdf on the parameter space, there is no particular validity in the derivation. Either the confidence distribution corresponds to a genuine posterior distribution, in which case I think the only possible interpretation is a Bayesian one. Or the confidence distribution does not correspond to a genuine posterior distribution, because no prior can lead to this distribution, in which case there is a probabilistic impossibility in using this distribution. Thus, as a result, my discussion (now posted on arXiv) is rather negative about the benefits of this notion of confidence distribution.

One entry in the review, albeit peripheral, attracted my attention. The authors mention a tech’ report where they exhibit a paradoxical behaviour of a Bayesian procedure: given a (skewed) prior on a pair (p0,p1), and a binomial likelihood, the posterior distribution on p1-p0 has its main mass in the tails of both the prior and the likelihood (“the marginal posterior of d = p1-p0 is more extreme than its prior and data evidence!”). The information provided in the paper is rather sparse on the genuine experiment and looking at two possible priors exhibited nothing of the kind… I went to the authors’ webpages and found a more precise explanation on Min-ge Xie’s page:

Although the contour plot of the posterior distribution sits between those of the prior distribution and the likelihood function, its projected peak is more extreme than the other two. Further examination suggests that this phenomenon is genuine in binomial clinical trials and it would not go away even if we adopt other (skewed) priors (for example, the independent beta priors used in Joseph et al. (1997)). In fact, as long as the center of a posterior distribution is not on the line joining the two centers of the joint prior and likelihood function (as it is often the case with skewed distributions), there exists a direction along which the marginal posterior fails to fall between the prior and likelihood function of the same parameter.

and a link to another paper. Reading through the paper (and in particular Section 4), it appears that the above “paradoxical” picture is the result of the projections of the joint distributions represented in this second picture. By projection, I presume the authors mean integrating out the second component, e.g. p1+p0. This indeed provides the marginal prior of p1-p0, the marginal posterior of p1-p0, but…not the marginal likelihood of p1-p0! This entity is not defined, once again because there is no reference measure on the parameter space which could justify integrating out some parameters in the likelihood. (Overall, I do not think the “paradox” is overwhelming: the joint posterior distribution does precisely the merging of prior and data information we would expect and it is not like the marginal posterior is located in zones with zero prior probability and zero (profile) likelihood. I am also always wary of arguments based on modes, since those are highly dependent on parameterisation.)

Most unfortunately, when searching for more information on the authors’ webpages, I came upon the sad news that Professor Singh had passed away three weeks ago, at the age of 56. (Professor Xie wrote a touching eulogy of his friend and co-author.) I had only met briefly with Professor Singh during my visit to Rutgers two months ago, but he sounded like an academic who would have enjoyed the kind of debate drafted by my discussion. To the much more important loss to family, friends and faculty represented by Professor Singh’s demise, I thus add the loss of missing the intellectual challenge of crossing arguments with him. And I look forward to discussing the issues with the first author of the paper, Professor Xie.