## Archive for invariance

Posted in Books, Kids, Statistics, University life with tags , , , , , , on December 4, 2014 by xi'an

Today was the second session of our Reading Classics Seminar for the academic year 2014-2015. I have not reported on this seminar so far because it had starting problems, namely hardly any students were present at the first classes, hence several re-starts until we reached a small group of interested students. Actually, this is the final year for my TSI Master at Paris-Dauphine, as it will be integrated within the new MASH Master next year. The latter started this year and drew away half of our potential applicants, presumably because of its wider spectrum, covering machine-learning, optimisation, programming and a tiny bit of statistics… If we manage to salvage [within the new Master] our speciality of offering the only Bayesian statistics training in France, this will not be a complete disaster!

Anyway, the first seminar was about the great 1939 Biometrika paper by Pitman about the best invariant estimator appearing magically as a Bayes estimator! Alas, the student did not grasp the invariance part and hence focussed on less relevant technical parts, which was not a great experience (and therefore led me to abstain from posting the slides here). The second paper was not on my list but was proposed by another student as of yesterday, when he realised he was to present today! This paper, entitled “The Counter-intuitive Non-informative Prior for the Bernoulli Family”, was published in the Journal of Statistics Education in 2004 by Zu and Liu. I had not heard of the paper (or of the journal) previously and I do not think it is worth advertising any further, as it gives a very poor entry to non-informative priors in the simplest of settings, namely for Bernoulli B(p) observations. Indeed, the stance of the paper is to define a non-informative prior as one returning the MLE of p as its posterior expectation (missing altogether the facts that such a definition is not parameterisation-invariant and that, given the modal nature of the MLE, a posterior mode would be much more appropriate, the uniform prior on p then providing a solution), and to conclude that the corresponding prior is made of two Dirac masses at 0 and 1! Which again misses several key points, like properly defining convergence in a space of probability distributions and using an improper prior differently from a proper prior. Especially since, in the next section, the authors switch to Haldane’s prior, namely the Be(0,0) distribution! A prior that cannot be used since the posterior is not defined when all the observations are identical. Certainly not a paper to make it to the list! (My student simply pasted pages from this paper as his slides and so I see again no point in reposting them here.)
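To see what goes wrong in the simplest terms, here is a quick numerical check of mine (not taken from the paper): under a symmetric Be(ε,ε) prior, the posterior mean of p converges to the MLE as ε goes to zero, which is the two-Dirac-masses limit the authors end up with, while the Haldane Be(0,0) prior fails outright on constant samples.

```python
def posterior_mean(successes, n, a, b):
    """Posterior mean of p under a Beta(a, b) prior, with
    `successes` successes out of n Bernoulli observations."""
    return (successes + a) / (n + a + b)

n, s = 10, 7          # e.g. 7 successes out of 10, so the MLE is 0.7

# Uniform Be(1,1) prior: the posterior mean (s+1)/(n+2) differs from the MLE.
print(posterior_mean(s, n, 1, 1))

# As the symmetric Be(eps, eps) prior piles its mass near 0 and 1,
# the posterior mean converges to the MLE s/n:
for eps in (1.0, 0.1, 0.01, 1e-6):
    print(eps, posterior_mean(s, n, eps, eps))

# Haldane's Be(0,0) prior: the posterior is Be(s, n - s), which is
# improper (hence undefined) when s = 0 or s = n, i.e. when all
# observations are identical.
```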

## label switching in Bayesian mixture models

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on October 31, 2014 by xi'an

A referee of our paper with Jeong Eun Lee on approximating the evidence for mixture models pointed out the recent paper by Carlos Rodríguez and Stephen Walker on label switching in Bayesian mixture models: deterministic relabelling strategies. Which appeared this year in JCGS and went beyond, below or above my radar.

Label switching is an issue with mixture estimation (and other latent variable models) because mixture models are ill-posed models where part of the parameter is not identifiable. Indeed, the density of a mixture being a sum of terms

$\sum_{j=1}^k \omega_j f(y|\theta_j)$

the parameter (vector) of the ω’s and of the θ’s is at best identifiable up to an arbitrary permutation of the components of the above sum. In other words, “component #1 of the mixture” is not a meaningful concept. And hence cannot be estimated.
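This invariance is immediate to check numerically. The sketch below, with hypothetical Gaussian components, verifies that jointly permuting the weights and the component parameters leaves the mixture density unchanged at any point:

```python
from itertools import permutations
from math import exp, pi, sqrt, isclose

def norm_pdf(y, mu, sd=1.0):
    return exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

def mixture_density(y, weights, thetas):
    """Density at y of a k-component Gaussian mixture."""
    return sum(w * norm_pdf(y, t) for w, t in zip(weights, thetas))

weights = [0.2, 0.3, 0.5]
thetas = [-1.0, 0.0, 2.0]
y = 0.7

base = mixture_density(y, weights, thetas)
# Relabelling the components (same permutation on weights and thetas)
# returns the exact same density: "component #1" has no intrinsic meaning.
for perm in permutations(range(3)):
    assert isclose(mixture_density(y, [weights[i] for i in perm],
                                      [thetas[i] for i in perm]), base)
```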

This problem has been known for quite a while, much prior to EM and MCMC algorithms for mixtures, but it is only since mixtures have become truly estimable by Bayesian approaches that the debate has grown on this issue. In the very early days, Jean Diebolt and I proposed ordering the components in a unique way to give them a meaning. For instance, “component #1” would then be the component with the smallest mean or the smallest weight and so on… Later, in one of my favourite X papers, with Gilles Celeux and Merrilee Hurn, we exposed the convergence issues related with the non-identifiability of mixture models, namely that the posterior distributions were almost always multimodal, with a multiple of k! symmetric modes in the case of exchangeable priors, and therefore that Markov chains would have trouble visiting all those modes in a symmetric manner, despite the symmetry being guaranteed by the shape of the posterior. And we concluded with the slightly provocative statement that hardly any Markov chain inferring about mixture models had ever converged! In parallel, time-wise, Matthew Stephens had completed a thesis at Oxford on the same topic and proposed solutions for relabelling MCMC simulations in order to identify a single mode and hence produce meaningful estimators. Giving another meaning to the notion of “component #1”.
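Our early ordering device fits in a few lines. A minimal sketch, assuming (hypothetically) that each MCMC draw stores the component mean as its first parameter:

```python
def relabel_by_mean(draws):
    """Reorder the components of each MCMC draw so that "component #1"
    is always the one with the smallest mean, "component #2" the next,
    and so on. Each draw is a list of per-component parameter tuples
    whose first entry is the component mean."""
    return [sorted(draw, key=lambda comp: comp[0]) for draw in draws]

# Two draws from a 2-component mixture, stored as (mean, weight),
# with a label switch between the two iterations:
draws = [[(-1.0, 0.3), (2.0, 0.7)],
         [(2.1, 0.7), (-0.9, 0.3)]]
relabelled = relabel_by_mean(draws)
# After relabelling, the small-mean component is always listed first.
```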

And then the topic began to attract more and more researchers, being both simple to describe and frustrating in its lack of a definitive answer, from both simulation and inference perspectives. Rodríguez and Walker’s paper provides a survey of label switching strategies in the Bayesian processing of mixtures, but its innovative part is in deriving a relabelling strategy. Which consists of finding the optimal permutation (at each iteration of the Markov chain) by minimising a loss function inspired by k-means clustering. Which is connected with both Stephens’ and our [JASA, 2000] loss functions. The performances of this new version are shown to be roughly comparable with those of other relabelling strategies in the case of Gaussian mixtures. (Making me wonder if the choice of the loss function is not favourable to Gaussian mixtures.) And somehow faster than Stephens’ Kullback-Leibler loss approach.
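Not Rodríguez and Walker’s actual algorithm, but the generic shape of such a loss-based relabelling step can be sketched as a brute-force search over the k! permutations (only sensible for small k), aligning each draw to a pivot in the spirit of a k-means loss:

```python
from itertools import permutations

def relabel_to_pivot(draws, pivot):
    """For each MCMC draw (a list of k per-component parameter vectors),
    pick the permutation of components minimising the squared distance
    to a fixed pivot (e.g. running cluster centres)."""
    k = len(pivot)
    aligned = []
    for theta in draws:
        best = min(permutations(range(k)),
                   key=lambda p: sum((theta[p[j]][i] - pivot[j][i]) ** 2
                                     for j in range(k)
                                     for i in range(len(pivot[j]))))
        aligned.append([theta[j] for j in best])
    return aligned

pivot = [[-1.0], [2.0]]
draws = [[[2.2], [-1.1]],    # a label-switched draw
         [[-0.9], [1.9]]]    # an already aligned draw
aligned = relabel_to_pivot(draws, pivot)
# Both draws now list the component closest to -1 first.
```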

“Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups that make [all] ergodic averages to estimate characteristics of the components useless.”

One section of the paper puzzles me, even though it does not impact the methodology or the conclusions. In Section 2.1 (p.27), the authors consider the quantity

$p(z_i=j|{\mathbf y})$

which is the marginal probability of allocating observation i to cluster or component j. Under an exchangeable prior, this quantity is uniformly equal to 1/k for all observations i and all components j, by virtue of the invariance under permutation of the indices… So at best this can serve as a control variate. Later, in Section 2.2 (p.28), the above sentence does signal a problem with those averages, but it seems to attribute it to MCMC behaviour rather than to the invariance of the posterior (or to the non-identifiability of the components per se). Lastly, the paper mentions that “given the allocations, the likelihood is invariant under permutations of the parameters and the allocations” (p.28), which is not correct, since eqn. (8)

$f(y_i|\theta_{\sigma(z_i)}) =f(y_i|\theta_{\tau(z_i)})$

does not hold when the two permutations σ and τ give different images of z_i.
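Back to the Section 2.1 point: a tiny symmetrisation check (my own illustration) shows why an exchangeable prior forces p(z_i=j|y) = 1/k, since averaging any labelled allocation matrix over the k! relabellings of the components returns the uniform value:

```python
from itertools import permutations

# Hypothetical allocation probabilities q[i][j] = p(z_i = j | y) computed
# within one labelled mode of the posterior (each row sums to one):
q = [[0.8, 0.15, 0.05],
     [0.1, 0.6, 0.3]]
k = 3
perms = list(permutations(range(k)))

# The exact posterior is the equally weighted average over all k!
# relabellings, so each allocation probability averages out to 1/k:
sym = [[sum(row[p[j]] for p in perms) / len(perms) for j in range(k)]
       for row in q]
print(sym)   # every entry is (numerically) 1/3
```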

## posterior likelihood ratio is back

Posted in Statistics, University life with tags , , , , , , , , , on June 10, 2014 by xi'an

“The PLR turns out to be a natural Bayesian measure of evidence of the studied hypotheses.”

Isabelle Smith and André Ferrari just arXived a paper on the posterior distribution of the likelihood ratio. This is in line with Murray Aitkin’s notion of considering the likelihood ratio

$f(x|\theta_0) / f(x|\theta)$

as a prior quantity, when contemplating the null hypothesis that θ is equal to θ0. (Also advanced by Alan Birnbaum and Arthur Dempster.) A concept we criticised (rather strongly) in our Statistics and Risk Modelling paper with Andrew Gelman and Judith Rousseau.  The arguments found in the current paper in defence of the posterior likelihood ratio are quite similar to Aitkin’s:

• defined for (some) improper priors;
• invariant under observation or parameter transforms;
• more informative than the posterior mean of the likelihood ratio, not-so-incidentally equal to the Bayes factor;
• avoiding using the posterior mean for an asymmetric posterior distribution;
• achieving some degree of reconciliation between Bayesian and frequentist perspectives, e.g. by being equal to some p-values;
• easily computed by MCMC means (if need be).
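On the last bullet, a back-of-the-envelope sketch (my own toy example, not the authors’ implementation): with x|θ ~ N(θ,1) and a flat prior, θ|x ~ N(x,1), so the posterior distribution of the likelihood ratio follows from straight posterior simulation:

```python
import math
import random

random.seed(1)

x, theta0 = 1.5, 0.0   # observation and null value

def lik(theta):
    """N(theta, 1) likelihood of the observation x (up to a constant)."""
    return math.exp(-0.5 * (x - theta) ** 2)

# Flat prior => posterior theta | x ~ N(x, 1):
draws = [random.gauss(x, 1.0) for _ in range(20000)]
plr_sample = [lik(theta0) / lik(t) for t in draws]

# e.g. the posterior probability that the likelihood ratio exceeds one,
# here equal to P(|theta - x| > |x - theta0|) = 2 * Phi(-1.5), about 0.134:
prob = sum(r > 1.0 for r in plr_sample) / len(plr_sample)
print(prob)
```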

One generalisation found in the paper handles the case of composite versus composite hypotheses, of the form

$\int\mathbb{I}\left( p(x|\theta_1) < p(x|\theta_2) \right) \pi_1(\theta_1|x)\,\pi_2(\theta_2|x)\,\text{d}\theta_1\,\text{d}\theta_2$

which brings back an earlier criticism I raised (in Edinburgh, at ICMS, where, as one of those coincidences, I read this paper!), namely that using the product of the marginals rather than the joint posterior is no more a standard Bayesian practice than using the data in a prior quantity. And leads to multiple uses of the data. Hence, having already delivered my perspective on this approach in the past, I do not feel the urge to “raise the flag” once again about a paper that is otherwise well-documented and mathematically rich.
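To make the criticism concrete, the composite-versus-composite quantity is a probability computed under the product of the two marginal posteriors, as in this toy simulation (hypothetical normal posteriors, chosen only to exercise the formula):

```python
import math
import random

random.seed(2)

x = 1.0   # the same observation enters both the likelihood and the posteriors

def lik(theta):
    """N(theta, 1) likelihood of x, up to a constant."""
    return math.exp(-0.5 * (x - theta) ** 2)

# Draw theta_1 and theta_2 INDEPENDENTLY from their respective posteriors,
# i.e. from the product of the marginals rather than a joint posterior:
m = 50000
hits = 0
for _ in range(m):
    t1 = random.gauss(0.5, math.sqrt(0.5))   # hypothetical posterior 1
    t2 = random.gauss(1.5, math.sqrt(0.5))   # hypothetical posterior 2
    hits += lik(t1) < lik(t2)
plr_est = hits / m
print(plr_est)   # Monte Carlo estimate of the indicator's expectation
```

(In this symmetric toy case the true value is 1/2; the point is only that x is used twice, once in each posterior and once in the likelihoods.)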

## Bayesian variable selection redux

Posted in Statistics, University life with tags , , , , , on July 11, 2011 by xi'an

After a rather long interlude, and just in time for the six-month deadline!, we (Gilles Celeux, Mohammed El Anbari, Jean-Michel Marin and myself) have resubmitted (and re-arXived) our comparative study of Bayesian and non-Bayesian variable selection procedures to Bayesian Analysis. Why it took us so long is a combination of good and bad reasons: besides being far apart, between Morocco, Paris and Montpellier, and running too many projects at once with Jean-Michel (including the Bayesian Core revision, which has not moved much since last summer!), we came to realise that my earlier strong stance that invariance on the intercept did not matter was wrong, and that the (kind) reviewers were correct about the asymptotic impact of the scale of the intercept on the variable selection. So we first had to reconvene and think about it before running another large round of simulations. We hope the picture is now clearer.

## Thesis defense in València

Posted in Statistics, Travel, University life, Wines with tags , , , , , , on February 25, 2011 by xi'an

On Monday, I took part in the jury for the PhD thesis of Anabel Forte Deltel, in the department of statistics of the Universitat de València. The topic of the thesis was variable selection in Gaussian linear models using an objective Bayes approach. Completely on my own research agenda! I had already discussed with Anabel in Zürich, where she presented a poster and gave me a copy of her thesis, so I could concentrate on the fundamentals of her approach during the defense. Her approach extends Liang et al.’s (2008, JASA) hyper-g prior via a complete analysis of the conditions set by Jeffreys in his book for constructing such priors. She is therefore able to motivate a precise value for most hyperparameters (although some choices were mainly based on computational reasons, opposing the 2F1 and Appell’s F1 hypergeometric functions). She also defends the use of an improper prior by an invariance argument that leads to the standard Jeffreys’ prior on location-scale parameters. (This is where I prefer the approach in Bayesian Core, which does not discriminate between a subset of the covariates including the intercept and the other covariates. Even though it is not invariant under location-scale transforms.) After the defense, Jim Berger pointed out to me that the modelling allowed for the subset to be empty, which would then cancel my above objection! In conclusion, this thesis could well set a reference prior (if not in José Bernardo’s sense of the term!) for Bayesian linear model analysis in the coming years.

## Back from Philly

Posted in R, Statistics, Travel, University life with tags , , , , , , , , , on December 21, 2010 by xi'an

$f(i) \propto \log_{10}(1+\frac{1}{i})$
and not that $\log(X) \,(\text{mod}\,1)$ is uniform, which is the presentation given in the arXived note… The former is also the interpretation of William Feller (page 63, Introduction to Probability Theory), contrary to what the arXived note seems to imply on page 2, but Feller indeed mentioned, as an informal/heuristic argument in favour of Benford’s Law, that when the spread of the rv X is large, $\log(X)$ is approximately uniformly distributed. (I would not call this a “fundamental flaw”.) The arXived note is then right in pointing out the lack of foundation for Feller’s heuristic, if muddling the issue by defining several non-equivalent versions of Benford’s Law. It is also funny that this arXived note picks at the scale-invariant characterisation of Benford’s Law when Terry Tao’s entry represents it as a special case of Haar measure!
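Feller’s heuristic is easy to check by simulation (a sketch of mine, assuming nothing from the note): a log-normal variable with large σ spreads log10(X) over many units, making it nearly uniform modulo 1 and the leading digits near-Benford:

```python
import math
import random

random.seed(0)

# Benford's first-digit probabilities, P(D = i) = log10(1 + 1/i):
benford = [math.log10(1 + 1 / i) for i in range(1, 10)]

def first_digit(x):
    """Leading decimal digit of a positive float."""
    return int(f"{x:.15e}"[0])

# A widely spread positive variable: X = exp(N(0, 10^2)), so that
# log10(X) has standard deviation 10 / ln(10), about 4.3 decades.
sample = [math.exp(random.gauss(0, 10)) for _ in range(50000)]
digits = [first_digit(x) for x in sample]
freqs = [digits.count(d) / len(digits) for d in range(1, 10)]
# freqs should track benford: roughly .301, .176, .125, ...
```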