## all models are wrong

Posted in Statistics, University life with tags , , , , , , , on September 27, 2014 by xi'an

“Using ABC to evaluate competing models has various hazards and comes with recommended precautions (Robert et al. 2011), and unsurprisingly, many if not most researchers have a healthy scepticism as these tools continue to mature.”

Michael Hickerson just published an open-access letter with the above title in Molecular Ecology. (As in several earlier papers, incl. the (in)famous ones by Templeton, Hickerson confuses running an ABC algorithm with conducting Bayesian model comparison, but this is not the main point of this post.)

“Rather than using ABC with weighted model averaging to obtain the three corresponding posterior model probabilities while allowing for the handful of model parameters (θ, τ, γ, Μ) to be estimated under each model conditioned on each model’s posterior probability, these three models are sliced up into 143 ‘submodels’ according to various parameter ranges.”

The letter is in fact a supporting argument for the earlier paper of Pelletier and Carstens (2014, Molecular Ecology) which conducted the above splitting experiment. I could not read this paper so cannot judge of the relevance of splitting this way the parameter range. From what I understand it amounts to using mutually exclusive priors by using different supports.

“Specifically, they demonstrate that as greater numbers of the 143 sub-models are evaluated, the inference from their ABC model choice procedure becomes increasingly.”

An interestingly cut sentence. Increasingly unreliable? mediocre? weak?

“…with greater numbers of models being compared, the most probable models are assigned diminishing levels of posterior probability. This is an expected result…”

True, if the number of models under consideration increases, under a uniform prior over model indices, the posterior probability of a given model mechanically decreases. But the pairwise Bayes factors should not be impacted by the number of models under comparison and the letter by Hickerson states that Pelletier and Carstens found the opposite:

“…pairwise Bayes factor[s] will always be more conservative except in cases when the posterior probabilities are equal for all models that are less probable than the most probable model.”

Which means that the “Bayes factor” in this study is computed as the ratio of a marginal likelihood and of a compound (or super-marginal) likelihood, averaged over all models and hence incorporating the prior probabilities of the model indices as well. I had never encountered such a proposal before. Contrary to the letter’s claim:

“…using the Bayes factor, incorporating all models is perhaps more consistent with the Bayesian approach of incorporating all uncertainty associated with the ABC model choice procedure.”

Besides the needless inclusion of ABC in this sentence, a somewhat confusing sentence, as Bayes factors are not, stricto sensu, Bayesian procedures since they remove the prior probabilities from the picture.

“Although the outcome of model comparison with ABC or other similar likelihood-based methods will always be dependent on the composition of the model set, and parameter estimates will only be as good as the models that are used, model-based inference provides a number of benefits.”

All models are wrong but the very fact that they are models allows for producing pseudo-data from those models and for checking if the pseudo-data is similar enough to the observed data. In components that matters the most for the experimenter. Hence a loss function of sorts…

## my life as a mixture [BAYSM 2014, Wien]

Posted in Books, Kids, Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , on September 12, 2014 by xi'an

Next week I am giving a talk at BAYSM in Vienna. BAYSM is the Bayesian Young Statisticians meeting so one may wonder why, but with Chris Holmes and Mike West, we got invited as more… erm… senior speakers! So I decided to give a definitely senior talk on a thread pursued throughout my career so far, namely mixtures. Plus it also relates to works of the other senior speakers. Here is the abstract for the talk:

Mixtures of distributions are fascinating objects for statisticians in that they both constitute a straightforward extension of standard distributions and offer a complex benchmark for evaluating statistical procedures, with a likelihood both computable in a linear time and enjoying an exponential number of local models (and sometimes infinite modes). This fruitful playground appeals in particular to Bayesians as it constitutes an easily understood challenge to the use of improper priors and of objective Bayes solutions. This talk will review some ancient and some more recent works of mine on mixtures of distributions, from the 1990 Gibbs sampler to the 2000 label switching and to later studies of Bayes factor approximations, nested sampling performances, improper priors, improved importance samplers, ABC, and a inverse perspective on the Bayesian approach to testing of hypotheses.

I am very grateful to the scientific committee for this invitation, as it will give me the opportunity to meet the new generation, learn from them and in addition discover Vienna where I have never been, despite several visits to Austria. Including its top, the Großglockner. I will also give a seminar in Linz the day before. In the Institut für Angewandte Statistik.

## independent component analysis and p-values

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , on September 8, 2014 by xi'an

Last morning at the neuroscience workshop Jean-François Cardoso presented independent component analysis though a highly pedagogical and enjoyable tutorial that stressed the geometric meaning of the approach, summarised by the notion that the (ICA) decomposition

$X=AS$

of the data X seeks both independence between the columns of S and non-Gaussianity. That is, getting as away from Gaussianity as possible. The geometric bits came from looking at the Kullback-Leibler decomposition of the log likelihood

$-\mathbb{E}[\log L(\theta|X)] = KL(P,Q_\theta) + \mathfrak{E}(P)$

where the expectation is computed under the true distribution P of the data X. And Qθ is the hypothesised distribution. A fine property of this decomposition is a statistical version of Pythagoreas’ theorem, namely that when the family of Qθ‘s is an exponential family, the Kullback-Leibler distance decomposes into

$KL(P,Q_\theta) = KL(P,Q_{\theta^0}) + KL(Q_{\theta^0},Q_\theta)$

where θ⁰ is the expected maximum likelihood estimator of θ. (We also noticed this possibility of a decomposition in our Kullback-projection variable-selection paper with Jérôme Dupuis.) The talk by Aapo Hyvärinen this morning was related to Jean-François’ in that it used ICA all the way to a three-level representation if oriented towards natural vision modelling in connection with his book and the paper on unormalised models recently discussed on the ‘Og.

On the afternoon, Eric-Jan Wagenmaker [who persistently and rationally fight the (ab)use of p-values and who frequently figures on Andrew's blog] gave a warning tutorial talk about the dangers of trusting p-values and going fishing for significance in existing studies, much in the spirit of Andrew’s blog (except for the defence of Bayes factors). Arguing in favour of preregistration. The talk was full of illustrations from psychology. And included the line that ESP testing is the jester of academia, meaning that testing for whatever form of ESP should be encouraged as a way to check testing procedures. If a procedure finds a significant departure from the null in this setting, there is something wrong with it! I was then reminded that Eric-Jan was one of the authors having analysed Bem’s controversial (!) paper on the “anomalous processes of information or energy transfer that are currently unexplained in terms of known physical or biological mechanisms”… (And of the shocking talk by Jessica Utts on the same topic I attended in Australia two years ago.)

## JSM 2014, Boston [#3]

Posted in Statistics, University life with tags , , , , , , , on August 8, 2014 by xi'an

Today I gave a talk in the Advances in model selection session. Organised by Veronika Rockova and Ed George. (A bit of pre-talk stress: I actually attempted to change my slides at 5am and only managed to erase the current version! I thus left early enough to stop by the presentation room…) Here are the final slides, which have much in common with earlier versions, but also borrowed from Jean-Michel Marin’s talk in Cambridge. A posteriori, I think the talk missed one slide on the practical run of the ABC random forest algorithm, since later questions showed miscomprehension from the audience.

The other talks in this session were by Andreas Buja [whom I last met in Budapest last year] on valid post-modelling inference. A very relevant reflection on the fundamental bias in statistical modelling. Then by Nick Polson, about efficient ways to compute MAP for objective functions that are irregular.  Great entry into optimisation methods I had never heard of earlier.! (The abstract is unrelated.) And last but not least by Veronika Rockova, on mixing Indian buffet processes with spike-and-slab priors for factor analysis with unknown numbers of factors. A definitely advanced contribution to factor analysis, with a very nice idea of introducing a non-identifiable rotation to align on orthogonal designs. (Here too the abstract is unrelated, a side effect of the ASA requiring abstracts sent very long in advance.)

Although discussions lasted well into the following Bayesian Inference: Theory and Foundations session, I managed to listen to a few talks there. In particular, a talk by Keli Liu on constructing non-informative priors. A question of direct relevance. The notion of objectivity is to achieve a frequentist distribution of the Bayes factor associated with the point null that is constant. Or has a constant quantile at a given level. The second talk by Alexandra Bolotskikh related to older interests of mine’s, namely the construction of improved confidence regions in the spirit of Stein. (Not that surprising, given that a coauthor is Marty Wells, who worked with George and I on the topic.) A third talk by Abhishek Pal Majumder (jointly with Jan Hanning) dealt on a new type of fiducial distributions, with matching prior properties. This sentence popped a lot over the past days, but this is yet another area where I remain puzzled by the very notion. I mean the notion of fiducial distribution. Esp. in this case where the matching prior gets even closer to being plain Bayesian.

## AppliBUGS day celebrating Jean-Louis Foulley

Posted in pictures, Statistics, University life with tags , , , , , , on June 10, 2014 by xi'an

In case you are in Paris tomorrow and free, there will be an AppliBUGS day focussing on the contributions of our friend Jean-Louis Foulley. (And a regular contributor to the ‘Og!) The meeting takes place in the ampitheatre on second floor of  ENGREF-Montparnasse (19 av du Maine, 75015 Paris, Métro Montparnasse Bienvenüe). I will give a part of the O’Bayes tutorial on alternatives to the Bayes factor.

## a refutation of Johnson’s PNAS paper

Posted in Books, Statistics, University life with tags , , , , , , , on February 11, 2014 by xi'an

Jean-Christophe Mourrat recently arXived a paper “P-value tests and publication bias as causes for high rate of non-reproducible scientific results?”, intended as a rebuttal of Val Johnson’s PNAS paper. The arguments therein are not particularly compelling. (Just as ours’ may sound so to the author.)

“We do not discuss the validity of this [Bayesian] hypothesis here, but we explain in the supplementary material that if taken seriously, it leads to incoherent results, and should thus be avoided for practical purposes.”

The refutation is primarily argued as a rejection of the whole Bayesian perspective. (Although we argue Johnson’ perspective is not that Bayesian…) But the argument within the paper is much simpler: if the probability of rejection under the null is at most 5%, then the overall proportion of false positives is also at most 5% and not 20% as argued in Johnson…! Just as simple as this. Unfortunately, the author mixes conditional and unconditional, frequentist and Bayesian probability models. As well as conditioning upon the data and conditioning upon the rejection region… Read at your own risk. Continue reading

## On the use of marginal posteriors in marginal likelihood estimation via importance-sampling

Posted in R, Statistics, University life with tags , , , , , , , , , , , , , on November 20, 2013 by xi'an

Perrakis, Ntzoufras, and Tsionas just arXived a paper on marginal likelihood (evidence) approximation (with the above title). The idea behind the paper is to base importance sampling for the evidence on simulations from the product of the (block) marginal posterior distributions. Those simulations can be directly derived from an MCMC output by randomly permuting the components. The only critical issue is to find good approximations to the marginal posterior densities. This is handled in the paper either by normal approximations or by Rao-Blackwell estimates. the latter being rather costly since one importance weight involves B.L computations, where B is the number of blocks and L the number of samples used in the Rao-Blackwell estimates. The time factor does not seem to be included in the comparison studies run by the authors, although it would seem necessary when comparing scenarii.

After a standard regression example (that did not include Chib’s solution in the comparison), the paper considers  2- and 3-component mixtures. The discussion centres around label switching (of course) and the deficiencies of Chib’s solution against the current method and Neal’s reference. The study does not include averaging Chib’s solution over permutations as in Berkoff et al. (2003) and Marin et al. (2005), an approach that does eliminate the bias. Especially for a small number of components. Instead, the authors stick to the log(k!) correction, despite it being known for being quite unreliable (depending on the amount of overlap between modes). The final example is Diggle et al. (1995) longitudinal Poisson regression with random effects on epileptic patients. The appeal of this model is the unavailability of the integrated likelihood which implies either estimating it by Rao-Blackwellisation or including the 58 latent variables in the analysis.  (There is no comparison with other methods.)

As a side note, among the many references provided by this paper, I did not find trace of Skilling’s nested sampling or of safe harmonic means (as exposed in our own survey on the topic).