**E**dwin Fong and Chris Holmes (Oxford) just wrote a paper on Bayesian scalable methods from a M-open perspective. Borrowing from the conformal prediction framework of Vovk et al. (2005) to achieve frequentist coverage for prediction intervals. The method starts with the choice of a conformity measure that measures how well each observation in the sample agrees with the sample. Which is exchangeable and hence leads to a rank statistic from which a p-value can be derived. Which is the empirical cdf associated with the observed conformities. Following Vovk et al. (2005) and Wasserman (2011) Edwin and Chris note that the Bayesian predictive itself acts like a conformity measure. Predictive that can itself be approximated by MCMC and importance sampling (possibly smoothed by Pareto). The paper also expands the setting to partial exchangeable models, renamed group conformal predictions. While reluctant to engage into turning Bayesian solutions into frequentist ones, I can see some worth in deriving both in order to expose discrepancies and hence signal possible issues with models and priors.

## Archive for p-value

## Conformal Bayesian Computation

Posted in Books, pictures, Statistics, University life with tags Bayesian predictive, conformal prediction, frequentist coverage, p-value, Pareto smoothed importance sampling, University of Oxford on July 8, 2021 by xi'an## over-confident about mis-specified models?

Posted in Books, pictures, Statistics, University life with tags ABC, ABC model choice, all models are wrong, Bayesian model comparison, Charles Darwin, DIC, Kullback-Leibler divergence, model posterior probabilities, National Academy of Science, Ockham's razor, On the Origin of Species, p-value, phylogenetic models, PNAS, random forests, Steve Fienberg on April 30, 2019 by xi'an**Z**iheng Yang and Tianqui Zhu published a paper in PNAS last year that criticises Bayesian posterior probabilities used in the comparison of models under misspecification as “overconfident”. The paper is written from a phylogeneticist point of view, rather than from a statistician’s perspective, as shown by the Editor in charge of the paper [although I thought that, after Steve Fienberg‘s intervention!, a statistician had to be involved in a submission relying on statistics!] a paper , but the analysis is rather problematic, at least seen through my own lenses… With no statistical novelty, apart from looking at the distribution of posterior probabilities in toy examples. The starting argument is that Bayesian model comparison is often reporting posterior probabilities in favour of a particular model that are close or even equal to 1.

“The Bayesian method is widely used to estimate species phylogenies using molecular sequence data. While it has long been noted to produce spuriously high posterior probabilities for trees or clades, the precise reasons for this over confidence are unknown. Here we characterize the behavior of Bayesian model selection when the compared models are misspecified and demonstrate that when the models are nearly equally wrong, the method exhibits unpleasant polarized behaviors,supporting one model with high confidence while rejecting others. This provides an explanation for the empirical observation of spuriously high posterior probabilities in molecular phylogenetics.”

The paper focus on the behaviour of posterior probabilities to strongly support a model against others when the sample size is large enough, “even when” all models are wrong, the argument being apparently that the correct output should be one of equal probability between models, or maybe a uniform distribution of these model probabilities over the probability simplex. Why should it be so?! The construction of the posterior probabilities is based on a meta-model that assumes the generating model to be part of a list of mutually exclusive models. It does not account for cases where “all models are wrong” or cases where “all models are right”. The reported probability is furthermore epistemic, in that it is relative to the measure defined by the prior modelling, not to a promise of a frequentist stabilisation in a ill-defined asymptotia. By which I mean that a 99.3% probability of model M¹ being “true”does not have a universal and objective meaning. (*Moderation note:* the high polarisation of posterior probabilities was instrumental in our investigation of model choice with ABC tools and in proposing instead error rates in ABC random forests.)

The notion that two models are equally wrong because they are both exactly at the same Kullback-Leibler distance from the generating process (when optimised over the parameter) is such a formal [or cartoonesque] notion that it does not make much sense. There is always one model that is slightly closer and eventually takes over. It is also bizarre that the argument does not account for the complexity of each model and the resulting (Occam’s razor) penalty. Even two models with a single parameter are not necessarily of intrinsic dimension one, as shown by DIC. And thus it is not a surprise if the posterior probability mostly favours one versus the other. In any case, an healthily sceptic approach to Bayesian model choice means looking at the behaviour of the procedure (Bayes factor, posterior probability, posterior predictive, mixture weight, &tc.) under various assumptions (model M¹, M², &tc.) to calibrate the numerical value, rather than taking it at face value. By which I do not mean a frequentist evaluation of this procedure. Actually, it is rather surprising that the authors of the PNAS paper do not jump on the case when the posterior probability of model M¹ say is uniformly distributed, since this would be a perfect setting when the posterior probability is a p-value. (This is also what happens to the bootstrapped version, see the last paragraph of the paper on p.1859, the year Darwin published his Origin of Species.)

## a resolution of the Jeffreys-Lindley paradox

Posted in Books, Statistics, University life with tags Electronic Journal of Statistics, Jeffreys-Lindley paradox, p-value, Type I error, Type II error on April 24, 2019 by xi'an

“…it is possible to have the best of both worlds. If one allows the significance level to decrease as the sample size gets larger (…) there will be a finite number of errors made with probability one. By allowing the critical values to diverge slowly, one may catch almost all the errors.” (p.1527)

**W**hen commenting another post, Michael Naaman pointed out to me his 2016 Electronic Journal of Statistics paper where he resolves the Jeffreys-Lindley paradox. The argument there is to consider a Type I error going to zero with the sample size n going to infinity but slowly enough for both Type I and Type II errors to go to zero. And guarantee a finite number of errors as the sample size n grows to infinity. This translates for the Jeffreys-Lindley paradox into a pivotal quantity within the posterior probability of the null that converges to zero with n going to infinity. Hence makes it (most) agreeable with the Type I error going to zero. Except that there is little reason to assume this pivotal quantity goes to infinity with n, despite its distribution remaining constant in n. Being constant is less unrealistic, by comparison! That there exists an hypothetical sequence of observations such that the p-value and the posterior probability agree, even exactly, does not “solve” the paradox in my opinion.

## p-value graffiti in the lift [jatp]

Posted in Statistics with tags 1960s, basic statistics, course, jatp, lift, p-value, picture, teaching, Université Paris Dauphine on January 3, 2019 by xi'an## a null hypothesis with a 99% probability to be true…

Posted in Books, R, Statistics, University life with tags cross validated, hypothesis testing, np.random.standard_t, null hypothesis, numpy, p-value, pseudo-random generator, Python, R, t distribution, W. Gosset, x.std on March 28, 2018 by xi'an**W**hen checking the Python t distribution random generator, np.random.standard_t(), I came upon this manual page, which actually does not explain how the random generator works but spends instead the whole page to recall Gosset’s *t* test, illustrating its use on an energy intake of 11 women, but ending up misleading the readers by interpreting a .009 one-sided p-value as meaning “the null hypothesis [on the hypothesised mean] has a probability of about 99% of being true”! Actually, Python’s standard deviation estimator x.std() further returns by default a non-standard standard deviation, dividing by n rather than n-1…