**L**ast December, Gunnar Taraldsen, Jarle Tufto, and Bo H. Lindqvist arXived a paper on using priors that lead to improper posteriors and [trying to] getting away with it! The central concept in their approach is Rényi’s generalisation of Kolmogorov’s version to define conditional probability distributions from infinite mass measures by conditioning on finite mass measurable sets. A position adopted by Dennis Lindley in his 1964 book .And already discussed in a few ‘Og’s posts. While the theory thus developed indeed allows for the manipulation of improper posteriors, I have difficulties with the inferential aspects of the construct, since one cannot condition on an arbitrary finite measurable set without prior information. Things get a wee bit more outwardly when considering “data” with infinite mass, in Section 4.2, since they cannot be properly normalised (although I find the example of the degenerate multivariate Gaussian distribution puzzling as it is not a matter of improperness, since the degenerate Gaussian has a well-defined density against the right dominating measure). The paper also discusses marginalisation paradoxes, by acknowledging that marginalisation is no longer feasible with improper quantities. And the Jeffreys-Lindley paradox, with a resolution that uses the sum of the Dirac mass at the null, δ⁰, and of the Lebesgue measure on the real line, λ, as the dominating measure. This indeed solves the issue of the arbitrary constant in the Bayes factor, since it is “the same” on the null hypothesis and elsewhere, but I do not buy the argument, as I see no reason to favour δ⁰+λ over 3.141516 δ⁰+λ or δ⁰+1.61718 λ… (This section 4.5 also illustrates that the choice of the sequence of conditioning sets has an impact on the limiting measure, in the Rényi sense.) In conclusion, after reading the paper, I remain uncertain as to how to exploit this generalisation from an inferential (Bayesian?) viewpoint, since improper posteriors do not clearly lead to well-defined inferential procedures…

## Archive for Jeffreys-Lindley paradox

## statistics with improper posteriors [or not]

Posted in Statistics with tags Alfréd Rényi, Andrei Kolmogorov, Dennis Lindley, improper posteriors, improper priors, Jeffreys-Lindley paradox, marginalisation paradoxes on March 6, 2019 by xi'an## Lindley’s paradox as a loss of resolution

Posted in Books, pictures, Statistics with tags AIC, Bayesian inference, Bayesian tests of hypotheses, BIC, Cross Validation, Inferential Models, Jeffreys-Lindley paradox, p-value, pseudo-Bayes factors on November 9, 2016 by xi'an

“The principle of indifference states that in the absence of prior information, all mutually exclusive models should be assigned equal prior probability.”

**C**olin LaMont and Paul Wiggins arxived a paper on Lindley’s paradox a few days ago. The above quote is the (standard) argument for picking (½,½) partition between the two hypotheses, which I object to if only because it does not stand for multiple embedded models. The main point in the paper is to argue about the loss of resolution induced by averaging against the prior, as illustrated by the picture above for the N(0,1) versus N(μ,1) toy problem. What they call resolution is the lowest possible mean estimate for which the null is rejected by the Bayes factor (assuming a rejection for Bayes factors larger than 1). While the detail is missing, I presume the different curves on the lower panel correspond to different choices of L when using U(-L,L) priors on μ… The “Bayesian rejoinder” to the Lindley-Bartlett paradox (p.4) is in tune with my interpretation, namely that as the prior mass under the alternative gets more and more spread out, there is less and less prior support for reasonable values of the parameter, hence a growing tendency to accept the null. This is an illustration of the long-lasting impact of the prior on the posterior probability of the model, because the data cannot impact the tails very much.

“If the true prior is known, Bayesian inference using the true prior is optimal.”

This sentence and the arguments following is meaningless in my opinion as knowing the “true” prior makes the Bayesian debate superfluous. If there was a unique, Nature provided, known prior π, it would loose its original meaning to become part of the (frequentist) model. The argument is actually mostly used in negative, namely that since it is not know we should not follow a Bayesian approach: this is, e.g., the main criticism in Inferential Models. But there is no such thing as a “true” prior! (Or a “true’ model, all things considered!) In the current paper, this pseudo-natural approach to priors is utilised to justify a return to the pseudo-Bayes factors of the 1990’s, when one part of the data is used to stabilise and proper-ise the (improper) prior, and a second part to run the test *per se*. This includes an interesting insight on the limiting cases of partitioning corresponding to AIC and BIC, respectively, that I had not seen before. With the surprising conclusion that “AIC is the derivative of BIC”!

## at CIRM [#2]

Posted in Mountains, pictures, Running, Statistics, Travel, University life with tags approximate MCMC, Cap Morgiou, CIRM, clustering, EM algorithm, functional programming languages, improper priors, Jeffreys-Lindley paradox, La Grande Candelle, Luminy, Marseille, Monte Carlo Statistical Methods, noisy MCMC, pGibbs, pMCMC, q-vague convergence, trail running, variable selection, yeast on March 2, 2016 by xi'an**S**ylvia Richardson gave a great talk yesterday on clustering applied to variable selection, which first raised [in me] a usual worry of the lack of background model for clustering. But the way she used this notion meant there was an infinite Dirichlet process mixture model behind. This is quite novel [at least for me!] in that it addresses the covariates and not the observations themselves. I still wonder at the meaning of the cluster as, if I understood properly, the dependent variable is not involved in the clustering. Check her R package PReMiuM for a practical implementation of the approach. Later, Adeline Samson showed us the results of using pMCM versus particle Gibbs for diffusion processes where (a) pMCMC was behaving much worse than particle Gibbs and (b) EM required very few particles and Metropolis-Hastings steps to achieve convergence, when compared with posterior approximations.

Today Pierre Druilhet explained to the audience of the summer school his measure theoretic approach [I discussed a while ago] to the limit of proper priors via *q-vague* convergence, with the paradoxical phenomenon that a Be(n⁻¹,n⁻¹) converges to a sum of two Dirac masses when the parameter space is [0,1] but to Haldane’s prior when the space is (0,1)! He also explained why the Jeffreys-Lindley paradox vanishes when considering different measures [with an illustration that came from my Statistica Sinica 1993 paper]. Pierre concluded with the above opposition between two Bayesian paradigms, a [sort of] tale of two sigma [fields]! Not that I necessarily agree with the first paradigm that priors are supposed to have generated the actual parameter. If only because it mechanistically excludes all improper priors…

Darren Wilkinson talked about yeast, which is orders of magnitude more exciting than it sounds, because this is Bayesian big data analysis in action! With significant (and hence impressive) results based on stochastic dynamic models. And massive variable selection techniques. Scala, Haskell, Frege, OCaml were [functional] languages he mentioned that I had never heard of before! And Daniel Rudolf concluded the [intense] second day of this Bayesian week at CIRM with a description of his convergence results for (rather controlled) noisy MCMC algorithms.

## normality test with 10⁸ observations?

Posted in Books, pictures, Statistics, Travel, University life with tags Bayes factor, Jeffreys-Lindley paradox, normal number on February 23, 2016 by xi'an**Q**uentin Gronau and Eric-Jan Wagenmakers just arXived a rather exotic paper in that it merges experimental mathematics with Bayesian inference. The mathematical question at stake here is whether or not one of the classical irrational constants like π, e or √2 are “normal”, that is, have the same frequency for all digits in their decimal expansion. This (still) is an open problem in mathematics. Indeed, the authors do not provide a definitive answer but instead run a Bayesian testing experiment on 100 million digits on π, ending up with a Bayes factor of 2×10³¹. The figure is massive, however one must account for the number of “observations” in the sample. (Which is not a statistical sample, strictly speaking.) While I do not think the argument will convince an algebraist (as the counterargument of knowing nothing about digits after the 10⁸th one is easy to formulate!), I am also uncertain of the relevance of this huge figure, as I am unable to justify a prior on the distribution of digits if the number is not normal. Since we do not even know whether there are ~~non-~~normal numbers outside rational numbers. While the flat Dirichlet prior is a uniform prior over the simplex, to assume that all possible probability repartitions are equally possible may not appeal to a mathematician, as far as I [do not] know! Furthermore, the multinomial model imposed on (at?) the series of digit of π does not have to agree with this “data” and discrepancies may as well be due to a poor sampling model as to an inappropriate prior. The data may more agree with H⁰ than with H¹ because the sampling model in H¹ is ill-suited. The paper also considers a second prior (or posterior prior) that I do not find particularly relevant.

For all I [do not] know, the huge value of the Bayes factor may be another avatar of the Lindley-Jeffreys paradox. In the sense of my interpretation of the phenomenon as a dilution of the prior mass over an unrealistically large space. Actually, the authors mention the paradox as well (p.5) but seemingly as a criticism of a frequentist approach. The picture above has its lower bound determined by a virtual dataset that produces a χ² statistic equal to the 95% χ² quantile. Dataset that stills produces a fairly high Bayes factor. (The discussion seems to assume that the Bayes factor is a one-to-one function of the χ² statistics, which is not correct I think. I wonder if exactly 95% of the sequence of Bayes factors stays within this band. There is no theoretical reason for this to happen of course.) Hence an illustration of the Lindley-Jeffreys paradox indeed, in its first interpretation of the clash between conclusions based on both paradigms. As a conclusion, I am thus not terribly convinced that this experiment supports the use of a Bayes factor for solving this normality hypothesis. Not that I support the alternative use of the p-value of course! As a sidenote, the pdf file I downloaded from arXiv has a slight bug that interacted badly with my printer in Warwick, as shown in the picture above.