Archive for Harold Jeffreys

covariant priors, Jeffreys and paradoxes

Posted in Books, Statistics, University life on February 9, 2016 by xi'an

“If no information is available, π(α|M) must not deliver information about α.”

In a recent arXival apparently submitted to Bayesian Analysis, Giovanni Mana and Carlo Palmisano discuss the choice of priors in metrology. Which reminded me of this meeting I attended at the Bureau des Poids et Mesures in Sèvres where similar debates took place, albeit led by ferocious anti-Bayesians! Their reference prior appears to be the Jeffreys prior, because of its reparameterisation invariance.

“The relevance of the Jeffreys rule in metrology and in expressing uncertainties in measurements resides in the metric invariance.”

This, along with a second-order approximation to the Kullback-Leibler divergence, is indeed one reason for advocating the use of a Jeffreys prior. At first I found it surprising that the (usually improper) prior is used in a marginal likelihood, as it cannot be normalised. A source of much debate [and of our alternative proposal].
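
For the record, the rule in question and the invariance it enjoys can be written as follows (standard definitions, not quoted from the paper):

\pi_J(\theta)\propto\sqrt{\det I(\theta)},\qquad I(\theta)=\mathbb{E}_\theta\!\left[\nabla_\theta\log f(X\mid\theta)\,\nabla_\theta\log f(X\mid\theta)^\text{T}\right],

so that, for a smooth one-to-one reparameterisation η=h(θ),

\pi_J(\eta)=\pi_J(\theta)\left|\det\frac{\partial\theta}{\partial\eta}\right|\propto\sqrt{\det I(\eta)},

i.e., applying the rule before or after the change of variables returns the same prior.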

“To make a meaningful posterior distribution and uncertainty assessment, the prior density must be covariant; that is, the prior distributions of different parameterizations must be obtained by transformations of variables. Furthermore, it is necessary that the prior densities are proper.”

The above quote is quite interesting both in that the notion of covariant is used rather than invariant or equivariant. And in that properness is indicated as a requirement. (Even more surprising is the noun associated with covariant, since it clashes with the usual notion of covariance!) They conclude that the marginal associated with an improper prior is null because the normalising constant of the prior is infinite.
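
For clarity (my formulation, not the authors'): if the prior is only specified up to an arbitrary constant, the marginal likelihood inherits that arbitrariness,

m(x\mid M)=\int f(x\mid\alpha,M)\,\pi(\alpha\mid M)\,\text{d}\alpha,\qquad \pi(\alpha\mid M)=c\,h(\alpha),\quad \int h(\alpha)\,\text{d}\alpha=\infty,

so that m(x|M) is only defined up to the arbitrary constant c, which is the usual reading behind the debate mentioned above.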

“…the posterior probability of a selected model must not be null; therefore, improper priors are not allowed.”

Maybe not so surprisingly given this stance on improper priors, the authors cover a collection of “paradoxes” in their final and longest section, most of which makes little sense to me. First, they point out that the reference priors of Berger, Bernardo and Sun (2015) are not invariant, but this should not come as a surprise given that they focus on parameters of interest versus nuisance parameters. The second issue pointed out by the authors is that, under Jeffreys’ prior, the posterior distribution of a given normal mean for n observations is a t with n degrees of freedom, while it is a t with n-1 degrees of freedom from a frequentist perspective (a quick derivation is sketched below). This is not such a paradox, since both distributions operate in different spaces. Further, unless I am confused, this is one of the marginalisation paradoxes, whose more straightforward explanation is that marginalisation is not meaningful for improper priors. A third paradox relates to a contingency table with a large number of cells, in that the posterior mean of a cell probability goes to zero as the number of cells goes to infinity. (In this case, Jeffreys’ prior is proper.) Again not much of a bummer: there is simply not enough information in the data when faced with an infinite number of parameters. Paradox #4 is the Stein paradox, when estimating the squared norm of a normal mean. Jeffreys’ prior then leads to a constant bias that increases with the dimension of the vector. Definitely a bad point for Jeffreys’ prior, except that there is no Bayes estimator in such a case, the Bayes risk being infinite. Using a renormalised loss function solves the issue, rather than introducing, as in the paper, uniform priors on intervals, which require hyperpriors without being particularly compelling. The fifth paradox is the Neyman-Scott problem, with the Jeffreys prior again the culprit, since the resulting estimator of the variance is inconsistent, by a multiplicative factor of 2. Another stone in Jeffreys’ garden [of forking paths!]. The authors consider that the prior gives zero weight to any interval not containing zero, as if it were a proper probability distribution. And they “solve” the problem by avoiding zero altogether, which of course requires specifying a lower bound on the variance. And then introducing another (improper) Jeffreys prior on that bound… The last and final paradox mentioned in this paper is one of the marginalisation paradoxes, with the bizarre explanation that, since the mean μ and the variance σ² are not independent a posteriori, “the information delivered by x̄ should not be neglected”.
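
For the record, the degrees-of-freedom discrepancy in that second “paradox” is easily made explicit (a standard computation, contrasting the full Jeffreys prior with the independence prior for a normal sample):

\pi(\mu,\sigma)\propto\sigma^{-2}\ \Longrightarrow\ \frac{\mu-\bar{x}}{\hat\sigma/\sqrt{n}}\,\Big|\,x\sim t_{n},\qquad \hat\sigma^2=\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2,

\pi(\mu,\sigma)\propto\sigma^{-1}\ \Longrightarrow\ \frac{\mu-\bar{x}}{s/\sqrt{n}}\,\Big|\,x\sim t_{n-1},\qquad s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2,

the latter coinciding with the frequentist pivotal distribution.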

on the origin of the Bayes factor

Posted in Books, Statistics on November 27, 2015 by xi'an

Alexander Etz and Eric-Jan Wagenmakers from the Department of Psychology of the University of Amsterdam just arXived a paper on the invention of the Bayes factor. In particular, they highlight the role of John Burdon Sanderson (J.B.S.) Haldane in the use of the central tool for Bayesian comparison of hypotheses. In short, Haldane used a Bayes factor before Jeffreys did!

“The idea of a significance test, I suppose, putting half the probability into a constant being 0, and distributing the other half over a range of possible values.” H. Jeffreys

The authors analyse Jeffreys’ 1935 paper on significance tests, which appears to be the very first occurrence of a Bayes factor in his bibliography, testing whether or not two probabilities are equal. They also show the roots of this derivation in earlier papers by Dorothy Wrinch and Harold Jeffreys. [As an “aside”, the early contributions of Dorothy Wrinch to the foundations of 20th Century Bayesian statistics are hardly acknowledged. A shame, when considering they constitute the basis and more of Jeffreys’ 1931 Scientific Inference; Jeffreys himself wrote in her obituary “I should like to put on record my appreciation of the substantial contribution she made to [our joint] work, which is the basis of all my later work on scientific inference.” In retrospect, Dorothy Wrinch should have been co-author of this book…] Dating back as early as 1919, these papers by Wrinch and Jeffreys are foundational in that they elaborate a construction of prior distributions that will eventually see the Jeffreys non-informative prior as its final solution [Jeffreys priors that should be called Lhoste’s priors according to Steve Fienberg, although I think Ernest Lhoste only considered a limited number of transformations in his invariance rule]. The 1921 paper de facto contains the Bayes factor, but the latter does not appear to be advocated as a tool per se for conducting significance tests.

“The historical records suggest that Haldane calculated the first Bayes factor, perhaps almost by accident, before Jeffreys did.” A. Etz and E.J. Wagenmakers

As another interesting aside, the historical account points out that Jeffreys came out in 1931 with what is now called Haldane’s prior for a Binomial proportion, a prior proposed by Haldane in 1931 (when his paper was read) and 1932 (when it was published in the Mathematical Proceedings of the Cambridge Philosophical Society). The problem tackled by Haldane is again a significance test on a Binomial probability. Contrary to the authors, I find the original (quoted) text quite clear, with a prior split between a uniform on [0,½] and a point mass at ½. Haldane uses posterior odds [of 34.7] to compare both hypotheses but… I see no trace in the quoted material that he ends up using the Bayes factor as such, that is, as his decision rule. (I acknowledge that “decision rule” is anachronistic in this setting.) On the side, Haldane also implements model averaging. Hence my reading of this reading of the 1930s literature is that it remains unclear whether Haldane perceived the Bayes factor as a Bayesian [another anachronism] inference tool, upon which [and only which] significance tests could be conducted. That Haldane had a remarkably modern view of splitting the prior according to two orthogonal measures and of correctly deriving the posterior odds is quite clear. With the very neat trick of removing the infinite integral at p=0, an issue Jeffreys was fighting with at the same time. In conclusion, I would thus rephrase the major finding of this paper as: Haldane should get priority for deriving the Bayesian significance test for point null hypotheses, rather than for deriving the Bayes factor. But this may be my own biased view of Bayes factors speaking there…
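
To make the structure of this split prior concrete (my notation, not Haldane’s): with x successes in n trials, half the prior mass sits at p=½ and half is spread uniformly over [0,½], so that the posterior odds of the point null are

\frac{P(p=\tfrac{1}{2}\mid x)}{P(p<\tfrac{1}{2}\mid x)}=\frac{(\tfrac{1}{2})^{n}}{2\int_0^{1/2}p^{x}(1-p)^{n-x}\,\text{d}p},

the factor 2 in the denominator being the density of the uniform prior on [0,½].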

Another amazing fact I gathered from the historical work of Etz and Wagenmakers is that Haldane and Jeffreys were geographically very close while working on the same problem and hence should have known and referenced their respective works. Which did not happen.

re-revisiting Jeffreys

Posted in Books, pictures, Statistics, Travel, University life on October 16, 2015 by xi'an

Analytic Posteriors for Pearson’s Correlation Coefficient was arXived yesterday by Alexander Ly, Maarten Marsman, and Eric-Jan Wagenmakers from Amsterdam, with whom I recently had two most enjoyable encounters (and dinners!). And whose paper on Jeffreys’ Theory of Probability I recently discussed in the Journal of Mathematical Psychology.

The paper re-analyses Bayesian inference on the Gaussian correlation coefficient, demonstrating that for standard reference priors the posterior moments are (surprisingly) available in closed form. Including priors suggested by Jeffreys (in a 1935 paper), Lindley, Bayarri (Susie’s first paper!), Berger, Bernardo, and Sun. They all are of the form

\pi(\theta)\propto(1+\rho^2)^\alpha(1-\rho^2)^\beta\sigma_1^\gamma\sigma_2^\delta

and the corresponding profile likelihood on ρ is in “closed” form (“closed” because it involves hypergeometric functions). And it only depends on the sample correlation, which is then marginally sufficient (although I do not like this notion!). The posterior moments associated with those priors can be expressed as series (of hypergeometric functions). While the paper is very technical, borrowing from the Bateman project and from Gradshteyn and Ryzhik, I like it if only because it reminds me of some early papers I wrote in the same vein, Abramowitz and Stegun being one of the very first books I bought (at a ridiculous price, in the bookstore of Purdue University…).

Two comments about the paper: nowhere do I see a condition for the posterior to be proper, although I assume it could be the n>1+γ−2α+δ constraint found in Corollary 2.1 (and I am surprised there is no condition on the coefficient β). The second thing is about the use of this analytic expression in simulations from the marginal posterior on ρ: since the density is available, numerical integration is certainly more efficient than Monte Carlo integration [for quantities that are not already available in closed form], as sketched below. Furthermore, in the general case when β is not zero, the cost of computing infinite series of hypergeometric and gamma functions may be counterbalanced by a direct simulation of ρ and both variance parameters, since the profile likelihood of this triplet is truly in closed form, see eqn (2.11). And I will not comment on the fact that Fisher ends up being the most quoted author in the paper!
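
As a crude illustration of that efficiency comparison (using a generic stand-in for an unnormalised marginal posterior density of ρ on (-1,1), definitely not the hypergeometric expression of the paper):

import numpy as np
from scipy.integrate import quad

def unnorm_post(rho, r=0.6, n=20):
    # generic stand-in density on (-1,1); NOT the analytic posterior of the paper
    return (1 - rho**2) ** ((n - 1) / 2) / (1 - rho * r) ** (n - 3 / 2)

# posterior mean of rho by one-dimensional quadrature
Z, _ = quad(unnorm_post, -1, 1)
m, _ = quad(lambda rho: rho * unnorm_post(rho), -1, 1)
print("quadrature posterior mean:", m / Z)

# same quantity by (much noisier) self-normalised Monte Carlo with uniform proposals
u = np.random.uniform(-1, 1, size=10_000)
w = unnorm_post(u)
print("Monte Carlo posterior mean:", (w * u).sum() / w.sum())

The quadrature call is both faster and exact to numerical precision for such one-dimensional quantities, which is the point made above.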

the (expected) demise of the Bayes factor [#2]

Posted in Books, Kids, pictures, Running, Statistics, Travel, University life on July 1, 2015 by xi'an

Following my earlier comments on the paper by Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers, from Amsterdam, Joris Mulder, a special issue editor of the Journal of Mathematical Psychology, kindly asked me for a written discussion of that paper, discussion that I wrote last week and arXived this weekend. Besides the above comments on ToP, this discussion contains some of my usual arguments against the use of the Bayes factor as well as a short introduction to our recent proposal via mixtures. Short introduction as I had to restrain myself from reproducing the arguments in the original paper, for fear it would jeopardize its chances of getting published and, who knows?, discussed.

the maths of Jeffreys-Lindley paradox

Posted in Books, Kids, Statistics on March 26, 2015 by xi'an

Cristiano Villa and Stephen Walker arXived last Friday a paper entitled On the mathematics of the Jeffreys-Lindley paradox. Following the philosophical papers of last year, by Ari Spanos, Jan Sprenger, Guillaume Rochefort-Maranda, and myself, this provides a more statistical view on the paradox. Or “paradox”… Even though I strongly disagree with the conclusion, namely that a finite (prior) variance σ² should be used in the Gaussian prior. And that one should fall back on classical Type I and Type II errors. So, in that sense, the authors avoid the Jeffreys-Lindley paradox altogether!

The argument against considering a limiting value for the posterior probability is that it converges to 0, to 1, or to an intermediate value. In the first two cases the limit is useless, while the intermediate case is achieved when the prior probabilities of the null and alternative hypotheses depend on the variance σ². While I do not want to argue in favour of my 1993 solution

\rho(\sigma) = 1\big/\left(1+\sqrt{2\pi}\,\sigma\right)

since it is ill-defined in measure-theoretic terms, I do not buy the coherence argument that, since this prior probability converges to zero when σ² goes to infinity, the posterior probability should also go to zero. In the limit, probabilistic reasoning fails since the prior under the alternative is a measure, not a probability distribution… We should thus abstain from over-interpreting improper priors. (A sin sometimes committed by Jeffreys himself in his book!)
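
For what it’s worth, the limiting behaviour at stake is easy to reproduce numerically (a minimal sketch, assuming a N(θ,1) sample, a point null θ=0 with prior weight ½, and a N(0,σ²) prior under the alternative):

import numpy as np

def posterior_prob_null(xbar, n, sigma, prior_null=0.5):
    # marginal density of xbar: N(0, 1/n) under H0, N(0, sigma^2 + 1/n) under H1
    m0 = np.sqrt(n / (2 * np.pi)) * np.exp(-0.5 * n * xbar**2)
    v1 = sigma**2 + 1 / n
    m1 = np.exp(-0.5 * xbar**2 / v1) / np.sqrt(2 * np.pi * v1)
    odds = (prior_null / (1 - prior_null)) * m0 / m1
    return odds / (1 + odds)

# a sample mean that is "significant" at the 1% level when n = 100
n, xbar = 100, 2.6 / np.sqrt(100)
for sigma in (1, 10, 100, 1e4, 1e8):
    print(sigma, posterior_prob_null(xbar, n, sigma))

As σ² grows, the posterior probability of the null goes to one whatever the data, which is the very divergence the authors prefer to dissolve by keeping σ² finite.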

Is Jeffreys’ prior unique?

Posted in Books, Statistics, University life on March 3, 2015 by xi'an

“A striking characterisation showing the central importance of Fisher’s information in a differential framework is due to Cencov (1972), who shows that it is the only invariant Riemannian metric under symmetry conditions.” N. Polson, PhD Thesis, University of Nottingham, 1988

Following a discussion on Cross Validated, I wondered about the affirmation that Jeffreys’ prior is the only prior construction rule that remains invariant under arbitrary (if smooth enough) reparameterisations. In the discussion, Paulo Marques mentioned Nikolaj Nikolaevič Čencov’s book, Statistical Decision Rules and Optimal Inference, a Russian book from 1972, of which I had not heard previously and which seems too theoretical [from Paulo’s comments] to explain why this rule would be the sole one. As I kept looking for Čencov’s references on the Web, I found Nick Polson’s thesis and the above quote. So maybe Nick could tell us more!

However, my uncertainty about the uniqueness of Jeffreys’ rule stems from the fact that, if I decide on a favourite or reference parametrisation—as Jeffreys indirectly does when selecting the parametrisation associated with a constant Fisher information—and on a prior derivation from the sampling distribution for this parametrisation, I have derived a parametrisation-invariant principle. Possibly silly and uninteresting from a Bayesian viewpoint, but nonetheless invariant.
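
A textbook instance of such a constant-information parametrisation (my illustration, not part of the Cross Validated discussion): for a Binomial probability p, the Fisher information per observation is I(p)=1/[p(1-p)], so the variance-stabilising transformation

\varphi=2\arcsin\sqrt{p}\ \Longrightarrow\ I(\varphi)\equiv 1,

and a flat prior on φ maps back to Jeffreys’ Be(½,½) prior on p.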

not Bayesian enough?!

Posted in Books, Statistics, University life on January 23, 2015 by xi'an

[Elm tree in the park, Parc de Sceaux, Nov. 22, 2011]

Our random forest paper was alas rejected last week. Alas because I think the approach is a significant advance in ABC methodology when implemented for model choice, avoiding the delicate selection of summary statistics and the report of shaky posterior probability approximations. Alas also because the referees somewhat missed the point, apparently perceiving random forests as a way to project a large collection of summary statistics onto a vector of limited dimension, as in the Read Paper of Paul Fearnhead and Dennis Prangle, while the central point in using random forests is the avoidance of any selection or projection of summary statistics [see the toy sketch below]. They also dismissed our approach based on the argument that the reduction in error rate brought by random forests over LDA or standard (k-nn) ABC is “marginal”, which indicates a degree of misunderstanding of what the classification error stands for in machine learning: the maximum possible gain in supervised learning with a large number of classes cannot be brought arbitrarily close to zero. Last but not least, the referees did not appreciate why we mostly cannot trust posterior probabilities produced by ABC model choice and hence why the posterior error loss is a valuable and almost inevitable machine-learning alternative, dismissing the posterior expected loss as being not Bayesian enough (or at all), for “averaging over hypothetical datasets” (which is a replicate of Jeffreys’ famous criticism of p-values)! Certainly a first time for me to be rejected based on this argument!
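
To fix ideas about what the procedure amounts to (a toy sketch of the general principle, not our actual implementation, data, or summary statistics): simulate from each model, stack as many summaries as one wishes, and let the forest handle the selection, with the out-of-bag error playing the role of the prior error rate.

import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_obs, n_sim = 50, 2000

def summaries(x):
    # an arbitrary, possibly redundant collection of summary statistics
    return [x.mean(), x.var(), stats.skew(x), stats.kurtosis(x),
            np.median(x), np.abs(x - np.median(x)).mean()]

# model 0: Gaussian; model 1: Laplace with matching mean and variance
X, y = [], []
for m in (0, 1):
    for _ in range(n_sim):
        theta = rng.normal(0, 1)
        x = rng.normal(theta, 1, n_obs) if m == 0 else rng.laplace(theta, 1 / np.sqrt(2), n_obs)
        X.append(summaries(x))
        y.append(m)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(np.array(X), np.array(y))
print("out-of-bag (prior) error rate:", 1 - rf.oob_score_)

# "observed" data set and the forest's model choice
x_obs = rng.laplace(0.3, 1 / np.sqrt(2), n_obs)
print("selected model:", rf.predict([summaries(x_obs)])[0])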