## on the origin of the Bayes factor

Posted in Books, Statistics with tags , , , , , , , on November 27, 2015 by xi'an

Alexander Etz and Eric-Jan Wagenmakers from the Department of Psychology of the University of Amsterdam just arXived a paper on the invention of the Bayes factor. In particular, they highlight the role of John Burdon Sanderson (J.B.S.) Haldane in the use of the central tool for Bayesian comparison of hypotheses. In short, Haldane used a Bayes factor before Jeffreys did!

“The idea of a significance test, I suppose, putting half the probability into a constant being 0, and distributing the other half over a range of possible values.”H. Jeffreys

The authors analyse Jeffreys’ 1935 paper on significance tests, which appears to be the very first occurrence of a Bayes factor in his bibliography, testing whether or not two probabilities are equal. They also show the roots of this derivation in earlier papers by Dorothy Wrinch and Harold Jeffreys. [As an “aside”, the early contributions of Dorothy Wrinch to the foundations of 20th Century Bayesian statistics are hardly acknowledged. A shame, when considering they constitute the basis and more of Jeffreys’ 1931 Scientific Inference, Jeffreys who wrote in her necrology “I should like to put on record my appreciation of the substantial contribution she made to [our joint] work, which is the basis of all my later work on scientific inference.” In retrospect, Dorothy Wrinch should have been co-author to this book…] As early as 1919. These early papers by Wrinch and Jeffreys are foundational in that they elaborate a construction of prior distributions that will eventually see the Jeffreys non-informative prior as its final solution [Jeffreys priors that should be called Lhostes priors according to Steve Fienberg, although I think Ernest Lhoste only considered a limited number of transformations in his invariance rule]. The 1921 paper contains de facto the Bayes factor but it does not appear to be advocated as a tool per se for conducting significance tests.

“The historical records suggest that Haldane calculated the first Bayes factor, perhaps almost by accident, before Jeffreys did.” A. Etz and E.J. Wagenmakers

As another interesting aside, the historical account points out that Jeffreys came out in 1931 with what is now called Haldane’s prior for a Binomial proportion, proposed in 1931 (when the paper was read) and in 1932 (when the paper was published in the Mathematical Proceedings of the Cambridge Philosophical Society) by Haldane. The problem tackled by Haldane is again a significance on a Binomial probability. Contrary to the authors, I find the original (quoted) text quite clear, with a prior split before a uniform on [0,½] and a point mass at ½. Haldane uses a posterior odd [of 34.7] to compare both hypotheses but… I see no trace in the quoted material that he ends up using the Bayes factor as such, that is as his decision rule. (I acknowledge decision rule is anachronistic in this setting.) On the side, Haldane also implements model averaging. Hence my reading of this reading of the 1930’s literature is that it remains unclear that Haldane perceived the Bayes factor as a Bayesian [another anachronism] inference tool, upon which [and only which] significance tests could be conducted. That Haldane had a remarkably modern view of splitting the prior according to two orthogonal measures and of correctly deriving the posterior odds is quite clear. With the very neat trick of removing the infinite integral at p=0, an issue that Jeffreys was fighting with at the same time. In conclusion, I would thus rephrase the major finding of this paper as Haldane should get the priority in deriving the Bayesian significance test for point null hypotheses, rather than in deriving the Bayes factor. But this may be my biased views of Bayes factors speaking there…

Another amazing fact I gathered from the historical work of Etz and Wagenmakers is that Haldane and Jeffreys were geographically very close while working on the same problem and hence should have known and referenced their respective works. Which did not happen.

## re-revisiting Jeffreys

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , on October 16, 2015 by xi'an

Analytic Posteriors for Pearson’s Correlation Coefficient was arXived yesterday by Alexander Ly , Maarten Marsman, and Eric-Jan Wagenmakers from Amsterdam, with whom I recently had two most enjoyable encounters (and dinners!). And whose paper on Jeffreys’ Theory of Probability I recently discussed in the Journal of Mathematical Psychology.

The paper re-analyses Bayesian inference on the Gaussian correlation coefficient, demonstrating that for standard reference priors the posterior moments are (surprisingly) available in closed form. Including priors suggested by Jeffreys (in a 1935 paper), Lindley, Bayarri (Susie’s first paper!), Berger, Bernardo, and Sun. They all are of the form

$\pi(\theta)\propto(1+\rho^2)^\alpha(1-\rho^2)^\beta\sigma_1^\gamma\sigma_2^\delta$

and the corresponding profile likelihood on ρ is in “closed” form (“closed” because it involves hypergeometric functions). And only depends on the sample correlation which is then marginally sufficient (although I do not like this notion!). The posterior moments associated with those priors can be expressed as series (of hypergeometric functions). While the paper is very technical, borrowing from the Bateman project and from Gradshteyn and Ryzhik, I like it if only because it reminds me of some early papers I wrote in the same vein, Abramowitz and Stegun being one of the very first books I bought (at a ridiculous price in the bookstore of Purdue University…).

Two comments about the paper: I see nowhere a condition for the posterior to be proper, although I assume it could be the n>1+γ−2α+δ constraint found in Corollary 2.1 (although I am surprised there is no condition on the coefficient β). The second thing is about the use of this analytic expression in simulations from the marginal posterior on ρ: Since the density is available, numerical integration is certainly more efficient than Monte Carlo integration [for quantities that are not already available in closed form]. Furthermore, in the general case when β is not zero, the cost of computing infinite series of hypergeometric and gamma functions maybe counterbalanced by a direct simulation of ρ and both variance parameters since the profile likelihood of this triplet is truly in closed form, see eqn (2.11). And I will not comment the fact that Fisher ends up being the most quoted author in the paper!

## the (expected) demise of the Bayes factor [#2]

Posted in Books, Kids, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , on July 1, 2015 by xi'an

Following my earlier comments on Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers, from Amsterdam, Joris Mulder, a special issue editor of the Journal of Mathematical Psychology, kindly asked me for a written discussion of that paper, discussion that I wrote last week and arXived this weekend. Besides the above comments on ToP, this discussion contains some of my usual arguments against the use of the Bayes factor as well as a short introduction to our recent proposal via mixtures. Short introduction as I had to restrain myself from reproducing the arguments in the original paper, for fear it would jeopardize its chances of getting published and, who knows?, discussed.

## the maths of Jeffreys-Lindley paradox

Posted in Books, Kids, Statistics with tags , , , , , , , on March 26, 2015 by xi'an

Cristiano Villa and Stephen Walker arXived on last Friday a paper entitled On the mathematics of the Jeffreys-Lindley paradox. Following the philosophical papers of last year, by Ari Spanos, Jan Sprenger, Guillaume Rochefort-Maranda, and myself, this provides a more statistical view on the paradox. Or “paradox”… Even though I strongly disagree with the conclusion, namely that a finite (prior) variance σ² should be used in the Gaussian prior. And fall back on classical Type I and Type II errors. So, in that sense, the authors avoid the Jeffreys-Lindley paradox altogether!

The argument against considering a limiting value for the posterior probability is that it converges to 0, 21, or an intermediate value. In the first two cases it is useless. In the medium case. achieved when the prior probability of the null and alternative hypotheses depend on variance σ². While I do not want to argue in favour of my 1993 solution

$\rho(\sigma) = 1\big/ 1+\sqrt{2\pi}\sigma$

since it is ill-defined in measure theoretic terms, I do not buy the coherence argument that, since this prior probability converges to zero when σ² goes to infinity, the posterior probability should also go to zero. In the limit, probabilistic reasoning fails since the prior under the alternative is a measure not a probability distribution… We should thus abstain from over-interpreting improper priors. (A sin sometimes committed by Jeffreys himself in his book!)

## Is Jeffreys’ prior unique?

Posted in Books, Statistics, University life with tags , , , , , on March 3, 2015 by xi'an

“A striking characterisation showing the central importance of Fisher’s information in a differential framework is due to Cencov (1972), who shows that it is the only invariant Riemannian metric under symmetry conditions.” N. Polson, PhD Thesis, University of Nottingham, 1988

Following a discussion on Cross Validated, I wonder whether or not the affirmation that Jeffreys’ prior was the only prior construction rule that remains invariant under arbitrary (if smooth enough) reparameterisation. In the discussion, Paulo Marques mentioned Nikolaj Nikolaevič Čencov’s book, Statistical Decision Rules and Optimal Inference, Russian book from 1972, of which I had not heard previously and which seems too theoretical [from Paulo’s comments] to explain why this rule would be the sole one. As I kept looking for Čencov’s references on the Web, I found Nick Polson’s thesis and the above quote. So maybe Nick could tell us more!

However, my uncertainty about the uniqueness of Jeffreys’ rule stems from the fact that, f I decide on a favourite or reference parametrisation—as Jeffreys indirectly does when selecting the parametrisation associated with a constant Fisher information—and on a prior derivation from the sampling distribution for this parametrisation, I have derived a parametrisation invariant principle. Possibly silly and uninteresting from a Bayesian viewpoint but nonetheless invariant.

## not Bayesian enough?!

Posted in Books, Statistics, University life with tags , , , , , , , on January 23, 2015 by xi'an

Our random forest paper was alas rejected last week. Alas because I think the approach is a significant advance in ABC methodology when implemented for model choice, avoiding the delicate selection of summary statistics and the report of shaky posterior probability approximation. Alas also because the referees somewhat missed the point, apparently perceiving random forests as a way to project a large collection of summary statistics on a limited dimensional vector as in the Read Paper of Paul Fearnhead and Dennis Prarngle, while the central point in using random forests is the avoidance of a selection or projection of summary statistics.  They also dismissed ou approach based on the argument that the reduction in error rate brought by random forests over LDA or standard (k-nn) ABC is “marginal”, which indicates a degree of misunderstanding of what the classification error stand for in machine learning: the maximum possible gain in supervised learning with a large number of classes cannot be brought arbitrarily close to zero. Last but not least, the referees did not appreciate why we mostly cannot trust posterior probabilities produced by ABC model choice and hence why the posterior error loss is a valuable and almost inevitable machine learning alternative, dismissing the posterior expected loss as being not Bayesian enough (or at all), for “averaging over hypothetical datasets” (which is a replicate of Jeffreys‘ famous criticism of p-values)! Certainly a first time for me to be rejected based on this argument!

## Harold Jeffreys’ default Bayes factor [for psychologists]

Posted in Books, Statistics, University life with tags , , , , , , on January 16, 2015 by xi'an

“One of Jeffreys’ goals was to create default Bayes factors by using prior distributions that obeyed a series of general desiderata.”

The paper Harold Jeffreys’s default Bayes factor hypothesis tests: explanation, extension, and application in Psychology by Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers is both a survey and a reinterpretation cum explanation of Harold Jeffreys‘ views on testing. At about the same time, I received a copy from Alexander and a copy from the journal it had been submitted to! This work starts with a short historical entry on Jeffreys’ work and career, which includes four of his principles, quoted verbatim from the paper:

1. “scientific progress depends primarily on induction”;
2. “in order to formalize induction one requires a logic of partial belief” [enters the Bayesian paradigm];
3. “scientific hypotheses can be assigned prior plausibility in accordance with their complexity” [a.k.a., Occam’s razor];
4. “classical “Fisherian” p-values are inadequate for the purpose of hypothesis testing”.

“The choice of π(σ)  therefore irrelevant for the Bayes factor as long as we use the same weighting function in both models”

A very relevant point made by the authors is that Jeffreys only considered embedded or nested hypotheses, a fact that allows for having common parameters between models and hence some form of reference prior. Even though (a) I dislike the notion of “common” parameters and (b) I do not think it is entirely legit (I was going to write proper!) from a mathematical viewpoint to use the same (improper) prior on both sides, as discussed in our Statistical Science paper. And in our most recent alternative proposal. The most delicate issue however is to derive a reference prior on the parameter of interest, which is fixed under the null and unknown under the alternative. Hence preventing the use of improper priors. Jeffreys tried to calibrate the corresponding prior by imposing asymptotic consistency under the alternative. And exact indeterminacy under “completely uninformative” data. Unfortunately, this is not a well-defined notion. In the normal example, the authors recall and follow the proposal of Jeffreys to use an improper prior π(σ)∝1/σ on the nuisance parameter and argue in his defence the quote above. I find this argument quite weak because suddenly the prior on σ becomes a weighting function... A notion foreign to the Bayesian cosmology. If we use an improper prior for π(σ), the marginal likelihood on the data is no longer a probability density and I do not buy the argument that one should use the same measure with the same constant both on σ alone [for the nested hypothesis] and on the σ part of (μ,σ) [for the nesting hypothesis]. We are considering two spaces with different dimensions and hence orthogonal measures. This quote thus sounds more like wishful thinking than like a justification. Similarly, the assumption of independence between δ=μ/σ and σ does not make sense for σ-finite measures. Note that the authors later point out that (a) the posterior on σ varies between models despite using the same data [which shows that the parameter σ is far from common to both models!] and (b) the [testing] Cauchy prior on δ is only useful for the testing part and should be replaced with another [estimation] prior when the model has been selected. Which may end up as a backfiring argument about this default choice.

“Each updated weighting function should be interpreted as a posterior in estimating σ within their own context, the model.”

The re-derivation of Jeffreys’ conclusion that a Cauchy prior should be used on δ=μ/σ makes it clear that this choice only proceeds from an imperative of fat tails in the prior, without solving the calibration of the Cauchy scale. (Given the now-available modern computing tools, it would be nice to see the impact of this scale γ on the numerical value of the Bayes factor.) And maybe it also proceeds from a “hidden agenda” to achieve a Bayes factor that solely depends on the t statistic. Although this does not sound like a compelling reason to me, since the t statistic is not sufficient in this setting.

In a differently interesting way, the authors mention the Savage-Dickey ratio (p.16) as a way to represent the Bayes factor for nested models, without necessarily perceiving the mathematical difficulty with this ratio that we pointed out a few years ago. For instance, in the psychology example processed in the paper, the test is between δ=0 and δ≥0; however, if I set π(δ=0)=0 under the alternative prior, which should not matter [from a measure-theoretic perspective where the density is uniquely defined almost everywhere], the Savage-Dickey representation of the Bayes factor returns zero, instead of 9.18!

“In general, the fact that different priors result in different Bayes factors should not come as a surprise.”

The second example detailed in the paper is the test for a zero Gaussian correlation. This is a sort of “ideal case” in that the parameter of interest is between -1 and 1, hence makes the choice of a uniform U(-1,1) easy or easier to argue. Furthermore, the setting is also “ideal” in that the Bayes factor simplifies down into a marginal over the sample correlation only, under the usual Jeffreys priors on means and variances. So we have a second case where the frequentist statistic behind the frequentist test[ing procedure] is also the single (and insufficient) part of the data used in the Bayesian test[ing procedure]. Once again, we are in a setting where Bayesian and frequentist answers are in one-to-one correspondence (at least for a fixed sample size). And where the Bayes factor allows for a closed form through hypergeometric functions. Even in the one-sided case. (This is a result obtained by the authors, not by Jeffreys who, as the proper physicist he was, obtained approximations that are remarkably accurate!)

“The fact that the Bayes factor is independent of the intention with which the data have been collected is of considerable practical importance.”

The authors have a side argument in this section in favour of the Bayes factor against the p-value, namely that the “Bayes factor does not depend on the sampling plan” (p.29), but I find this fairly weak (or tongue in cheek) as the Bayes factor does depend on the sampling distribution imposed on top of the data. It appears that the argument is mostly used to defend sequential testing.

“The Bayes factor (…) balances the tension between parsimony and goodness of fit, (…) against overfitting the data.”

In fine, I liked very much this re-reading of Jeffreys’ approach to testing, maybe the more because I now think we should get away from it! I am not certain it will help in convincing psychologists to adopt Bayes factors for assessing their experiments as it may instead frighten them away. And it does not bring an answer to the vexing issue of the relevance of point null hypotheses. But it constitutes a lucid and innovative of the major advance represented by Jeffreys’ formalisation of Bayesian testing.