## Archive for Bayes factor

## the (expected) demise of the Bayes factor [#2]

Posted in Books, Kids, pictures, Running, Statistics, Travel, University life with tags Amsterdam, Bayes factor, boat, Harold Jeffreys, Holland, Journal of Mathematical Psychology, psychometrics, sunrise, Theory of Probability, XXX on July 1, 2015 by xi'an

**F**ollowing my earlier comments on the paper by Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers, from Amsterdam, Joris Mulder, a special issue editor of the *Journal of Mathematical Psychology*, kindly asked me for a written discussion of that paper, a discussion that I wrote last week and arXived this weekend. Besides the above comments on ToP, this discussion contains some of my usual arguments against the use of the Bayes factor as well as a short introduction to our recent proposal via mixtures. A short introduction, as I had to restrain myself from reproducing the arguments in the original paper, for fear it would jeopardize its chances of getting published and, who knows?, discussed.

## speed seminar-ing

Posted in Books, pictures, Statistics, Travel, University life, Wines with tags ABC, Bayes factor, Bayesian model choice, Bayesian testing, French cheese, French wines, Languedoc wines, Livarot, Mas Bruguière, Montpellier, Pic Saint Loup on May 20, 2015 by xi'an

**Y**esterday, I made a quick afternoon trip to Montpellier as a replacement for a seminar speaker who had cancelled at the last minute. Most obviously, I gave a talk about our “testing as mixture” proposal. And as previously, the talk generated a fair amount of discussion and feedback from the audience, providing me with additional aspects to include in a revision of the paper. Whether or not the current submission is rejected, new points made and received during those seminars will have to get into a revised version, as they definitely add to the appeal of the perspective.

In that seminar, most of the discussion concentrated on the connection with *decisions* based on such a tool as the posterior distribution of the mixture weight(s). My argument for sticking with the posterior rather than providing a hard decision rule was that the message is indeed in arguing against hard rules that end up mimicking the p- or b-values, and against the catastrophic consequences of fishing for significance and the like. Producing instead a validation by simulating pseudo-samples under each model shows what to expect for each model under comparison. The argument did not really convince Jean-Michel Marin, I am afraid! Another point he raised was that we could instead use a distribution on α with support {0,1}, to avoid the encompassing model he felt was too far from the original models. However, this leads back to the Bayes factor, as the posterior weights at 0 and 1 are given by the marginal likelihoods, nothing more. Still, this perspective on the classical approach has at least the appeal of completely validating the use of improper priors on common (nuisance or not) parameters.

Pierre Pudlo also wondered why we could not conduct an analysis on the mixture of the likelihoods instead of the likelihood of the mixture. My first answer was that there was not enough information in the data for estimating the weight(s). A few more seconds of reflection led me to the further argument that the posterior on α with support (0,1) would then be a mixture of Be(2,1) and Be(1,2) with weights the marginal likelihoods, again (under a uniform prior on α). So indeed not much to gain.

A last point we discussed was the case of the evolution trees we analyse with population geneticists from the neighbourhood (and with ABC). Jean-Michel’s argument was that the scenarios under comparison were not compatible with a mixture, the models being exclusive. My reply involved an admixture model that contained all scenarios as special cases. After a longer pondering, I think his objection was more about the non-iid nature of the data. But the admixture construction remains valid. And makes a very strong case in favour of our approach, I believe.
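The Be(2,1)/Be(1,2) mixture mentioned above can be recovered in a few lines. As a sketch only, write $m_1(x)$ and $m_2(x)$ for the two marginal likelihoods (notation assumed here, with proper independent priors on the parameters of each model) and let α be uniform on (0,1) and weight the first model:

```latex
\begin{align*}
\pi(\alpha \mid x) &\propto \alpha\, m_1(x) + (1-\alpha)\, m_2(x), \qquad 0<\alpha<1,\\
\int_0^1 \bigl\{\alpha\, m_1(x) + (1-\alpha)\, m_2(x)\bigr\}\,\mathrm{d}\alpha
  &= \tfrac{1}{2}\bigl\{m_1(x)+m_2(x)\bigr\},\\
\pi(\alpha \mid x) &= \frac{m_1(x)}{m_1(x)+m_2(x)}\; 2\alpha
  \;+\; \frac{m_2(x)}{m_1(x)+m_2(x)}\; 2(1-\alpha).
\end{align*}
```

Since 2α and 2(1−α) are the Be(2,1) and Be(1,2) densities, the data only enters this posterior through the ratio of $m_1(x)$ to $m_2(x)$, that is, through the Bayes factor.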

After the seminar, Christian Lavergne and Jean-Michel had organised a doubly exceptional wine-and-cheese party: first because it is not usually the case that there is such a post-seminar party and second because they had chosen a terrific series of wines from the Mas Bruguière (Pic Saint-Loup) vineyards. Ending up with a great 2007 L’Arbouse. Perfect ending for an exciting day. (I am not even mentioning a special Livarot from close to my home-town!)

## another view on Jeffreys-Lindley paradox

Posted in Books, Statistics, University life with tags Bayes factor, frequentist inference, Jeffreys-Lindley paradox, Philosophy of Science, Québec, Université Laval on January 15, 2015 by xi'an

**I** found another paper on the Jeffreys-Lindley paradox. Entitled “A Misleading Intuition and the Bayesian Blind Spot: Revisiting the Jeffreys-Lindley’s Paradox”. Written by Guillaume Rochefort-Maranda, from Université Laval, Québec.

This paper starts by assuming an *unbiased* estimator of the parameter of interest θ, which is under test for the null θ=θ_{0}. (Which makes me wonder at the reason for imposing unbiasedness.) Another highly innovative (or puzzling) aspect is that the Lindley-Jeffreys paradox presented therein is described *without* *any* *Bayesian input*. The paper stands “within a frequentist (classical) framework”: it actually starts with a confidence-interval-on-θ-vs.-test argument to argue that, with a fixed coverage interval that excludes the null value θ_{0}, the estimate of θ may converge to θ_{0} without ever accepting the null θ=θ_{0}. That is, without the confidence interval *ever* containing θ_{0}. (Although this is an event whose probability converges to zero.) Bayesian aspects come later in the paper, even though the application to a point null *versus* point alternative test is of little interest since a Bayes factor is then a likelihood ratio.

As I have explained several times, including in my *Philosophy of Science* paper, I see the Lindley-Jeffreys paradox as being primarily a Bayesiano-Bayesian issue. So just the opposite of the perspective taken by the paper. That frequentist solutions differ does not strike me as paradoxical. Now, the construction of a sequence of samples such that *all* partial samples in the sequence exclude the null θ=θ_{0} is not a likely event, so I do not see this as a paradox even or especially when putting on my frequentist glasses: if the null θ=θ_{0} is true, this cannot happen in a consistent manner, even though a *single* occurrence of a p-value less than .05 is highly likely within such a sequence.

Unsurprisingly, the paper relates to the three most recent papers published by *Philosophy of Science*, discussing first and foremost Spanos’ view. When the current author introduces Mayo and Spanos’ severity, i.e. the probability of exceeding the observed test statistic under the alternative, he does not define this test statistic d(X), which makes the whole notion incomprehensible to a reader not already familiar with it. (And even for one familiar with it…)

“Hence, the solution I propose (…) avoids one of [Freeman’s] major disadvantages. I suggest that we should decrease the size of tests to the extent where it makes practically no difference to the power of the test in order to improve the likelihood ratio of a significant result.” (p.11)

One interesting if again unsurprising point in the paper is that one reason for the paradox stands in *keeping the significance level constant* as the sample size increases, while it is possible to decrease the significance level *and* to increase the power simultaneously. However, the solution proposed above does not sound rigorous, hence I fail to understand how low the significance level has to be for the method to stop/work. I cannot fathom a corresponding algorithmic derivation of the author’s proposal.
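To see what a constant significance level does as the sample size grows, here is a minimal sketch of the usual Bayesian rendering of the paradox. It is not taken from the paper under discussion: the normal model, the N(0, τ²) prior under the alternative, and the function name `bayes_factor_null` are all assumptions made for illustration.

```python
import math


def bayes_factor_null(z, n, tau=1.0, sigma=1.0):
    """Bayes factor B01 of H0: theta = 0 against H1: theta ~ N(0, tau^2).

    The data enter through z = sqrt(n) * xbar / sigma, where xbar is the
    mean of n observations from N(theta, sigma^2) (assumed setting).
    """
    r = n * tau**2 / sigma**2  # prior variance relative to the variance of xbar
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))


if __name__ == "__main__":
    z = 1.96  # statistic kept at the two-sided 5% boundary for every n
    for n in (10, 100, 1_000, 10_000, 100_000):
        print(n, round(bayes_factor_null(z, n), 3))
```

With z held at the 5% boundary, the factor in favour of the null grows like the square root of n, which is the tension a decreasing significance level is meant to remove.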

“I argue against the intuitive idea that a significant result given by a very powerful test is less convincing than a significant result given by a less powerful test.”

The criticism of the “blind spot” of the Bayesian approach is supported by an example where the data is issued from a distribution other than either of the two tested distributions. It seems reasonable that the Bayesian answer fails to provide a proper answer in this case. Even though it illustrates the difficulty with the long-term impact of the prior(s) in the Bayes factor and (in my opinion) the need to move away from this solution within the Bayesian paradigm.

## full Bayesian significance test

Posted in Statistics, Books with tags Bayes factor, Bayesian Analysis, Bayesian model choice, e-values, full Bayesian significance test, logic journal of the IGPL, measure theory, Murray Aitkin, p-values, São Paulo, statistical inference on December 18, 2014 by xi'an

**A**mong the many comments (thanks!) I received when posting our Testing via mixture estimation paper came the suggestion to relate this approach to the notion of full Bayesian significance test (FBST) developed by (Julio, not Hal) Stern and Pereira, from São Paulo, Brazil. I thus had a look at this alternative and read the Bayesian Analysis paper they published in 2008, as well as a paper recently published in Logic Journal of IGPL. (I could not find what the IGPL stands for.) The central notion in these papers is the *e-value*, which provides the *posterior probability that the posterior density is larger than the largest posterior density over the null set*. This definition bothers me, first because the *null* set has a measure equal to zero under an absolutely continuous prior (BA, p.82). Hence the posterior density is defined in an arbitrary manner over the *null* set and the maximum is itself arbitrary. (An issue that invalidates my 1993 version of the Lindley-Jeffreys paradox!) And second because it considers the posterior probability of an event that does not exist a priori, being conditional on the data. This sounds in fact quite similar to *Statistical Inference*, Murray Aitkin’s (2009) book using a posterior distribution of the likelihood function. With the same drawback of using the data twice. And the other issues discussed in our commentary on the book. (As a side-much-on-the-side remark, the authors incidentally forgot me when citing our 1992 Annals of Statistics paper about decision theory on accuracy estimators..!)
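For a concrete feel of the probability described above, here is a small sketch in the simplest case I can think of: a normal posterior and a point null θ_{0}. Both the normal posterior and the helper names are assumptions made for illustration, not the construction used in the FBST papers.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_density_exceeds_null(post_mean, post_sd, theta0):
    """Posterior probability of {theta : p(theta | x) > p(theta0 | x)}.

    Assumes a N(post_mean, post_sd^2) posterior: the density is symmetric
    and unimodal, so the set is the interval |theta - post_mean| < |theta0 - post_mean|.
    """
    z = abs(theta0 - post_mean) / post_sd
    return 2.0 * normal_cdf(z) - 1.0


if __name__ == "__main__":
    for theta0 in (0.0, 0.5, 1.0, 1.96, 3.0):
        print(theta0, round(prob_density_exceeds_null(0.0, 1.0, theta0), 4))
```

Note that the event whose probability is computed is itself built from the observed posterior, which is the second objection raised above.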

## the demise of the Bayes factor

Posted in Books, Kids, Statistics, Travel, University life with tags Bayes factor, Bayesian hypothesis testing, component of a mixture, consistency, hyperparameter, model posterior probabilities, posterior, prior, testing as mixture estimation on December 8, 2014 by xi'an

**W**ith Kaniav Kamary, Kerrie Mengersen, and Judith Rousseau, we have just arXived (and submitted) a paper entitled “Testing hypotheses via a mixture model”. (We actually presented some earlier version of this work in Cancún, Vienna, and Gainesville, so you may have heard of it already.) The notion we advocate in this paper is to replace the posterior probability of a model or a hypothesis with the posterior distribution of the weights of a mixture of the models under comparison. That is, given two models under comparison,

$$\mathfrak{M}_1:\ x \sim f_1(x|\theta_1) \quad\text{versus}\quad \mathfrak{M}_2:\ x \sim f_2(x|\theta_2),$$

we propose to estimate the (artificial) mixture model

$$\mathfrak{M}_\alpha:\ x \sim \alpha f_1(x|\theta_1) + (1-\alpha) f_2(x|\theta_2), \qquad 0 \le \alpha \le 1,$$

and in particular derive the posterior distribution of α. One may object that the mixture model is neither of the two models under comparison but this is the case at the boundary, i.e., when α=0,1. Thus, if we use prior distributions on α that favour the neighbourhoods of 0 and 1, we should be able to see the posterior concentrate near 0 or 1, depending on which model is true. And indeed this is the case: for any given Beta prior on α, we observe a higher and higher concentration at the correct boundary as the sample size increases, and we establish a convergence result to this effect. Furthermore, the mixture approach offers numerous advantages, among which *[verbatim from the paper]*:
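As a rough illustration of this concentration, the sketch below computes the posterior of the weight α of the first component on a grid. It makes strong simplifying assumptions that are mine and not the paper's: both components are fully specified normals, N(0,1) and N(2,1), so there are no θ parameters to estimate, the prior on α is a Be(0.5, 0.5), and the posterior is approximated on a grid rather than by the samplers used in the paper.

```python
import math
import random


def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def alpha_posterior(data, a0=0.5, mu1=0.0, mu2=2.0, grid_size=999):
    """Grid approximation to the posterior of alpha under a Be(a0, a0) prior.

    The sampling model is alpha * N(mu1, 1) + (1 - alpha) * N(mu2, 1),
    with both components fully known (illustrative assumption).
    """
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    f1 = [normal_pdf(x, mu1) for x in data]
    f2 = [normal_pdf(x, mu2) for x in data]
    log_post = []
    for a in grid:
        log_prior = (a0 - 1.0) * (math.log(a) + math.log(1.0 - a))
        log_lik = sum(math.log(a * u + (1.0 - a) * v) for u, v in zip(f1, f2))
        log_post.append(log_prior + log_lik)
    top = max(log_post)  # subtract the maximum before exponentiating
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return grid, [w / total for w in weights]


def posterior_mean(grid, weights):
    """Posterior mean of alpha from the normalised grid weights."""
    return sum(a * w for a, w in zip(grid, weights))


if __name__ == "__main__":
    rng = random.Random(1)
    for n in (10, 100, 1000):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # simulated from the first model
        grid, weights = alpha_posterior(data)
        print(n, round(posterior_mean(grid, weights), 3))
```

Since the data are simulated from the first component, the posterior mean of α should drift towards 1 as n grows; swapping the simulation to N(2,1) should push it towards 0.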

## posterior predictive distributions of Bayes factors

Posted in Books, Kids, Statistics with tags Bayes factor, Bayesian predictive, Bayesian tests, posterior predictive on October 8, 2014 by xi'an

**O**nce a Bayes factor B(y) is computed, one needs to assess its strength. As repeated many times here, Jeffreys’ scale has no validation whatsoever: it is simply a division of the (1,∞) range into regions of convenience. Following earlier proposals in the literature (Box, 1980; García-Donato and Chen, 2005; Geweke and Amisano, 2008), an evaluation of this strength within the issue at stake, i.e. the comparison of two models, can be based on the predictive distribution. While most authors (like García-Donato and Chen) consider the prior predictive, I think using the posterior predictive distribution is more relevant since

- it exploits the information contained in the data y, thus concentrates on a region of relevance in the parameter space(s), which is especially interesting in weakly informative settings (even though we should abstain from testing in those cases, dixit Andrew);
- it reproduces the behaviour of the Bayes factor B(x) for values x of the observation similar to the original observation y;
- it does not hide issues of indeterminacy linked with improper priors: the Bayes factor B(x) remains indeterminate, even with a well-defined predictive;
- it does not separate between errors of type I and errors of type II but instead uses the natural summary provided by the Bayesian analysis, namely the predictive distribution π(x|y);
- as long as the evaluation is not used to reach a decision, there is no issue of “using the data twice”, we are simply producing an estimator of the posterior loss, for instance the (posterior) probability of selecting the wrong model. The Bayes factor B(x) is thus functionally independent of y, while x is probabilistically dependent on y.

Note that, even though probabilities of errors of type I and errors of type II can be computed, they fail to account for the posterior probabilities of both models. (This is the delicate issue with the solution of García-Donato and Chen.) Another nice feature is that the predictive distribution of the Bayes factor can be computed even in complex settings where ABC needs to be used.
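A toy version of this evaluation can be sketched as follows. Everything specific in it is an assumption made for illustration: a single observation, M1: x ~ N(0,1) against M2: x ~ N(θ,1) with θ ~ N(0, τ²), equal prior weights on the two models, and a predictive π(x|y) that averages over both models with their posterior probabilities.

```python
import math
import random


def bayes_factor(x, tau=1.0):
    """B12(x) for M1: x ~ N(0, 1) against M2: x ~ N(theta, 1), theta ~ N(0, tau^2)."""
    v = 1.0 + tau**2  # marginal variance of x under M2
    return math.sqrt(v) * math.exp(-0.5 * x * x * tau**2 / v)


def predictive_bayes_factors(y, tau=1.0, draws=10_000, seed=0):
    """Simulate x from the posterior predictive given y and return the values B12(x)."""
    rng = random.Random(seed)
    b = bayes_factor(y, tau)
    p1 = b / (1.0 + b)  # posterior probability of M1 under equal prior weights
    shrink = tau**2 / (1.0 + tau**2)
    out = []
    for _ in range(draws):
        if rng.random() < p1:
            x = rng.gauss(0.0, 1.0)  # predictive under M1
        else:
            theta = rng.gauss(shrink * y, math.sqrt(shrink))  # posterior of theta under M2
            x = rng.gauss(theta, 1.0)
        out.append(bayes_factor(x, tau))
    return out


if __name__ == "__main__":
    y = 1.5
    values = predictive_bayes_factors(y)
    favours_m1 = bayes_factor(y) > 1.0
    same_side = sum((v > 1.0) == favours_m1 for v in values) / len(values)
    print("B12(y) =", round(bayes_factor(y), 3))
    print("share of predictive draws favouring the same model:", round(same_side, 3))
```

The empirical distribution of the simulated B(x) is then what B(y) gets compared with, instead of a fixed scale.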

## all models are wrong

Posted in Statistics, University life with tags ABC, Bayes factor, Bayesian model choice, George Box, model posterior probabilities, Molecular Ecology, phylogenetic model, phylogeography on September 27, 2014 by xi'an

“Using ABC to evaluate competing models has various hazards and comes with recommended precautions (Robert et al. 2011), and unsurprisingly, many if not most researchers have a healthy scepticism as these tools continue to mature.”

**M**ichael Hickerson just published an open-access letter with the above title in Molecular Ecology. (As in several earlier papers, incl. the (in)famous ones by Templeton, Hickerson confuses running an ABC algorithm with conducting Bayesian model comparison, but this is not the main point of this post.)

“Rather than using ABC with weighted model averaging to obtain the three corresponding posterior model probabilities while allowing for the handful of model parameters (θ, τ, γ, Μ) to be estimated under each model conditioned on each model’s posterior probability, these three models are sliced up into 143 ‘submodels’ according to various parameter ranges.”

**T**he letter is in fact a supporting argument for the earlier paper of Pelletier and Carstens (2014, Molecular Ecology) which conducted the above splitting experiment. I could not read this paper so cannot judge the relevance of splitting the parameter range this way. From what I understand it amounts to using mutually exclusive priors by using different supports.

“Specifically, they demonstrate that as greater numbers of the 143 sub-models are evaluated, the inference from their ABC model choice procedure becomes increasingly.”

**A**n interestingly cut sentence. Increasingly unreliable? mediocre? weak?

“…with greater numbers of models being compared, the most probable models are assigned diminishing levels of posterior probability. This is an expected result…”

**T**rue, if the number of models under consideration increases, under a uniform prior over model indices, the posterior probability of a given model mechanically decreases. But the pairwise Bayes factors should not be impacted by the number of models under comparison and the letter by Hickerson states that Pelletier and Carstens found the opposite:

“…pairwise Bayes factor[s] will always be more conservative except in cases when the posterior probabilities are equal for all models that are less probable than the most probable model.”

**W**hich means that the “Bayes factor” in this study is computed as the ratio of a marginal likelihood and of a compound (or super-marginal) likelihood, averaged over all models and hence incorporating the prior probabilities of the model indices as well. I had never encountered such a proposal before. Contrary to the letter’s claim:

“…using the Bayes factor, incorporating all models is perhaps more consistent with the Bayesian approach of incorporating all uncertainty associated with the ABC model choice procedure.”

**B**esides the needless inclusion of ABC in this sentence, a somewhat confusing sentence, as Bayes factors are not, *stricto sensu*, Bayesian procedures since they remove the prior probabilities from the picture.
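The earlier remark that posterior model probabilities shrink mechanically while pairwise Bayes factors do not can be checked on a few made-up numbers. The marginal likelihood values and function names below are purely hypothetical.

```python
def posterior_probabilities(marginals):
    """Posterior model probabilities under a uniform prior over model indices."""
    total = sum(marginals)
    return [m / total for m in marginals]


def pairwise_bayes_factor(marginals, i, j):
    """Bayes factor of model i against model j: a ratio of two marginal likelihoods."""
    return marginals[i] / marginals[j]


if __name__ == "__main__":
    marginals = [0.30, 0.10, 0.05, 0.05, 0.04, 0.03]  # made-up marginal likelihoods
    for k in (2, 4, 6):
        subset = marginals[:k]  # compare only the first k models
        top_probability = posterior_probabilities(subset)[0]
        print(k, round(top_probability, 3), round(pairwise_bayes_factor(subset, 0, 1), 3))
```

The posterior probability of the first model drops as models are added, while its Bayes factor against the second model never involves the other models, nor the prior weights.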

“Although the outcome of model comparison with ABC or other similar likelihood-based methods will always be dependent on the composition of the model set, and parameter estimates will only be as good as the models that are used, model-based inference provides a number of benefits.”

**A**ll models are wrong but the very fact that they are models allows for producing pseudo-data from those models and for checking if the pseudo-data is similar enough to the observed data, in the components that matter the most for the experimenter. Hence a loss function of sorts…