## using mixtures towards Bayes factor approximation

Posted in Statistics, Travel, University life on December 11, 2014 by xi'an

Phil O’Neill and Theodore Kypraios from the University of Nottingham arXived last week a paper on “Bayesian model choice via mixture distributions with application to epidemics and population process models”. Since we discussed this paper during my visit there earlier this year, I was definitely looking forward to the completed version of their work. Especially because there are some superficial similarities with our most recent work on… Bayesian model choice via mixtures! (To the point that I at first mistook their proposal for ours…)

The central idea in the paper is that, by considering the mixture likelihood

$\alpha\ell_1(\theta_1|\mathbf{x})+(1-\alpha)\ell_2(\theta_2|\mathbf{x})$

where x corresponds to the entire sample, it is straightforward to relate the moments of α with the Bayes factor, namely

$\mathfrak{B}_{12}=\dfrac{\mathbb{E}[\alpha]-\mathbb{E}[\alpha^2]-\mathbb{E}[\alpha|\mathbf{x}](1-\mathbb{E}[\alpha])}{\mathbb{E}[\alpha]\mathbb{E}[\alpha|\mathbf{x}]-\mathbb{E}[\alpha^2]}$

which means that estimating the mixture weight α by MCMC is equivalent to estimating the Bayes factor.
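As a quick sanity check of this identity, here is a minimal sketch, not the authors' algorithm: it assumes a Beta(a,b) prior on α and two made-up marginal likelihood values (`marg1`, `marg2`, both hypothetical), computes the posterior mean of α by brute-force quadrature rather than by MCMC, and then recovers the Bayes factor from the prior and posterior moments alone. All function names are mine.

```python
def beta_moments(a, b):
    """First two prior moments of alpha under a Beta(a, b) prior."""
    e1 = a / (a + b)
    e2 = a * (a + 1) / ((a + b) * (a + b + 1))
    return e1, e2


def posterior_mean_alpha(a, b, marg1, marg2, n_grid=100_000):
    """Posterior mean of alpha by midpoint quadrature.

    The posterior is proportional to
    alpha^(a-1) (1-alpha)^(b-1) [alpha * marg1 + (1-alpha) * marg2],
    where marg1 and marg2 stand for the two marginal likelihoods
    (hypothetical numbers here, unknown in a real problem).
    """
    num = den = 0.0
    for i in range(n_grid):
        alpha = (i + 0.5) / n_grid
        weight = (alpha ** (a - 1) * (1 - alpha) ** (b - 1)
                  * (alpha * marg1 + (1 - alpha) * marg2))
        num += alpha * weight
        den += weight
    return num / den


def bayes_factor_from_moments(e1, e2, post_mean):
    """Bayes factor B12 recovered from E[alpha], E[alpha^2], E[alpha|x]."""
    return (e1 - e2 - post_mean * (1 - e1)) / (e1 * post_mean - e2)


# Illustrative values only: the true Bayes factor is marg1 / marg2 = 3.
a, b = 2.0, 3.0
marg1, marg2 = 0.3, 0.1
e1, e2 = beta_moments(a, b)
post_mean = posterior_mean_alpha(a, b, marg1, marg2)
print(post_mean, bayes_factor_from_moments(e1, e2, post_mean))
```

Note how little the posterior mean moves away from the prior mean (about 0.44 against 0.4 here) and how small the denominator of the identity is, which suggests that any Monte Carlo error in the estimate of E[α|x] gets amplified in the Bayes factor.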

What puzzled me at first was that the mixture weight is in fine estimated with a single “datapoint”, made of the entire sample. So the posterior distribution on α is hardly different from the prior, since it solely varies by one unit! But I came to realise that this is a numerical tool and that the estimator of α is not meaningful from a statistical viewpoint (thus differing completely from our perspective). This explains why the Beta prior on α can be freely chosen so that the mixing and stability of the Markov chain are improved: this parameter is solely an algorithmic entity.

There are similarities between this approach and the pseudo-prior encompassing perspective of Carlin and Chib (1995), even though the current version does not require pseudo-priors, using true priors instead. But thinking of weakly informative priors and of the MCMC consequence (see below) leads me to wonder if pseudo-priors would not help in this setting…

Another aspect of the paper that still puzzles me is that the MCMC algorithm mixes at all: indeed, depending on the value of the binary latent variable z, one of the two parameters is updated from the true posterior while the other is updated from the prior. It thus seems unlikely that the value of z would change quickly. Creating a huge imbalance in the prior can counteract this difference, but the same problem occurs once z has moved from 0 to 1 or from 1 to 0. It seems to me that resorting to a common parameter [if possible] and using as a proposal the model-based posteriors for both parameters is the only way out of this conundrum. (We do certainly insist on this common parametrisation in our approach as it is paramount to the use of improper priors.)

“In contrast, we consider the case where there is only one datum.”

The idea in the paper is therefore fully computational and relates to other linkage methods that create bridges between two models. It differs from our new notion of Bayesian testing in that we consider estimating the mixture between the two models under comparison, hence considering instead the mixture

$\prod_{i=1}^n\alpha f_1(x_i|\theta_1)+(1-\alpha) f_2(x_i|\theta_2)$

which is another model altogether and does not recover the original Bayes factor (a Bayes factor that we dismiss altogether in favour of the posterior median of α and its entire distribution).

## differences between Bayes factors and normalised maximum likelihood

Posted in Books, Kids, Statistics, University life on November 19, 2014 by xi'an

A recent arXival by Heck, Wagenmakers and Morey attracted my attention: Three Qualitative Differences Between Bayes Factors and Normalized Maximum Likelihood, as it provides an analysis of the differences between Bayesian analysis and Rissanen’s Optimal Estimation of Parameters that I reviewed a while ago. As detailed in this review, I had difficulties with considering the normalised likelihood

$p(x|\hat\theta_x) \big/ \int_\mathcal{X} p(y|\hat\theta_y)\,\text{d}y$

as the relevant quantity. One reason being that the distribution does not make experimental sense: for instance, how can one simulate from this distribution? [I mean, when considering only the original distribution.] Working with the simple binomial B(n,θ) model, the authors show that the quantity corresponding to the posterior probability may be constant for most of the data values, produces a different upper bound and hence a different penalty of model complexity, and may differ in conclusion for some observations. Which means that the apparent proximity between using a Jeffreys prior and Rissanen’s alternative does not go all the way. While it is a short note and only focussed on producing an illustration in the Binomial case, I find it interesting that researchers investigate the Bayesian nature (vs. artifice!) of this approach…
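For the binomial model the sample space is finite, so the normalising integral above becomes a sum over y = 0,…,n and the normalised maximum likelihood can be computed by plain enumeration. The sketch below is my own illustration, not the authors' analysis: it puts this quantity next to the marginal likelihood under the Jeffreys Beta(1/2,1/2) prior, with n = 20 an arbitrary choice and all function names hypothetical.

```python
from math import comb, exp, lgamma


def max_likelihood(x, n):
    """Binomial likelihood at the MLE theta_hat = x / n (with 0**0 == 1)."""
    theta_hat = x / n
    return comb(n, x) * theta_hat ** x * (1 - theta_hat) ** (n - x)


def nml_pmf(n):
    """Normalised maximum likelihood over x = 0, ..., n."""
    weights = [max_likelihood(y, n) for y in range(n + 1)]
    total = sum(weights)  # the sum replacing the integral over the sample space
    return [w / total for w in weights]


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def jeffreys_marginal(n):
    """Marginal likelihood of x under the Jeffreys Beta(1/2, 1/2) prior."""
    return [
        comb(n, x) * exp(log_beta(x + 0.5, n - x + 0.5) - log_beta(0.5, 0.5))
        for x in range(n + 1)
    ]


n = 20  # arbitrary sample size, for illustration only
for x, (p_nml, p_jeff) in enumerate(zip(nml_pmf(n), jeffreys_marginal(n))):
    print(x, round(p_nml, 4), round(p_jeff, 4))
```

Both columns sum to one and are symmetric in x, so comparing them value by value gives a rough feel for where the two approaches agree and where they part ways.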

## how to translate evidence into French?

Posted in Books, Statistics, University life on July 6, 2014 by xi'an

I got this email (in French, translated here) a few minutes ago from Gauvain, who is writing a PhD in philosophy of science:

The author of the text I have to translate refers to Bayes factors as a “Bayesian measure of evidence”, and to p-value tests as a “frequentist measure of evidence”. I was wondering whether there is a recognised and established French translation of the expression “measure of evidence”. I have sometimes come across “mesure d’évidence”, which looks very much like an anglicism, and sometimes “estimateur de preuve”, which seems to me liable to cause confusion with other uses of the term “estimateur”.

which (pardon my French!) wonders how to translate the term evidence into French. It would sound natural that the French évidence is the answer but this is not the case. Despite sharing the same Latin root (evidentia), since the English version comes from medieval French, the two words have different meanings: in English, it means a collection of facts coming to support an assumption or a theory, while in French it means something obvious, whose truth is immediately perceived. Surprisingly, English kept the adjective evident with the same [obvious] meaning as the French évident. But the noun moved towards a much less definitive meaning, both in Law and in Science. I had never thought of the huge gap between the two meanings but must have been surprised at its use the first time I heard it in English. But I do not think about it any longer, as when I reviewed Sober’s Evidence and Evolution.

One may wonder at the best possible translation of evidence into French. Even though marginal likelihood (vraisemblance marginale) is just fine for statistical purposes, I would suggest faisceau de présomptions or degré de soutien or yet intensité de soupçon as (lengthy) solutions. Soupçon could work as such, but has a fairly negative ring…

## Adaptive revised standards for statistical evidence [guest post]

Posted in Books, Statistics, University life on March 25, 2014 by xi'an

[Here is a discussion of Valen Johnson’s PNAS paper written by Luis Pericchi, Carlos Pereira, and María-Eglée Pérez, in conjunction with an arXived paper of theirs that I never got around to discussing. This has been accepted by PNAS along with a large number of other letters. Our discussion permuting the terms of the original title also got accepted.]

Johnson [1] argues for lowering the bar of statistical significance from 0.05 and 0.01 to 0.005 and 0.001, respectively. There is growing evidence that the canonical fixed standards of significance are inappropriate. However, the author simply proposes other fixed standards. The essence of the problem of classical significance testing lies in its goal of minimizing type II error (false negative) for a fixed type I error (false positive). A real departure would instead be to minimize a weighted sum of the two errors, as proposed by Jeffreys [2]. Significance levels that are constant with respect to sample size do not balance errors. Levels of 0.005 and 0.001 will certainly lower false positives (type I error), at the expense of increasing type II error, unless the study is carefully designed, which is not always the case or even possible. If the sample size is small, the type II error can become unacceptably large. On the other hand, for large sample sizes, the 0.005 and 0.001 levels may be too high. Consider the Psychokinetic data of Good [3]: the null hypothesis is that individuals cannot change by mental concentration the proportion of 1’s in a sequence of n = 104,490,000 0’s and 1’s, generated originally with a proportion of 1/2. The proportion of 1’s recorded was 0.5001768. The observed p-value is p = 0.0003; therefore, according to the present revision of standards, the null hypothesis is still rejected and a Psychokinetic effect claimed. This is contrary to intuition and to virtually any Bayes factor. On the other hand, to make the standards adaptable to the amount of information (see also Raftery [4]), Perez and Pericchi [5] approximate the behavior of Bayes factors by

$\alpha_{\mathrm{ref}}(n)=\alpha\,\dfrac{\sqrt{n_0(\log(n_0)+\chi^2_\alpha(1))}}{\sqrt{n(\log(n)+\chi^2_\alpha(1))}}$

This formula establishes a bridge between carefully designed tests and the adaptive behavior of Bayesian tests: the value n0 comes from a theoretical design for which a value of both errors has been specified, and n is the actual (larger) sample size. For the Psychokinetic data, n0 = 44,529 for a type I error of 0.01 and a type II error of 0.05 to detect a difference of 0.01. Then αref(104,490,000) = 0.00017 and, since the observed p-value of 0.0003 exceeds this level, the null of no Psychokinetic effect is accepted.
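The arithmetic behind these figures can be retraced with a short sketch, under two assumptions that are mine rather than stated in the letter: χ²_α(1) is read as the upper-α quantile of the chi-square distribution with one degree of freedom (obtained below as a squared normal quantile), and the p-value comes from a two-sided normal approximation to the binomial. The function names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def chi2_upper_quantile_1df(alpha):
    """Upper-alpha quantile of a chi-square with 1 df, via Z^2 ~ chi2(1)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


def alpha_ref(n, n0, alpha):
    """Adaptive level alpha_ref(n), calibrated on a design of size n0."""
    q = chi2_upper_quantile_1df(alpha)
    return alpha * sqrt(n0 * (log(n0) + q)) / sqrt(n * (log(n) + q))


def two_sided_p_value(prop, n, null=0.5):
    """Two-sided p-value, normal approximation (an assumed test choice)."""
    z = (prop - null) / sqrt(null * (1 - null) / n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


n, n0 = 104_490_000, 44_529
level = alpha_ref(n, n0, alpha=0.01)
p_value = two_sided_p_value(0.5001768, n)
print(f"alpha_ref = {level:.5f}, p-value = {p_value:.4f}")
print("reject the null" if p_value < level else "do not reject the null")
```

With these inputs the adaptive level rounds to the 0.00017 quoted above and the p-value to 0.0003, so the null is not rejected, while the fixed 0.001 level would reject it.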

A simple constant recipe is not the solution to the problem. The standard by which to judge the evidence should be a function of the amount of information. Johnson’s main message is to toughen the standards and design the experiments accordingly. This is welcome whenever possible. But it does not balance type I and type II errors: it would be misleading to pass on the message “now use standards divided by ten”, regardless of type II errors or sample sizes. This would move the problem without solving it.

## Statistical frontiers (course in Warwick)

Posted in Books, Statistics, University life on January 30, 2014 by xi'an

Today I am teaching my yearly class at Warwick as a short introduction to computational techniques for Bayes factors approximation for MASDOC and PhD students in the Statistical Frontiers seminar, gathering several talks from the past years. Here are my slides:

## robust Bayesian FDR control with Bayes factors [a reply]

Posted in Statistics, University life on January 17, 2014 by xi'an

(Following my earlier discussion of his paper, Xiaoquan Wen sent me this detailed reply.)

I think it is appropriate to start my response to your comments by introducing a little bit of background information on my research interests and on the project itself: I consider myself an applied statistician, not a theorist, and I am interested in developing theoretically sound and computationally efficient methods to solve practical problems. The FDR project originated from a practical application in genomics involving hypothesis testing. The details of this particular application can be found in this published paper, and the simulations in the manuscript are also designed for a similar context. In this application, the null model is trivially defined; however, there exist finitely many alternative scenarios for each test. We proposed a Bayesian solution that handles this complex setting quite nicely: in brief, we chose to model each possible alternative scenario parametrically, and by taking advantage of Bayesian model averaging, the Bayes factor naturally ended up as our test statistic. We had no problem in demonstrating that the resulting Bayes factor is much more powerful than the existing approaches, even accounting for the prior (mis-)modeling for Bayes factors. However, in this genomics application, there are potentially tens of thousands of tests that need to be performed simultaneously, and FDR control becomes both necessary and challenging.

## Statistical evidence for revised standards

Posted in Statistics, University life on December 30, 2013 by xi'an

In yet another permutation of the original title (!), Andrew Gelman posted the answer Val Johnson sent him after our (submitted) letter to PNAS. As Val did not send me a copy (although Andrew did!), I will not reproduce it here and would rather refer the interested readers to Andrew’s blog… In addition to Andrew’s (sensible) points, here are a few idle (post-X’mas and pre-skiing) reflections:

• “evidence against a false null hypothesis accrues exponentially fast” makes me wonder in which metric this exponential rate (in γ?) occurs;
• that “most decision-theoretic analyses of the optimal threshold to use for declaring a significant finding would lead to evidence thresholds that are substantially greater than 5 (and probably also greater 25)” is difficult to accept as an argument since there is no trace of a decision-theoretic argument in the whole paper;
• Val rejects our minimaxity argument on the basis that “[UMPBTs] do not involve minimization of maximum loss” but the prior that corresponds to those tests is minimising the integrated probability of not rejecting at threshold level γ, a loss function integrated against parameter and observation, a Bayes risk in other words… Point masses or spike priors are clearly characteristics of minimax priors. Furthermore, the additional argument that “in most applications, however, a unique loss function/prior distribution combination does not exist” has been used by many to refute the Bayesian perspective and makes me wonder what arguments are left for using a (pseudo-)Bayesian approach;
• the next paragraph is pure tautology: the fact that “no other test, based on either a subjectively or objectively specified alternative hypothesis, is as likely to produce a Bayes factor that exceeds the specified evidence threshold” is a paraphrase of the definition of UMPBTs, not an argument. I do not see why we should solely “worry about false negatives”, since minimising those should lead to a point mass on the null (or, more seriously, should not lead to the minimax-like selection of the prior under the alternative).