## demystify Lindley’s paradox [or not]

Posted in Statistics on March 18, 2020 by xi'an

Another paper on Lindley’s paradox appeared on arXiv yesterday, by Guosheng Yin and Haolun Shi, interpreting posterior probabilities as p-values. The core of this resolution is to express a two-sided hypothesis as a combination of two one-sided hypotheses in opposite directions, then taking advantage of the near equivalence of posterior probabilities under some non-informative prior and p-values in the latter case, as already noted by George Casella and Roger Berger (1987) and presumably earlier. The point is that one-sided hypotheses are quite friendly to improper priors, since they only require a single prior distribution, rather than the two needed when point nulls are under consideration. The p-value created by merging both one-sided hypotheses makes little sense to me, as it means testing that both θ≥0 and θ≤0, resulting in the proposal of a p-value that is twice the minimum of the one-sided p-values, maybe due to a Bonferroni correction, although the true value should be zero… I thus see little support for this approach to resolving Lindley’s paradox, in that it bypasses the toxic nature of point-null hypotheses, which require a change of prior toward a mixture supporting one hypothesis and the other. Here the posterior of the point-null hypothesis is defined in exactly the same way the p-value is defined, hence making the outcome most favourable to the agreement but not truly addressing the issue.
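
To make the one-sided agreement concrete, here is a minimal Python sketch. It assumes a normal mean model with known σ and a flat (improper) prior on θ, which is an illustrative choice and not necessarily the setting of the Yin and Shi paper; the function name `one_sided_summary` and the numerical values are hypothetical. Under these assumptions the posterior probability of θ≤0 coincides with the one-sided p-value, and the merged "two-sided" quantity is simply twice the smaller of the two one-sided p-values.

```python
from statistics import NormalDist


def one_sided_summary(xbar, n, sigma=1.0):
    """Compare P(theta <= 0 | xbar) under a flat prior with one-sided p-values.

    Assumed model (illustration only): xbar ~ N(theta, sigma^2 / n), sigma known.
    """
    std = NormalDist()
    se = sigma / n ** 0.5
    z = xbar / se

    # Flat prior on theta gives the posterior theta | xbar ~ N(xbar, sigma^2 / n).
    post_le_zero = NormalDist(mu=xbar, sigma=se).cdf(0.0)

    # One-sided p-values for H0: theta <= 0 and for H0: theta >= 0.
    p_right = 1.0 - std.cdf(z)
    p_left = std.cdf(z)

    # The merged quantity discussed in the post: twice the smaller p-value.
    merged = 2.0 * min(p_left, p_right)
    return post_le_zero, p_right, merged


post, p_one_sided, merged = one_sided_summary(xbar=0.2, n=100)
print(post, p_one_sided, merged)
```

The first two returned values agree up to rounding error, which is the Casella and Berger type of reconciliation; nothing in this computation involves a prior mass on the point null θ=0.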

## unrejected null [xkcd]

Posted in Statistics on July 18, 2018 by xi'an

## estimation versus testing [again!]

Posted in Books, Statistics, University life on March 30, 2017 by xi'an

The following text is a review I wrote of the paper “Parameter estimation and Bayes factors”, written by J. Rouder, J. Haaf, and J. Vandekerckhove. (The journal to which it is submitted gave me the option to sign my review.)

The opposition between estimation and testing as a matter of prior modelling rather than inferential goals is quite unusual in the Bayesian literature. In particular, if one follows Bayesian decision theory as in Berger (1985) there is no such opposition, but rather the use of different loss functions for different inference purposes, while the Bayesian model remains single and unitarian.

Following Jeffreys (1939), it sounds more congenial to the Bayesian spirit to return the posterior probability of a hypothesis H⁰ as an answer to the question whether this hypothesis holds or does not hold. This however proves impossible when the “null” hypothesis H⁰ has prior mass equal to zero (or is not measurable under the prior). In such a case the mathematical answer is a probability of zero, which may not satisfy the experimenter who asked the question. More fundamentally, the said prior proves inadequate to answer the question and hence to incorporate the information contained in this very question. This is how Jeffreys (1939) justifies the move from the original (and deficient) prior to one that puts some weight on the null (hypothesis) space. It is often argued that the move is unnatural and that the null space does not make sense, but this only applies when believing very strongly in the model itself. When considering the issue from a modelling perspective, accepting the null H⁰ means using a new model to represent the data and hence testing becomes a model choice problem, namely whether one should use a complex or a simplified model to represent the generation of the data. This is somewhat the “unification” advanced in the current paper, although it appears originally in Jeffreys (1939) [and then numerous others] rather than in the relatively recent Mitchell & Beauchamp (1988), who may have launched the spike & slab denomination.

I have trouble with the analogy drawn in the paper between the spike & slab estimate and the Stein effect. While the posterior mean derived from the spike & slab posterior is indeed a quantity drawn towards zero by the Dirac mass at zero, this is rarely the point of using a spike & slab prior, since this point estimate does not lead to a conclusion about the hypothesis: for one thing it is never exactly zero (if zero corresponds to the null). For another thing, the construction of the spike & slab prior is both artificial and dependent on the weights given to the spike and to the slab, respectively, to borrow expressions from the paper. This approach thus leads to model averaging rather than hypothesis testing or model choice and therefore fails to answer the (possibly absurd) question as to which model to choose. Or to refuse to choose. But there are cases when a decision must be made, like continuing a clinical trial or putting a new product on the market. Or not.
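
As a small illustration of the two points above, the following Python sketch assumes a single observation x ~ N(θ, 1) and a prior mixing a Dirac mass at zero (weight `w`) with a N(0, τ²) slab. This toy model, the function names and the default values are assumptions made for the example and are not taken from the paper under review. The sketch returns the posterior probability of the null and the model-averaged posterior mean.

```python
from math import exp, pi, sqrt


def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return exp(-0.5 * x * x / var) / sqrt(2.0 * pi * var)


def spike_slab_summary(x, w=0.5, tau2=1.0):
    """Posterior null probability and posterior mean for one x ~ N(theta, 1).

    Assumed prior (illustration only): theta = 0 with probability w (spike),
    theta ~ N(0, tau2) with probability 1 - w (slab).
    """
    m0 = normal_pdf(x, 1.0)          # marginal density of x under the spike
    m1 = normal_pdf(x, 1.0 + tau2)   # marginal density of x under the slab
    post_null = w * m0 / (w * m0 + (1.0 - w) * m1)

    # Under the slab the posterior mean is x * tau2 / (1 + tau2);
    # the spike contributes zero, so the average is shrunk further.
    post_mean = (1.0 - post_null) * x * tau2 / (1.0 + tau2)
    return post_null, post_mean


for w in (0.1, 0.5, 0.9):
    print(w, spike_slab_summary(2.0, w=w))
```

For any nonzero x the averaged posterior mean is shrunk towards zero without ever being exactly zero, and both outputs move with the weight `w`, so the point estimate by itself does not select a model.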

In conclusion, the paper surprisingly bypasses the decision-making aspect of testing and hence ends up with an inconclusive setting, staying midstream between Bayes factors and credible intervals. And failing to provide a tool for decision making. The paper also fails to acknowledge the strong dependence of the Bayes factor on the tail behaviour of the prior(s), which cannot be [completely] corrected by a finite sample, hence its relativity and the unreasonableness of a fixed scale like Jeffreys’ (1939).

## Measuring statistical evidence using relative belief [book review]

Posted in Books, Statistics, University life on July 22, 2015 by xi'an

“It is necessary to be vigilant to ensure that attempts to be mathematically general do not lead us to introduce absurdities into discussions of inference.” (p.8)

This new book by Michael Evans (Toronto) summarises his views on statistical evidence (expanded in a large number of papers), which are a quite unique mix of Bayesian principles and less-Bayesian methodologies. I am quite glad I could receive a version of the book before it was published by CRC Press, thanks to Rob Carver (and Keith O’Rourke for warning me about it). [Warning: this is a rather long review and post, so readers may choose to opt out now!]

“The Bayes factor does not behave appropriately as a measure of belief, but it does behave appropriately as a measure of evidence.” (p.87)

## Adaptive revised standards for statistical evidence [guest post]

Posted in Books, Statistics, University life on March 25, 2014 by xi'an

[Here is a discussion of Valen Johnson’s PNAS paper written by Luis Pericchi, Carlos Pereira, and María-Eglée Pérez, in conjunction with an arXived paper of theirs that I never came to discuss. This has been accepted by PNAS along with a large number of other letters. Our discussion permuting the terms of the original title also got accepted.]

Johnson [1] argues for lowering the bar of statistical significance from 0.05 and 0.01 to 0.005 and 0.001, respectively. There is growing evidence that the canonical fixed standards of significance are inappropriate. However, the author simply proposes other fixed standards. The essence of the problem of classical significance testing lies in its goal of minimizing type II error (false negative) for a fixed type I error (false positive). A real departure would instead be to minimize a weighted sum of the two errors, as proposed by Jeffreys [2]. Significance levels that are constant with respect to sample size do not balance errors. Size levels of 0.005 and 0.001 will certainly lower false positives (type I error) at the expense of increasing type II error, unless the study is carefully designed, which is not always the case or even possible. If the sample size is small, the type II error can become unacceptably large. On the other hand, for large sample sizes, the 0.005 and 0.001 levels may be too high. Consider the Psychokinetic data of Good [3]: the null hypothesis is that individuals cannot change by mental concentration the proportion of 1’s in a sequence of n = 104,490,000 0’s and 1’s, generated originally with a proportion of 1/2. The proportion of 1’s recorded was 0.5001768. The observed p-value is p = 0.0003; therefore, according to the present revision of standards, the null hypothesis is still rejected and a Psychokinetic effect claimed. This is contrary to intuition and to virtually any Bayes factor. On the other hand, to make the standards adaptable to the amount of information (see also Raftery [4]), Perez and Pericchi [5] approximate the behavior of Bayes factors by

$\alpha_{\mathrm{ref}}(n)=\alpha\,\dfrac{\sqrt{n_0(\log(n_0)+\chi^2_\alpha(1))}}{\sqrt{n(\log(n)+\chi^2_\alpha(1))}}$

This formula establishes a bridge between carefully designed tests and the adaptive behavior of Bayesian tests: the value $n_0$ comes from a theoretical design for which a value of both errors has been specified, and $n$ is the actual (larger) sample size. In the Psychokinetic data, $n_0 = 44{,}529$ for a type I error of 0.01 and a type II error of 0.05 to detect a difference of 0.01. Then $\alpha_{\mathrm{ref}}(104{,}490{,}000) = 0.00017$ and the null of no Psychokinetic effect is accepted.
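
The arithmetic behind these figures can be checked with a short Python sketch. It assumes that $\chi^2_\alpha(1)$ denotes the upper-α quantile of a chi-square distribution with one degree of freedom (obtained here as the square of a standard normal quantile) and that the reported p-value is the two-sided normal approximation for a proportion; both readings are assumptions, and the function names are made up for the example.

```python
from math import log, sqrt
from statistics import NormalDist


def chi2_upper_quantile_1df(alpha):
    """Upper-alpha quantile of chi-square(1), via the squared normal quantile."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z


def alpha_ref(n, n0, alpha):
    """Adaptive level alpha_ref(n) from the displayed formula."""
    c = chi2_upper_quantile_1df(alpha)  # assumed meaning of chi^2_alpha(1)
    return alpha * sqrt(n0 * (log(n0) + c)) / sqrt(n * (log(n) + c))


def two_sided_p_value(prop, n, p0=0.5):
    """Two-sided p-value for H0: proportion = p0 (normal approximation)."""
    z = (prop - p0) / sqrt(p0 * (1.0 - p0) / n)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


n, n0 = 104_490_000, 44_529
level = alpha_ref(n, n0, alpha=0.01)
p = two_sided_p_value(0.5001768, n)
print(f"alpha_ref = {level:.5f}, p-value = {p:.4f}, reject null: {p < level}")
```

Under these assumptions the p-value stays above the adaptive level although it falls below the fixed 0.001 threshold, which is the contrast drawn in the text.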

A simple constant recipe is not the solution to the problem. The standard by which to judge the evidence should be a function of the amount of information. Johnson’s main message is to toughen the standards and design the experiments accordingly. This is welcome whenever possible. But it does not balance type I and type II errors: it would be misleading to pass on the message to now use standards divided by ten, regardless of type II errors or sample sizes. This would move the problem without solving it.