## p-values, Bayes factors, and sufficiency

Posted in Books, pictures, Statistics on April 15, 2019 by xi'an

Among the many papers published in this special issue of TAS on statistical significance or lack thereof, there is a paper I had already read before (besides ours!), namely the paper by Jonty Rougier (U of Bristol, hence the picture) on connecting p-values, likelihood ratios, and Bayes factors. Jonty starts from the notion that the p-value is induced by a transform (a summary statistic) of the sample, t(x), such that the larger t(x), the less likely the null hypothesis, with density f⁰(x). He then creates an embedding model by exponential tilting, namely the exponential family with dominating measure f⁰, natural statistic t(x), and a positive parameter θ. In this embedding model, a Bayes factor can be derived from any prior on θ and the p-value satisfies an interesting double inequality, namely that it is less than the likelihood ratio, itself lower than any (other) Bayes factor. One novel aspect from my perspective is that I had thought up to now that this inequality only holds for one-dimensional problems, but there is no constraint here on the dimension of the data x.

A remark I presumably made to Jonty on the first version of the paper is that the p-value itself remains invariant under a bijective increasing transform of the summary t(.). This means that there exists an infinity of such embedding families and that the bound remains true over all such families, although the value of the minimum of this bound over all such families is beyond my reach (could it be the p-value itself?!). This point is also clear in the justification of the analysis thanks to the Pitman-Koopman lemma.

Another remark is that the perspective can be inverted in a more realistic setting when a genuine alternative model M¹ is considered and a genuine likelihood ratio is available. In that case the Bayes factor remains larger than the likelihood ratio, itself larger than the p-value induced by the likelihood ratio statistic. Or its log.
The induced embedded exponential tilting is then a geometric mixture of the null and of the locally optimal member of the alternative. I wonder if there is a parameterisation of this likelihood ratio into a p-value that would turn it into a uniform variate (under the null). Presumably not. While the approach remains firmly entrenched within the realm of p-values and Bayes factors, this exploration of a natural embedding of the original p-value is definitely worth mentioning in a class on the topic! (One typo though, namely that the Bayes factor is mentioned to be lower than one, which is incorrect.)
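
The double inequality discussed in this post can be sketched as a Chernoff-type bound. The notation is an assumption of this sketch rather than necessarily the paper's: M(θ) stands for the normalising constant of the tilted density, assumed finite, and π for an arbitrary prior on θ>0. The embedding family is

$$f_\theta(x) = f^0(x)\,\frac{e^{\theta t(x)}}{M(\theta)},\qquad M(\theta)=\mathbb{E}_0\big[e^{\theta t(X)}\big]$$

and Markov's inequality gives, for every θ>0,

$$p(x) = \mathbb{P}_0\big(t(X)\ge t(x)\big) \le M(\theta)\,e^{-\theta t(x)} = \frac{f^0(x)}{f_\theta(x)}$$

Taking the infimum over θ, and using that a prior average of f_θ(x) cannot exceed its supremum, leads to

$$p(x)\;\le\;\inf_{\theta>0}\frac{f^0(x)}{f_\theta(x)}\;\le\;\frac{f^0(x)}{\int f_\theta(x)\,\pi(\text{d}\theta)}$$

Nothing in this argument depends on the dimension of x, which is consistent with the remark above.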

## abandon ship [value]!!!

Posted in Books, Statistics, University life on March 22, 2019 by xi'an

The Abandon Statistical Significance paper we wrote with Blakeley B. McShane, David Gal, Andrew Gelman, and Jennifer L. Tackett has now appeared in a special issue of The American Statistician, “Statistical Inference in the 21st Century: A World Beyond p < 0.05”. A 400 page special issue with 43 papers available on-line and open access! Food for thought likely to be discussed further here (and elsewhere). The paper and the ideas within have been discussed quite a lot on Andrew’s blog and I will not repeat them here, simply quoting from the conclusion of the paper:

“In this article, we have proposed to abandon statistical significance and offered recommendations for how this can be implemented in the scientific publication process as well as in statistical decision making more broadly. We reiterate that we have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors.”

A position also put forward in a comment by Valentin Amrhein, Sander Greenland, and Blake McShane published in Nature today (and supported by 800+ signatures). Again discussed on Andrew’s blog.

## absurdly unbiased estimators

Posted in Books, Kids, Statistics on November 8, 2018 by xi'an

“…there are important classes of problems for which the mathematics forces the existence of such estimators.”

Recently I came across a short paper written by Erich Lehmann for The American Statistician, Estimation with Inadequate Information. He analyses the apparent absurdity of using unbiased estimators, or even best unbiased estimators, in settings like a single Poisson P(λ) observation X, where the (unique) unbiased estimator of exp(-bλ) is equal to $(1-b)^x$

which is indeed absurd when b>1, since the estimator is then negative for every odd x while the target lies in (0,1). My first reaction to this example is that the question of what is “best” for a single observation is not very meaningful and that adding n independent Poisson observations replaces b with b/n, which eventually gets below one. But Lehmann argues that the paradox stems from a case of missing information, as for instance in the Poisson example where the above quantity is the probability P(T=0) when T=X+Y, Y being another, unobserved, Poisson variate with parameter (b-1)λ. In a lot of such cases, there is no unbiased estimator at all. When there is one, it must take values outside the (0,1) range, thanks to a lemma shown by Lehmann that the conditional expectation of this estimator given T is either zero or one.
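
The Poisson example above can be checked numerically. The following is a minimal sketch, in which the function names and the truncation point `x_max` are choices made for this illustration: it sums the Poisson series to confirm that $(1-b)^x$ has expectation exp(-bλ), and lists the first few values of the estimator when b=3.

```python
import math

def unbiased_estimate(x, b):
    """Lehmann's unbiased estimator of exp(-b*lambda) from one Poisson count x."""
    return (1.0 - b) ** x

def expectation(lam, b, x_max=200):
    """E[(1-b)^X] for X ~ Poisson(lam), by truncated summation (x_max is an arbitrary cut-off)."""
    total = 0.0
    pmf = math.exp(-lam)          # P(X = 0)
    for x in range(x_max + 1):
        total += pmf * unbiased_estimate(x, b)
        pmf *= lam / (x + 1)      # recursion P(X = x+1) = P(X = x) * lam / (x+1)
    return total

if __name__ == "__main__":
    lam, b = 1.5, 3.0
    print(expectation(lam, b), math.exp(-b * lam))        # the two values should agree
    print([unbiased_estimate(x, b) for x in range(4)])    # estimates far outside (0,1)
```

The estimator is unbiased on average, yet no single value it returns for x≥1 lies in the range of the quantity being estimated.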

I find the short paper quite interesting in exposing some reasons why the estimators cannot find enough information within the data (often a single point) to achieve an efficient estimation of the targeted function of the parameter, even though the setting may appear rather artificial.

## almost uniform but far from straightforward

Posted in Books, Kids, Statistics on October 24, 2018 by xi'an

A question on X validated about a [not exactly trivial] maximum likelihood problem for a triangular density led me to a fascinating case, as exposed by Oliver in 1972 in The American Statistician. When considering an asymmetric triangle distribution on (0,þ), þ being fixed, the MLE for the location of the tip of the triangle is necessarily one of the observations [which was not the case in the original question on X validated]. Moreover, it cannot be an order statistic of rank j that stands outside the j-th interval of the uniform partition of (0,þ). Furthermore there are opportunities for observing several global modes… In the X validated case of the symmetric triangular distribution over (0,θ), with ½θ as tip of the triangle, I could not figure out an alternative to the pedestrian solution of looking separately at each of the (n+1) intervals where θ can stand and returning the associated maximum on that interval. Definitely a good (counter-)example about (in)sufficiency for class or exam!
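
Since the MLE of the tip must be one of the observations in the asymmetric case, it can be found by brute force. The sketch below is only an illustration under assumptions of its own: the density is taken as 2x/(þc) below the tip c and 2(þ-x)/(þ(þ-c)) above it, the argument `upper` plays the role of þ, and all observations are assumed to lie strictly inside (0,þ).

```python
import math

def log_lik(c, xs, upper):
    """Log-likelihood of the tip location c for a triangle density on (0, upper)."""
    total = 0.0
    for x in xs:
        if x <= c:
            total += math.log(2.0 * x / (upper * c))
        else:
            total += math.log(2.0 * (upper - x) / (upper * (upper - c)))
    return total

def triangle_mle(xs, upper):
    """Evaluate the likelihood at each observation and return a maximiser."""
    return max(xs, key=lambda c: log_lik(c, xs, upper))

if __name__ == "__main__":
    sample = [0.1, 0.35, 0.4, 0.8]   # made-up data for illustration
    print(triangle_mle(sample, upper=1.0))
```

Between two consecutive order statistics the log-likelihood is convex in c, which is why only the observations need to be inspected. Ties between observations would signal several global modes, in which case `max` simply returns the first one.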

## Gibbs for kidds

Posted in Books, Kids, Statistics, University life on February 12, 2018 by xi'an

A chance (?) question on X validated brought me to re-read Gibbs for Kids, 25 years after it was written (by my close friends George and Ed). The originator of the question had difficulties with the implementation, apparently missing the cyclic pattern of the sampler, as in equations (2.3) and (2.4), and with the convergence, which is only treated for a finite support in the American Statistician paper. The paper [which did not appear in American Statistician under this title!, but inspired an animal breeder, Dan Gianola, to write a “Gibbs for pigs” presentation in 1993 at the 44th Annual Meeting of the European Association for Animal Production, Aarhus, Denmark!!!] most appropriately only contains toy examples since those can be processed and compared to known stationary measures. This is for instance the case for the auto-exponential model $f(x,y) \propto \exp(-xy)$

which is only defined as a probability density for a compact support. (The paper does not identify the model as a special case of auto-exponential model, which apparently made the originator of the model, Julian Besag in 1974, unhappy, as George and I found out when visiting Bath, where Julian was spending the final year of his life, many years later.) I use the limiting case all the time in class to point out that a Gibbs sampler can be devised and operate without a stationary probability distribution. However, being picky!, I would like to point out that, contrary to a comment made in the paper, the Gibbs sampler does not “fail” but on the contrary still “converges” in this case, in the sense that a conditional ergodic theorem applies, i.e., the ratio of the frequencies of visits to two sets A and B with finite measure does converge to the ratio of these measures. For instance, running the Gibbs sampler for 10⁶ steps and checking for the relative frequencies of x’s in (1,2) and (1,3) gives 0.685, versus log(2)/log(3)=0.63, since 1/x is the stationary measure. One important and influential feature of the paper is to stress that proper conditionals do not imply proper joints. George would work much further on that topic, in particular with his PhD student at the time, my friend Jim Hobert.
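
The experiment on the limiting case can be sketched as follows. This is not the code behind the figures quoted above, just a minimal illustration: the function names, the seed, and the starting value 1.5 are arbitrary choices, and the chain is run on the log scale (an implementation choice of this sketch) because x and y can wander far beyond floating-point range when there is no stationary probability distribution.

```python
import math
import random

def log_exp1(rng):
    """Log of a standard exponential draw (redrawing on an exact zero)."""
    e = rng.expovariate(1.0)
    while e <= 0.0:
        e = rng.expovariate(1.0)
    return math.log(e)

def visit_counts(n_steps=10**5, seed=1, x0=1.5):
    """Gibbs sampler for f(x,y) proportional to exp(-xy) on the positive quadrant.

    Returns the number of x's falling in (1,2) and in (1,3).
    """
    rng = random.Random(seed)
    log_x = math.log(x0)
    log2, log3 = math.log(2.0), math.log(3.0)
    in_12 = in_13 = 0
    for _ in range(n_steps):
        if 0.0 < log_x < log3:
            in_13 += 1
            if log_x < log2:
                in_12 += 1
        # y | x ~ Exp(rate x), i.e. y = E / x with E ~ Exp(1)
        log_y = log_exp1(rng) - log_x
        # x | y ~ Exp(rate y), i.e. x = E' / y with E' ~ Exp(1)
        log_x = log_exp1(rng) - log_y
    return in_12, in_13

if __name__ == "__main__":
    a, b = visit_counts()
    print(a / b, math.log(2.0) / math.log(3.0))
```

The ratio of the two counts is expected to approach log(2)/log(3) as the number of steps grows, although slowly, since visits to any bounded set become increasingly rare.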

With regard to the convergence issue, Gibbs for Kids points to Schervish and Carlin (1990), which came quite early considering that Gelfand and Smith published their initial paper the very same year, but which also adopts a functional approach to convergence, along the paper’s fixed point perspective, somehow complicating the matter. Later papers by Tierney (1994), Besag (1995), and Mengersen and Tweedie (1996) considerably simplified the answer, which is that irreducibility is a necessary and sufficient condition for convergence. (Incidentally, the reference list includes a technical report of mine on latent variable model MCMC implementation that never got published.)

## an improvable Rao–Blackwell improvement, inefficient maximum likelihood estimator, and unbiased generalized Bayes estimator

Posted in Books, Statistics, University life on February 2, 2018 by xi'an

In my quest (!) for examples of location problems with no UMVU estimator, I came across a neat paper by Tal Galili [of R Bloggers fame!] and Isaac Meilijson presenting somewhat paradoxical properties of classical estimators in the case of a Uniform U((1-k)θ,(1+k)θ) distribution when 0<k<1 is known. For this model, the minimal sufficient statistic is the pair made of the smallest and of the largest observations, L and U. Since this pair is not complete, the Rao-Blackwell theorem does not produce a single and hence optimal estimator. The best linear unbiased combination [in terms of its variance] of L and U is derived in this paper, although this does not produce the uniformly minimum variance unbiased estimator, which does not exist in this case. (And I do not understand the remark that

“Any unbiased estimator that is a function of the minimal sufficient statistic is its own Rao–Blackwell improvement.”

as this hints at an infinite sequence of improvements.) While the MLE is inefficient in this setting, the Pitman [best equivariant] estimator is both Bayes [against the scale Haar measure] and unbiased, while experimentally dominating the above linear combination. The authors also argue that, since “generalized Bayes rules need not be admissible”, there is no guarantee that the Pitman estimator is admissible (under squared error loss). But given that this is a uni-dimensional scale estimation problem I doubt very much there is a Stein effect occurring in this case.
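
The lack of completeness of (L,U) can be illustrated by simulation. The sketch below does not reproduce the best linear unbiased combination of the paper. It only compares two simple unbiased functions of (L,U) chosen for this illustration, the midrange (L+U)/2 and a rescaled minimum L/c, where c=(1-k)+2k/(n+1) follows from the expectation of the minimum of n uniforms. All names and default values are hypothetical.

```python
import random

def simulate(theta=1.0, k=0.5, n=10, reps=2000, seed=0):
    """Monte Carlo means of two unbiased estimators of theta based on (L, U)."""
    rng = random.Random(seed)
    c_low = (1 - k) + 2 * k / (n + 1)   # E[L] = c_low * theta
    sum_mid = sum_low = 0.0
    for _ in range(reps):
        xs = [rng.uniform((1 - k) * theta, (1 + k) * theta) for _ in range(n)]
        low, up = min(xs), max(xs)
        sum_mid += (low + up) / 2        # midrange
        sum_low += low / c_low           # rescaled minimum
    return sum_mid / reps, sum_low / reps

if __name__ == "__main__":
    print(simulate())   # both averages should be close to theta = 1
```

Both averages settle near θ, so the difference between the two estimators is a non-zero function of (L,U) with zero expectation, which is exactly what completeness would forbid.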

## foundations of probability

Posted in Books, Statistics on December 1, 2017 by xi'an

Following my reading of a note by Gunnar Taraldsen and co-authors on improper priors, I checked the 1970 book of Rényi from the Library at Warwick. (First time I visited this library, where I got very efficient help in finding and borrowing this book!)

“…estimates of probability of an event made by different persons may be different and each such estimate is to a certain extent subjective.” (p.33)

The main argument from Rényi used by the above mentioned note (and an earlier paper in The American Statistician) is that “every probability is in reality a conditional probability” (p.34). Which may be a pleonasm as everything depends on the settings in which it is applied. And as such not particularly new since conditioning is also present in e.g. Jeffreys’ book. In this approach, the definition of the conditional probability is traditional, if restricted to conditioning on a subset of elements from the σ-algebra. The interesting part in the book is rather that a measure on this subset can be derived from the conditionals, that it can be extended to the whole σ-algebra, and that it is unique up to a multiplicative constant. Interesting because this indeed produces a rigorous way of handling improper priors.

“Let the random point (ξ,η) be uniformly distributed over the whole (x,y) plane.” (p.83)

Rényi also defines random variables ξ on conditional probability spaces, with conditional densities. With constraints on ξ for those to exist. I have more difficulty digesting this notion as I do not see the meaning of the above quote or of the quantity

P(a<ξ<b|c<ξ<d)

when P(a<ξ<b) is not defined. As for instance I see no way of generating such a ξ in this case. (Of course, it is always possible to bring in a new definition of random variables that only agrees with regular ones for finite measure.)
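
For what it is worth, here is a minimal sketch of how such a quantity can be read off a σ-finite measure μ, in line with the measure derived from the conditionals mentioned earlier. Taking μ to be the Lebesgue measure on the real line is an assumption of this illustration, not a statement about Rényi's text. For finite c<d and overlapping intervals,

$$P(a<\xi<b \mid c<\xi<d) = \frac{\mu\{(a,b)\cap(c,d)\}}{\mu\{(c,d)\}} = \frac{\min(b,d)-\max(a,c)}{d-c}$$

and the probability is zero when the intervals do not overlap. The ratio is unchanged when μ is multiplied by a constant and only requires the conditioning set to have finite positive measure, which is how it can exist while P(a<ξ<b) does not.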