## same risk, different estimators

Posted in Statistics with tags , , , , , on November 10, 2017 by xi'an

An interesting question on X validated reminded me of the epiphany I had some twenty years ago when reading a Annals of Statistics paper by Anirban Das Gupta and Bill Strawderman on shrinkage estimators, namely that some estimators shared the same risk function, meaning their integrated loss was the same for all values of the parameter. As indicated in this question, Stefan‘s instructor seems to believe that two estimators having the same risk function must be a.s. identical. Which is not true as exemplified by the James-Stein (1960) estimator with scale 2(p-2), which has constant risk p, just like the maximum likelihood estimator. I presume the confusion stemmed from the concept of completeness, where having a function with constant expectation under all values of the parameter implies that this function is constant. But, for loss functions, the concept does not apply since the loss depends both on the observation (that is complete in a Normal model) and on the parameter.

## on Dutch book arguments

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , , , on May 1, 2017 by xi'an

“Reality is not always probable, or likely.”― Jorge Luis Borges

As I am supposed to discuss Teddy Seidenfeld‘s talk at the Bayes, Fiducial and Frequentist conference in Harvard today [the snow happened last time!], I started last week [while driving to Wales] reading some related papers of his. Which is great as I had never managed to get through the Dutch book arguments, including those in Jim’s book.

The paper by Mark Schervish, Teddy Seidenfeld, and Jay Kadane is defining coherence as the inability to bet against the predictive statements based on the procedure. A definition that sounds like a self-fulfilling prophecy to me as it involves a probability measure over the parameter space. Furthermore, the notion of turning inference, which aims at scientific validation, into a leisure, no-added-value, and somewhat ethically dodgy like gambling, does not agree with my notion of a validation for a theory. That is, not as a compelling reason for adopting a Bayesian approach. Not that I have suddenly switched to the other [darker] side, but I do not feel those arguments helping in any way, because of this dodgy image associated with gambling. (Pardon my French, but each time I read about escrows, I think of escrocs, or crooks, which reinforces this image! Actually, this name derives from the Old French escroue, but the modern meaning of écroué is sent to jail, which brings us back to the same feeling…)

Furthermore, it sounds like both a weak notion, since it implies an almost sure loss for the bookmaker, plus coherency holds for any prior distribution, including Dirac masses!, and a frequentist one, in that it looks at all possible values of the parameter (in a statistical framework). It also turns errors into monetary losses, taking them at face value. Which sounds also very formal to me.

But the most fundamental problem I have with this approach is that, from a Bayesian perspective, it does not bring any evaluation or ranking of priors, and in particular does not help in selecting or eliminating some. By behaving like a minimax principle, it does not condition on the data and hence does not evaluate the predictive properties of the model in terms of the data, e.g. by comparing pseudo-data with real data.

While I see no reason to argue in favour of p-values or minimax decision rules, I am at a loss in understanding the examples in How to not gamble if you must. In the first case, i.e., when dismissing the α-level most powerful test in the simple vs. simple hypothesis testing case, the argument (in Example 4) starts from the classical (Neyman-Pearsonist) statistician favouring the 0.05-level test over others. Which sounds absurd, as this level corresponds to a given loss function, which cannot be compared with another loss function. Even though the authors chose to rephrase the dilemma in terms of a single 0-1 loss function and then turn the classical solution into the choice of an implicit variance-dependent prior. Plus force the poor Pearsonist to make a wager represented by the risk difference. The whole sequence of choices sounds both very convoluted and far away from the usual practice of a classical statistician… Similarly, when attacking [in Section 5.2] the minimax estimator in the Bernoulli case (for the corresponding proper prior depending on the sample size n), this minimax estimator is admissible under quadratic loss and still a Dutch book argument applies, which in my opinion definitely argues against the Dutch book reasoning. The way to produce such a domination result is to mix two Bernoulli estimation problems for two different sample sizes but the same parameter value, in which case there exist [other] choices of Beta priors and a convex combination of the risks functions that lead to this domination. But this example [Example 6] mostly exposes the artificial nature of the argument: when estimating the very same probability θ, what is the relevance of adding the risks or errors resulting from using two estimators for two different sample sizes. Of the very same probability θ. I insist on the very same because when instead estimating two [independent] values of θ, there cannot be a Stein effect for the Bernoulli probability estimation problem, that is, any aggregation of admissible estimators remains admissible. (And yes it definitely sounds like an exercise in frequentist decision theory!)

## contemporary issues in hypothesis testing

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , on May 3, 2016 by xi'an

Next Fall, on 15-16 September, I will take part in a CRiSM workshop on hypothesis testing. In our department in Warwick. The registration is now open [until Sept 2] with a moderate registration free of £40 and a call for posters. Jim Berger and Joris Mulder will both deliver a plenary talk there, while Andrew Gelman will alas give a remote talk from New York. (A terrific poster by the way!)

## Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , , on July 15, 2015 by xi'an

“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.

As a first entry, Nicolas and James point out that the MAP can be derived by standard a Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their gold standard. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.

In the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be  cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.

## Cancún, ISBA 2014 [day #0]

Posted in Statistics, Travel, University life with tags , , , , , , , , on July 17, 2014 by xi'an

Day zero at ISBA 2014! The relentless heat outside (making running an ordeal, even at 5:30am…) made the (air-conditioned) conference centre the more attractive. Jean-Michel Marin and I had a great morning teaching our ABC short course and we do hope the ABC class audience had one as well. Teaching in pair is much more enjoyable than single as we can interact with one another as well as the audience. And realising unsuspected difficulties with the material is much easier this way, as the (mostly) passive instructor can spot the class’ reactions. This reminded me of the course we taught together in Oulu, northern Finland, in 2004 and that ended as the Bayesian Core. We did not cover the entire material we have prepared for this short course, but I think the pace was the right one. (Just tell me otherwise if you were there!) This was also the only time I had given a course wearing sunglasses, thanks to yesterday’s incident!

Waiting for a Spanish speaking friend to kindly drive with me downtown Cancún to check whether or not an optician could make me new prescription glasses, I attended Jim Berger’s foundational lecture on frequentist properties of Bayesian procedures but could only listen as the slides were impossible for me to read, with or without glasses. The partial overlap with the Varanasi lecture helped. I alas had to skip both Gareth Roberts’ and Sylvia Früwirth-Schnatter’s lectures, apologies to both of them!, but the reward was to get a new pair of prescription glasses within a few hours. Perfectly suited to my vision! And to get back just in time to read slides during Peter Müller’s lecture from the back row! Thanks to my friend Sophie for her negotiating skills! Actually, I am still amazed at getting glasses that quickly, given the time it would have taken in, e.g., France. All set for another 15 years with the same pair?! Only if I do not go swimming with them in anything but a quiet swimming pool!

The starting dinner happened to coincide with the (second) ISBA Fellow Award ceremony. Jim acted as the grand master of ceremony and he did great to add life and side stories to the written nominations for each and everyone of the new Fellows. The Fellowships honoured Bayesian statisticians who had contributed to the field as researchers and to the society since its creation. I thus feel very honoured (and absolutely undeserving) to be included in this prestigious list, along with many friends.  (But would have loved to see two more former ISBA presidents included, esp. for their massive contribution to Bayesian theory and methodology…) And also glad to wear regular glasses instead of my morning sunglasses.

[My Internet connection during the meeting being abysmally poor, the posts will appear with some major delay! In particular, I cannot include new pictures at times I get a connection… Hence a picture of northern Finland instead of Cancún at the top of this post!]

## reading classics (#10 and #10bis)

Posted in Books, Statistics, University life with tags , , , , , , , , , on February 28, 2013 by xi'an

Today’s classics seminar was rather special as two students were scheduled to talk. It was even more special as both students had picked (without informing me) the very same article by Berger and Sellke (1987), Testing a point-null hypothesis: the irreconcilability of p-values and evidence, on the (deep?) discrepancies between frequentist p-values and Bayesian posterior probabilities. In connection with the Lindley-Jeffreys paradox. Here are Amira Mziou’s slides:

and Jiahuan Li’s slides:

for comparison.

It was a good exercise to listen to both talks, seeing two perspectives on the same paper, and I hope the students in the class got the idea(s) behind the paper. As you can see, there were obviously repetitions between the talks, including the presentation of the lower bounds for all classes considered by Jim Berger and Tom Sellke, and the overall motivation for the comparison. Maybe as a consequence of my criticisms on the previous talk, both Amira and Jiahuan put some stress on the definitions to formally define the background of the paper. (I love the poetic line: “To prevent having a non-Bayesian reality”, although I am not sure what Amira meant by this…)

I like the connection made therein with the Lindley-Jeffreys paradox since this is the core idea behind the paper. And because I am currently writing a note about the paradox. Obviously, it was hard for the students to take a more remote stand on the reason for the comparison, from questioning .the relevance of testing point null hypotheses and of comparing the numerical values of a p-value with a posterior probability, to expecting asymptotic agreement between a p-value and a Bayes factor when both are convergent quantities, to setting the same weight on both hypotheses, to the ad-hocquery of using a drift on one to equate the p-value with the Bayes factor, to use specific priors like Jeffreys’s (which has the nice feature that it corresponds to g=n in the g-prior,  as discussed in the new edition of Bayesian Core). The students also failed to remark on the fact that the developments were only for real parameters, as the phenomenon (that the lower bound on the posterior probabilities is larger than the p-value) does not happen so universally in larger dimensions.  I would have expected more discussion from the ground, but we still got good questions and comments on a) why 0.05 matters and b) why comparing  p-values and posterior probabilities is relevant. The next paper to be discussed will be Tukey’s piece on the future of statistics.

## That the likelihood principle does not hold…

Posted in Statistics, University life with tags , , , , , , , , , , on October 6, 2011 by xi'an

Coming to Section III in Chapter Seven of Error and Inference, written by Deborah Mayo, I discovered that she considers that the likelihood principle does not hold (at least as a logical consequence of the combination of the sufficiency and of the conditionality principles), thus that  Allan Birnbaum was wrong…. As well as the dozens of people working on the likelihood principle after him! Including Jim Berger and Robert Wolpert [whose book sells for \$214 on amazon!, I hope the authors get a hefty chunk of that ripper!!! Esp. when it is available for free on project Euclid…] I had not heard of  (nor seen) this argument previously, even though it has apparently created enough of a bit of a stir around the likelihood principle page on Wikipedia. It does not seem the result is published anywhere but in the book, and I doubt it would get past a review process in a statistics journal. [Judging from a serious conversation in Zürich this morning, I may however be wrong!]

The core of Birnbaum’s proof is relatively simple: given two experiments and about the same parameter θ with different sampling distributions and , such that there exists a pair of outcomes (y¹,y²) from those experiments with proportional likelihoods, i.e. as a function of θ

$f^1(y^1|\theta) = c f^2(y^2|\theta),$

one considers the mixture experiment E⁰ where  and are each chosen with probability ½. Then it is possible to build a sufficient statistic T that is equal to the data (j,x), except when j=2 and x=y², in which case T(j,x)=(1,y¹). This statistic is sufficient since the distribution of (j,x) given T(j,x) is either a Dirac mass or a distribution on {(1,y¹),(2,y²)} that only depends on c. Thus it does not depend on the parameter θ. According to the weak conditionality principle, statistical evidence, meaning the whole range of inferences possible on θ and being denoted by Ev(E,z), should satisfy

$Ev(E^0, (j,x)) = Ev(E^j,x)$

Because the sufficiency principle states that

$Ev(E^0, (j,x)) = Ev(E^0,T(j,x))$

this leads to the likelihood principle

$Ev(E^1,y^1)=Ev(E^0, (j,y^j)) = Ev(E^2,y^2)$

(See, e.g., The Bayesian Choice, pp. 18-29.) Now, Mayo argues this is wrong because

“The inference from the outcome (Ej,yj) computed using the sampling distribution of [the mixed experiment] E⁰ is appropriately identified with an inference from outcome yj based on the sampling distribution of Ej, which is clearly false.” (p.310)

This sounds to me like a direct rejection of the conditionality principle, so I do not understand the point. (A formal rendering in Section 5 using the logic formalism of A’s and Not-A’s reinforces my feeling that the conditionality principle is the one criticised and misunderstood.) If Mayo’s frequentist stance leads her to take the sampling distribution into account at all times, this is fine within her framework. But I do not see how this argument contributes to invalidate Birnbaum’s proof. The following and last sentence of the argument may bring some light on the reason why Mayo considers it does:

“The sampling distribution to arrive at Ev(E⁰,(j,yj)) would be the convex combination averaged over the two ways that yj could have occurred. This differs from the  sampling distributions of both Ev(E1,y1) and Ev(E2,y2).” (p.310)

Indeed, and rather obviously, the sampling distribution of the evidence Ev(E*,z*) will differ depending on the experiment. But this is not what is stated by the likelihood principle, which is that the inference itself should be the same for and . Not the distribution of this inference. This confusion between inference and its assessment is reproduced in the “Explicit Counterexample” section, where p-values are computed and found to differ for various conditional versions of a mixed experiment. Again, not a reason for invalidating the likelihood principle. So, in the end, I remain fully unconvinced by this demonstration that Birnbaum was wrong. (If in a bystander’s agreement with the fact that frequentist inference can be built conditional on ancillary statistics.)