## a paradox in decision-theoretic interval estimation (solved)

Posted in pictures, Statistics, Travel, University life on October 4, 2012 by xi'an

In 1993, we wrote a paper [with George Casella and Gene/Juinn Hwang] on the paradoxical consequences of using the loss function

$\text{length}(C) - k \mathbb{I}_C(\theta)$

(published in Statistica Sinica, 3, 141-155) since it led to the following property: for the standard normal mean estimation problem, the regular confidence interval is dominated by the modified confidence interval equal to the empty set when the variance estimate is too large… This was first pointed out by Jim Berger and the most natural culprit is the artificial loss function, where the first part is unbounded while the second part is bounded by k. Recently, Paul Kabaila (whom I met in both Adelaide, where he quite appropriately commented about the abnormal talk at the conference!, and Melbourne, where we met with his students after my seminar at the University of Melbourne) published a paper (first on arXiv then in Statistics and Probability Letters) where he demonstrates that the mere modification of the above loss into

$\dfrac{\text{length}(C)}{\sigma} - k \mathbb{I}_C(\theta)$

solves the paradox! For Jeffreys’ non-informative prior, the Bayes (optimal) estimate is the regular confidence interval. Besides doing the trick, this nice resolution explains the earlier paradox as being linked to a lack of invariance in the (earlier) loss function. This is somewhat satisfactory since Jeffreys’ prior is also the invariant prior in this case.
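To see the role of the scale numerically, here is a minimal Monte Carlo sketch, not the computation of either paper. It assumes the variance is unknown (as the rescaling by σ suggests), a sample of size n = 10, the usual t-interval with half-width 2.262 s/√n, and k = 5; the function name and all these values are mine. The empty set has loss zero under both losses, so a positive risk means the interval does worse than reporting nothing.

```python
import numpy as np

def t_interval_risk(sigma, k, scaled, n=10, c=2.262, reps=20_000, seed=0):
    """Monte Carlo risk of the interval xbar +/- c*s/sqrt(n) when theta = 0.

    scaled=False uses length(C) - k*I_C(theta);
    scaled=True  uses length(C)/sigma - k*I_C(theta).
    c = 2.262 is the 97.5% quantile of a t distribution with 9 df (n = 10).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma, size=(reps, n))
    xbar = x.mean(axis=1)
    s = x.std(axis=1, ddof=1)            # sample standard deviation
    half = c * s / np.sqrt(n)            # half-width of the interval
    length = 2.0 * half
    covered = np.abs(xbar) <= half       # does the interval contain theta = 0?
    scale = sigma if scaled else 1.0
    return float(np.mean(length / scale - k * covered))

k = 5.0
for sigma in (1.0, 10.0, 100.0):
    print(sigma,
          t_interval_risk(sigma, k, scaled=False),
          t_interval_risk(sigma, k, scaled=True))
```

Under the original loss the risk grows linearly with σ and eventually exceeds zero, while under the rescaled loss it does not depend on σ at all, which is the invariance the resolution relies on.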

## Confidence distributions

Posted in Books, Statistics, Travel, University life on June 11, 2012 by xi'an

I was asked by the International Statistical Review editor, Marc Hallin, for a discussion of the paper “Confidence distribution, the frequentist distribution estimator of a parameter: a review” by Min-ge Xie and Kesar Singh, both from Rutgers University. Although the paper is not available on-line, similar and recent reviews and articles can be found, in a 2007 IMS Monograph and a 2012 JASA paper both with Bill Strawderman, as well as a chapter in the recent Festschrift for Bill Strawderman. The notion of confidence distribution is quite similar to the one of fiducial distribution, introduced by R.A. Fisher, and they both share in my opinion the same drawback, namely that they aim at a distribution over the parameter space without specifying (at least explicitly) a prior distribution. Furthermore, the way the confidence distribution is defined perpetuates the on-going confusion between confidence and credible intervals, in that the cdf on the parameter θ is derived via the inversion of a confidence upper bound (or, equivalently, of a p-value…). Even though this inversion properly defines a cdf on the parameter space, there is no particular validity in the derivation. Either the confidence distribution corresponds to a genuine posterior distribution, in which case I think the only possible interpretation is a Bayesian one. Or the confidence distribution does not correspond to a genuine posterior distribution, because no prior can lead to this distribution, in which case there is a probabilistic impossibility in using this distribution. As a result, my discussion (now posted on arXiv) is rather negative about the benefits of this notion of confidence distribution.
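As a textbook illustration of this inversion (not taken from the review; the notation $\bar{x}$, $n$, $\sigma$, $z_\alpha$ and $H$ is mine), take a normal sample of size $n$ with known $\sigma$. The level-$\alpha$ confidence upper bound is

$\theta_\alpha(x) = \bar{x} + z_\alpha\,\dfrac{\sigma}{\sqrt{n}}, \qquad z_\alpha = \Phi^{-1}(\alpha),$

and solving $\theta_\alpha(x) = \theta$ in $\alpha$ gives the cdf

$H(\theta) = \Phi\left(\dfrac{\sqrt{n}\,(\theta-\bar{x})}{\sigma}\right),$

that is, a $\mathcal{N}(\bar{x},\sigma^2/n)$ distribution on θ. In this simple case it coincides with the posterior under the flat prior, so it falls in the first of the two cases distinguished above.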

One entry in the review, albeit peripheral, attracted my attention. The authors mention a tech’ report where they exhibit a paradoxical behaviour of a Bayesian procedure: given a (skewed) prior on a pair (p0,p1), and a binomial likelihood, the posterior distribution on p1-p0 has its main mass in the tails of both the prior and the likelihood (“the marginal posterior of d = p1-p0 is more extreme than its prior and data evidence!”). The information provided in the paper is rather sparse on the genuine experiment and looking at two possible priors exhibited nothing of the kind… I went to the authors’ webpages and found a more precise explanation on Min-ge Xie’s page:

Although the contour plot of the posterior distribution sits between those of the prior distribution and the likelihood function, its projected peak is more extreme than the other two. Further examination suggests that this phenomenon is genuine in binomial clinical trials and it would not go away even if we adopt other (skewed) priors (for example, the independent beta priors used in Joseph et al. (1997)). In fact, as long as the center of a posterior distribution is not on the line joining the two centers of the joint prior and likelihood function (as it is often the case with skewed distributions), there exists a direction along which the marginal posterior fails to fall between the prior and likelihood function of the same parameter.

and a link to another paper. Reading through the paper (and in particular Section 4), it appears that the above “paradoxical” picture is the result of the projections of the joint distributions represented in this second picture. By projection, I presume the authors mean integrating out the second component, e.g. p1+p0. This indeed provides the marginal prior of p1-p0, the marginal posterior of p1-p0, but…not the marginal likelihood of p1-p0! This entity is not defined, once again because there is no reference measure on the parameter space which could justify integrating out some parameters in the likelihood. (Overall, I do not think the “paradox” is overwhelming: the joint posterior distribution does precisely the merging of prior and data information we would expect and it is not like the marginal posterior is located in zones with zero prior probability and zero (profile) likelihood. I am also always wary of arguments based on modes, since those are highly dependent on parameterisation.)

Most unfortunately, when searching for more information on the authors’ webpages, I came upon the sad news that Professor Singh had passed away three weeks ago, at the age of 56. (Professor Xie wrote a touching eulogy of his friend and co-author.) I had only met briefly with Professor Singh during my visit to Rutgers two months ago, but he sounded like an academic who would have enjoyed the kind of debate drafted by my discussion. To the much more important loss to family, friends and faculty represented by Professor Singh’s demise, I thus add the loss of missing the intellectual challenge of crossing arguments with him. And I look forward to discussing the issues with the first author of the paper, Professor Xie.

## loss functions for credible regions

Posted in Statistics, University life on March 15, 2012 by xi'an

When Éric Marchand came to give a talk last week, we discussed minimality and Bayesian estimation for confidence/credible regions. In the early 1990s, George Casella and I wrote a paper in this direction, entitled “Distance weighted losses for testing and confidence set evaluation” and published in TEST. It was restricted to the univariate case but one could consider evaluating α-level confidence regions with a loss function like

$L(\theta,C) = \left(\theta-\text{proj}_C(\theta)\right)^2$

where the projection of the parameter over C is the element in C that is closest to the parameter. As in the original paper, this loss function penalises how far the parameter is from the region, compared with the rudimentary 0-1 loss function, which penalises all misses the same way. The posterior loss is not straightforward to minimise, though, unless one considers an approximation based on a sample from the posterior, picks the (1-α)-fraction that gives the smallest sum of distances to the remaining α-fraction, and then takes a convexification of the α-fraction. This is not particularly “clean” and I would prefer to find an HPD-like region, i.e. an HPD linked to a modified prior… But this may require another loss function than the one above. Incidentally, I was also playing with an alternative loss function that would avoid setting the level α. Namely
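One possible reading of the sample-based approximation, in the univariate case, is sketched here. It is an assumption of mine that the retained fraction is a window of consecutive order statistics and that the resulting region is the interval they span; the function name, the gamma stand-in for a posterior sample, and α = 0.1 are likewise hypothetical.

```python
import numpy as np

def interval_from_sample(sample, alpha):
    """Interval holding a (1-alpha)-fraction of the sample and minimising the
    sum of squared distances from the excluded points to the interval."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    m = int(np.ceil((1 - alpha) * n))     # number of points kept inside
    best_cost, best = np.inf, None
    for i in range(n - m + 1):
        a, b = x[i], x[i + m - 1]         # window of m consecutive order statistics
        # excluded points lie below a or above b; their projection is a or b
        cost = np.sum((a - x[:i]) ** 2) + np.sum((x[i + m:] - b) ** 2)
        if cost < best_cost:
            best_cost, best = cost, (a, b)
    return best

rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=1.0, size=2000)   # stand-in for posterior draws
a, b = interval_from_sample(sample, alpha=0.1)
print(a, b, np.mean((sample >= a) & (sample <= b)))
```

With a skewed sample the window is pulled towards the long tail, since far-away misses cost more than near ones, which is where this loss departs from the 0-1 loss.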

$L(\theta,C) = \left(\theta-\text{proj}_C(\theta)\right)^2 + \tau\, \text{diam}(C)^2,$

which simultaneously penalises non-coverage and size. However, the choice of τ makes the function difficult to motivate in a realistic setting.
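To get a feel for the influence of τ, here is a rough sketch in the univariate case where C is an interval [a, b]: it minimises a Monte Carlo estimate of the posterior expected loss over a grid of sample quantiles. The grid search, the function names, and the standard normal stand-in for the posterior are my assumptions, not a recommended procedure.

```python
import numpy as np

def projection_loss(theta, a, b):
    """Squared distance from theta to its projection on the interval [a, b]."""
    proj = np.clip(theta, a, b)
    return (theta - proj) ** 2

def expected_loss(sample, a, b, tau):
    """Monte Carlo posterior expected loss, with diam(C) = b - a."""
    return projection_loss(sample, a, b).mean() + tau * (b - a) ** 2

def best_interval(sample, tau, grid_size=81):
    """Grid search over pairs of sample quantiles used as candidate endpoints."""
    grid = np.quantile(sample, np.linspace(0.0, 1.0, grid_size))
    best = (np.inf, None, None)
    for i, a in enumerate(grid):
        for b in grid[i:]:
            value = expected_loss(sample, a, b, tau)
            if value < best[0]:
                best = (value, a, b)
    return best

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, size=5000)   # stand-in for posterior draws
for tau in (0.01, 0.1, 1.0):
    value, a, b = best_interval(sample, tau)
    coverage = np.mean((sample >= a) & (sample <= b))
    print(tau, a, b, coverage)
```

The implied coverage moves a lot with τ, which illustrates why the loss is hard to calibrate: choosing τ amounts to choosing a level.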

## Bayes posterior just quick and dirty on X’idated

Posted in Statistics, Travel, University life on February 22, 2012 by xi'an

As a coincidence, I noticed that Don Fraser’s recent discussion paper “Is Bayes posterior just quick and dirty confidence?” will be discussed this Friday (18:00 UTC) on the Cross Validated Journal Club. I do not know whether or not to interpret the information “The author confirmed his presence at the event” as meaning Don Fraser will be on line to discuss his paper with X’ed members. Feel free to join anyway if you have 20 reputation points or plan to get those by Friday! (I will be in the train coming back from Oxford. Oxford, England, not Mississippi!)

## Bayesian inference and the parametric bootstrap

Posted in R, Statistics, University life on December 16, 2011 by xi'an

This paper by Brad Efron came to my knowledge when I was looking for references on Bayesian bootstrap to answer a Cross Validated question. After reading it more thoroughly, “Bayesian inference and the parametric bootstrap” puzzles me, which most certainly means I have missed the main point. Indeed, the paper relies on parametric bootstrap (a frequentist approximation technique mostly based on simulation from a plug-in distribution and a robust inferential method estimating distributions from empirical cdfs) to assess (frequentist) coverage properties of Bayesian posteriors. The manuscript mixes a parametric bootstrap simulation output for posterior inference (even though bootstrap produces simulations of estimators while the posterior distribution operates on the parameter space, those estimator simulations can nonetheless be recycled as parameter simulations by a genuine importance sampling argument) and the coverage properties of Jeffreys posteriors vs. the BCa [which stands for bias-corrected and accelerated, see Efron 1987] confidence density, which truly take place in different spaces. Efron however connects both spaces by taking advantage of the importance sampling connection and defines a corrected BCa prior to make the confidence intervals match. In my opinion, though, this does not define a prior in the Bayesian sense, since the correction seems to depend on the data. And I see no strong incentive to match the frequentist coverage, because this would furthermore define a new prior for each component of the parameter. This study about the frequentist properties of Bayesian credible intervals reminded me of the recent discussion paper by Don Fraser on the topic, which follows the same argument that Bayesian credible regions are not necessarily good frequentist confidence intervals.

The conclusion of the paper is made of several points, some of which may not be strongly supported by the previous analysis:

1. “The parametric bootstrap distribution is a favorable starting point for importance sampling computation of Bayes posterior distributions.” [I am not so certain about this point given that the bootstrap is based on a plug-in estimate, hence fails to account for the variability of this estimate, and may thus induce infinite variance behaviour, as in the harmonic mean estimator of Newton and Raftery (1994). Because the tails of the importance density are those of the likelihood, the heavier tails of the posterior induced by the convolution with the prior distribution are likely to lead to this fatal misbehaviour of the importance sampling estimator.]
2. “This computation is implemented by reweighting the bootstrap replications rather than by drawing observations directly from the posterior distribution as with MCMC.” [Computing the importance ratio requires the availability both of the likelihood function and of the likelihood estimator, which means a setting where Bayesian computations are not particularly hindered and do not necessarily call for advanced MCMC schemes.]
3. “The necessary weights are easily computed in exponential families for any prior, but are particularly simple starting from Jeffreys invariant prior, in which case they depend only on the deviance difference.” [Always from a computational perspective, the ease of computing the importance weights is mirrored by the ease in handling the posterior distributions.]
4. “The deviance difference depends asymptotically on the skewness of the family, having a cubic normal form.” [No relevant comment.]
5. “In our examples, Jeffreys prior yielded posterior distributions not much different than the unweighted bootstrap distribution. This may be unsatisfactory for single parameters of interest in multi-parameter families.” [The frequentist confidence properties of Jeffreys priors have already been examined in the past and found to be lacking in multidimensional settings. This is an assessment finding Jeffreys priors lacking from a frequentist perspective. However, the use of Jeffreys prior is not justified on this particular ground.]
6. “Better uninformative priors, such as the Welch and Peers family or reference priors, are closely related to the frequentist BCa reweighting formula.” [The paper only finds proximities in two examples, but it does not assess this relation in a wider generality. Again, this is not particularly relevant from a Bayesian viewpoint.]
7. “Because of the i.i.d. nature of bootstrap resampling, simple formulas exist for the accuracy of posterior computations as a function of the number B of bootstrap replications. Even with excessive choices of B, computation time was measured in seconds for our examples.” [This is not very surprising. It however assesses Bayesian procedures from a frequentist viewpoint, so this may be lost on both Bayesian and frequentist users…]
8. “An efficient second-level bootstrap algorithm (“bootstrap-after-bootstrap”) provides estimates for the frequentist accuracy of Bayesian inferences.” [This is completely correct and why bootstrap is such an appealing technique for frequentist inference. I spent the past two weeks teaching non-parametric bootstrap to my R class and the students are now fluent with the concept, even though they are unsure about the meaning of estimation and testing!]
9. “This can be important in assessing inferences based on formulaic priors, such as those of Jeffreys, rather than on genuine prior experience.” [Again, this is neither very surprising nor particularly appealing to Bayesian users.]
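The reweighting idea behind points 1 and 2 can be sketched in a toy setting; this is generic self-normalised importance sampling, not Efron's exponential-family formulas. I assume a normal mean with known standard error, a N(0, prior_sd²) prior, and illustrative values for `xbar`, `se` and `prior_sd`, so that the exact posterior mean is available as a check.

```python
import numpy as np

def normal_logpdf(x, mean, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((x - mean) / sd) ** 2

def bootstrap_importance_posterior_mean(xbar, se, prior_sd, B=200_000, seed=0):
    """Reweight parametric bootstrap replications into a posterior sample."""
    rng = np.random.default_rng(seed)
    # parametric bootstrap: replications of the estimator from the plug-in model
    theta_star = rng.normal(xbar, se, size=B)
    # importance ratio: prior times likelihood over bootstrap (proposal) density
    log_prior = normal_logpdf(theta_star, 0.0, prior_sd)
    log_lik = normal_logpdf(xbar, theta_star, se)
    log_prop = normal_logpdf(theta_star, xbar, se)
    log_w = log_prior + log_lik - log_prop
    w = np.exp(log_w - log_w.max())       # stabilised before normalisation
    w /= w.sum()
    ess = 1.0 / np.sum(w**2)              # effective sample size
    return float(np.sum(w * theta_star)), float(ess)

xbar, se, prior_sd = 1.0, 0.5, 2.0
est, ess = bootstrap_importance_posterior_mean(xbar, se, prior_sd)
exact = xbar * prior_sd**2 / (prior_sd**2 + se**2)   # conjugate posterior mean
print(est, exact, ess)
```

Everything is well-behaved here because the prior is light-tailed relative to the proposal; the effective sample size is the quantity to watch when the posterior has heavier tails than the bootstrap distribution, which is the concern raised under point 1.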

In conclusion, I found the paper quite thought-provoking and stimulating, definitely opening new vistas in a very elegant way. I however remain unconvinced by the simulation aspects from a purely Monte Carlo perspective.

## Don Fraser’s rejoinder

Posted in Books, Statistics, University life on August 24, 2011 by xi'an

“How can a discipline, central to science and to critical thinking, have two methodologies, two logics, two approaches that frequently give substantially different answers to the same problems. Any astute person from outside would say “Why don’t they put their house in order?”” Don Fraser

Following the discussions of his Statistical Science paper Is Bayes posterior just quick and dirty confidence?, by Kesar Singh and Min-ge Xie, Larry Wasserman (who coined the neologism Frasian for the occasion), Tong Zhang, and myself, Don Fraser has written his rejoinder to the discussion (although in Biometrika style it is for Statistical Science!). His conclusion that “no one argued that the use of the conditional probability lemma with an imaginary input had powers beyond confidence, supernatural powers” is difficult to escape, as I would not dream of promoting a super-Bayes jumping to the rescue of bystanders misled by evil frequentists!!! More seriously, this rejoinder makes me reflect on lectures from the past years, from those on the diverse notions of probability (Jeffreys, Keynes, von Mises, and Burdzy) to those on scientific discovery (mostly Seber’s, and the promising Error and Inference by Mayo and Spanos I just received).