## Is Jeffreys’ prior unique?

Posted in Books, Statistics, University life on March 3, 2015 by xi'an

“A striking characterisation showing the central importance of Fisher’s information in a differential framework is due to Cencov (1972), who shows that it is the only invariant Riemannian metric under symmetry conditions.” N. Polson, PhD Thesis, University of Nottingham, 1988

Following a discussion on Cross Validated, I wonder whether or not the assertion that Jeffreys’ prior is the only prior construction rule that remains invariant under arbitrary (if smooth enough) reparameterisation is correct. In the discussion, Paulo Marques mentioned Nikolaj Nikolaevič Čencov’s book, Statistical Decision Rules and Optimal Inference, a Russian book from 1972, of which I had not heard previously and which seems too theoretical [from Paulo’s comments] to explain why this rule would be the sole one. As I kept looking for Čencov’s references on the Web, I found Nick Polson’s thesis and the above quote. So maybe Nick could tell us more!

However, my uncertainty about the uniqueness of Jeffreys’ rule stems from the fact that, if I decide on a favourite or reference parametrisation (as Jeffreys indirectly does when selecting the parametrisation associated with a constant Fisher information) and on a prior derivation from the sampling distribution for this parametrisation, I have derived a parametrisation-invariant principle. Possibly silly and uninteresting from a Bayesian viewpoint, but nonetheless invariant.
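
For reference, the invariance of Jeffreys’ rule itself boils down to a change of variables. Here is a sketch for a scalar parameter θ with Fisher information I(θ) and a smooth one-to-one reparameterisation η=h(θ) (the symbols η and h are notation introduced here, not taken from the discussion):

$I_\eta(\eta)=I_\theta(\theta)\left(\dfrac{\text{d}\theta}{\text{d}\eta}\right)^2\quad\Longrightarrow\quad\sqrt{I_\eta(\eta)}=\sqrt{I_\theta(\theta)}\,\left|\dfrac{\text{d}\theta}{\text{d}\eta}\right|$

which is exactly the Jacobian rule for transforming a density. The same computation goes through for any rule that fixes a reference parametrisation φ=φ(θ), picks a prior π₀ on φ, and defines the prior on any other parametrisation by the Jacobian transform

$\pi_\theta(\theta)=\pi_0(\varphi(\theta))\,\left|\dfrac{\text{d}\varphi}{\text{d}\theta}\right|$

which is why invariance alone does not seem enough to single out Jeffreys’ prior.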

## aperiodic Gibbs sampler

Posted in Books, Kids, pictures, Statistics, Travel, University life on February 11, 2015 by xi'an

A question on Cross Validated led me to realise I had never truly considered the issue of periodic Gibbs samplers! In MCMC, periodic chains are a minor nuisance in that the skeleton trick of randomly subsampling the Markov chain leads to an aperiodic Markov chain. (The picture relates to the skeleton!) Intuitively, while the systematic Gibbs sampler has a tendency to non-reversibility, it seems difficult to imagine a sequence of full conditionals that would force the chain away from the current value.

In the discrete case, given that the current state of the Markov chain has positive probability under the target distribution, the conditional probabilities are all positive as well and hence the Markov chain can stay at its current value after one Gibbs cycle with positive probability, which means strong aperiodicity. In the continuous case, a similar argument applies by considering a neighbourhood of the current value. (Incidentally, the same person asked a question about the absolute continuity of the Gibbs kernel, being confused by our chapter on the topic!)
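
The discrete argument can be checked on a toy example. The sketch below builds the transition matrix of a systematic-scan Gibbs sampler (update x given y, then y given x) for a made-up joint distribution on a 2×2 grid and looks at its diagonal. The target `p` and all object names are illustrative assumptions.

```r
# Hypothetical joint target on {1,2} x {1,2}: p[i, j] = P(X = i, Y = j)
p <- matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2)

# Enumerate the four states (x, y) of the chain
states <- expand.grid(x = 1:2, y = 1:2)
K <- matrix(0, nrow = 4, ncol = 4)

for (a in 1:4) {
  for (b in 1:4) {
    y0 <- states$y[a]                    # current y (current x is overwritten)
    x1 <- states$x[b]; y1 <- states$y[b] # proposed next state
    # one Gibbs cycle: draw x1 from p(x | y0), then y1 from p(y | x1)
    K[a, b] <- p[x1, y0] / sum(p[, y0]) * p[x1, y1] / sum(p[x1, ])
  }
}

# Each row of K is a probability distribution
stopifnot(all(abs(rowSums(K) - 1) < 1e-12))

# Probability of staying put after one full cycle, for each state
print(diag(K))
```

Every diagonal entry is a product of two conditional probabilities of the current state, hence positive whenever the current state has positive probability under the target.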

## minimaxity of a Bayes estimator

Posted in Books, Kids, Statistics, University life on February 2, 2015 by xi'an

Today, while in Warwick, I spotted on Cross Validated a question involving “minimax” in the title and hence could not help but look at it! The way I first understood the question (and immediately replied to it) was to check whether or not the standard Normal average—reduced to the single Normal observation by sufficiency considerations—is a minimax estimator of the normal mean under an interval zero-one loss defined by

$\mathcal{L}(\mu,\hat{\mu})=\mathbb{I}_{|\mu-\hat\mu|>L}=\begin{cases}1 &\text{if }|\mu-\hat\mu|>L\\ 0&\text{if }|\mu-\hat{\mu}|\le L\\ \end{cases}$

where L is a positive tolerance bound. I had not seen this problem before, even though it sounds quite standard. In this setting, the identity estimator, i.e., the normal observation x, is indeed minimax as (a) it is a generalised Bayes estimator for this loss function under the constant prior (Bayes estimators under this loss are given by the centre of the interval of length 2L with the highest posterior probability) and (b) it can be shown to be a limit of proper Bayes estimators and its Bayes risk is also the limit of the corresponding Bayes risks. (This is a most traditional way of establishing minimaxity for a generalised Bayes estimator.) However, this was not the question asked on the forum, as the book by Zacks it referred to stated that the standard Normal average maximised the minimal coverage, which amounts to minimising the maximal risk under the above loss, albeit with a strange inversion of parameter and estimator in the minimax risk:

$\sup_\mu\inf_{\hat\mu} R(\mu,\hat{\mu})\text{ instead of } \inf_{\hat\mu}\sup_\mu R(\mu,\hat{\mu})$

which makes the first bound equal to 0 by equating estimator and mean μ. Note however that I cannot access the whole book and hence may miss some restriction or other subtlety that would account for this unusual definition. (As an aside, note that Cross Validated has a protection against serial upvoting, so voting up or down at once a large chunk of my answers on that site does not impact my “reputation”!)
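
Returning to the minimaxity of the identity estimator, here is a small numerical sketch of the limiting argument. It assumes a specific sequence of proper priors, μ ~ N(0,τ²), which is my choice for illustration rather than something imposed by the question. Under this prior the posterior is normal with variance τ²/(1+τ²), the Bayes estimator is the posterior mean, and its posterior loss does not depend on x.

```r
# Tolerance bound in the interval zero-one loss (value chosen for illustration)
L <- 1

# Risk of the identity estimator for x ~ N(mu, 1): P(|x - mu| > L), free of mu
risk_identity <- 2 * pnorm(-L)

# Assumed proper priors mu ~ N(0, tau^2): the posterior standard deviation is
# tau / sqrt(1 + tau^2), so the Bayes risk is P(|Z| > L * sqrt(1 + tau^2) / tau)
tau <- c(1, 10, 100, 1000)
bayes_risk <- 2 * pnorm(-L * sqrt(1 + tau^2) / tau)

print(rbind(tau = tau, bayes_risk = bayes_risk))
print(risk_identity)
```

The Bayes risks increase with τ towards the constant risk of the identity estimator, which is the condition needed in step (b).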

## the density that did not exist…

Posted in Kids, R, Statistics, University life on January 27, 2015 by xi'an

On Cross Validated, I had a rather extended discussion with a user about a probability density

$f(x_1,x_2)=\left(\dfrac{x_1}{x_2}\right)\left(\dfrac{\alpha}{x_2}\right)^{x_1-1}\exp\left\{-\left(\dfrac{\alpha}{x_2}\right)^{x_1} \right\}\mathbb{I}_{\mathbb{R}^*_+}(x_1,x_2)$

as I thought it could be decomposed into two manageable conditionals and simulated by Gibbs sampling. The first component led to a Gumbel-like density

$g(y|x_2)\propto ye^{-y-e^{-y}} \quad\text{with}\quad y=\left(\alpha/x_2 \right)^{x_1}\stackrel{\text{def}}{=}\beta^{x_1}$

with y being restricted to either (0,1) or (1,∞) depending on β. The density is bounded and can be easily simulated by an accept-reject step. The second component leads to

$g(t|x_1)\propto \exp\{-\gamma ~ t \}~t^{-{1}/{x_1}} \quad\text{with}\quad t=\dfrac{1}{{x_2}^{x_1}}$

which offers the slight difficulty that it is not integrable when the first component is less than 1! So the above density does not exist (as a probability density).
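
The lack of integrability can be made explicit through the standard Gamma integral, with γ>0 as in the conditional above:

$\int_0^\infty e^{-\gamma t}\,t^{-1/x_1}\,\text{d}t=\begin{cases}\gamma^{1/x_1-1}\,\Gamma(1-1/x_1) &\text{if }x_1>1\\ +\infty &\text{if }x_1\le 1\end{cases}$

since the integrand behaves like $t^{-1/x_1}$ near zero, which is integrable there only when $1/x_1<1$.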

What I found interesting in this question was that, for once, the Gibbs sampler was the solution rather than the problem, i.e., that it pointed out the lack of integrability of the joint. (What I found less interesting was that the user did not acknowledge a lengthy discussion that we had previously about the Gibbs implementation and that he erased, that he lost interest in the question by not following up on my answer, a seemingly common feature of his, and that he provided neither source nor motivation for this zombie density.)

## simulation by inverse cdf

Posted in Books, Kids, R, Statistics, University life on January 14, 2015 by xi'an

Another Cross Validated forum question that led me to an interesting (?) reconsideration of certitudes! When simulating from a normal distribution, is the Box-Muller algorithm better or worse than using the inverse cdf transform? My first reaction was to state that Box-Muller was exact while the inverse cdf relied on the coding of the inverse cdf, like qnorm() in R. Upon reflection and comments by other members of the forum, like William Huber, I came to moderate this perspective since Box-Muller also relies on transcendental functions like sin and log, hence writing

$X=\sqrt{-2\log(U_1)}\,\sin(2\pi U_2)$

also involves approximations in the coding of those functions. While it is feasible to avoid the call to trigonometric functions (see, e.g., Algorithm A.8 in our book), the call to the logarithm seems inescapable. So it ends up with the issue of which of the two functions is better coded, both in terms of speed and precision. Surprisingly, when coding in R, the inverse cdf may be the winner: here is the comparison I ran at the time I wrote my comments:

> system.time(qnorm(runif(10^8)))
utilisateur     système      écoulé
10.137       0.120      10.251
> system.time(rnorm(10^8))
utilisateur     système      écoulé
13.417       0.060      13.472

However, rerunning it today, I get opposite results (pardon my French, I failed to turn the messages to English):

> system.time(qnorm(runif(10^8)))
utilisateur     système      écoulé
10.137       0.144      10.274
> system.time(rnorm(10^8))
utilisateur     système      écoulé
7.894       0.060       7.948

(There is coherence in the system time, which shows rnorm as twice as fast as the call to qnorm.) In terms of precision, I could not spot a divergence from normality, either through a ks.test over 10⁸ simulations or in checking the tails.

“Only the inversion method is inadmissible because it is slower and less space efficient than all of the other methods, the table methods excepted”. Luc Devroye, Non-uniform random variate generation, 1985

Update: As pointed out by Radford Neal in his comment, the above comparison is meaningless because the function rnorm() is by default based on the inversion of qnorm()! As indicated by Alexander Blocker in another comment, using another generator requires calling RNGkind() as in

RNGkind(normal.kind = "Box-Muller")

(And thanks to Jean-Louis Foulley for salvaging this quote from Luc Devroye, which does not appear to apply to the current coding of the Gaussian inverse cdf.)
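
For a comparison that actually pits the two methods against one another, here is a minimal sketch that codes the Box-Muller formula above by hand next to the inverse cdf transform, and then times rnorm() under both normal generators. The seed, sample sizes and object names are arbitrary choices of mine, and the timings will obviously vary across machines and R versions.

```r
set.seed(2015)   # arbitrary seed, for reproducibility only
n <- 1e4         # kept small so that ties in runif() are unlikely in ks.test

# Box-Muller transform, coded directly from the formula
u1 <- runif(n)
u2 <- runif(n)
x_bm <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)

# Inverse cdf transform
x_inv <- qnorm(runif(n))

# Kolmogorov-Smirnov checks against the standard normal cdf
ks_bm  <- ks.test(x_bm, "pnorm")
ks_inv <- ks.test(x_inv, "pnorm")
print(c(box_muller = ks_bm$p.value, inversion = ks_inv$p.value))

# Timing rnorm() under Box-Muller, then under the original normal generator
m <- 1e6
old_kind <- RNGkind()                      # element 2 is the normal generator
RNGkind(normal.kind = "Box-Muller")
t_bm <- system.time(rnorm(m))
RNGkind(normal.kind = old_kind[2])         # restore the previous setting
t_inv <- system.time(rnorm(m))
print(rbind(box_muller = t_bm[1:3], original = t_inv[1:3]))
```

Restoring the generator at the end matters since RNGkind() changes the state of the whole session.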

## which parameters are U-estimable?

Posted in Books, Kids, Statistics, University life on January 13, 2015 by xi'an

Today (01/06) was a double epiphany in that I realised that one of my long-time beliefs about unbiased estimators did not hold. Indeed, when checking on Cross Validated, I found this question: For which distributions is there a closed-form unbiased estimator for the standard deviation? And the presentation includes the normal case for which indeed there exists an unbiased estimator of σ, namely

$\frac{\Gamma(\{n-1\}/{2})}{\Gamma({n}/{2})}2^{-1/2}\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}$

which derives directly from the chi-square distribution of the sum of squares divided by σ². When thinking further about it, if only a posteriori, it is now fairly obvious given that σ is a scale parameter. Better, any power of σ can be similarly estimated in an unbiased manner, since

$\mathbb{E}\left[\left\{\sum_{i=1}^n(x_i-\bar{x})^2\right\}^\alpha\right] \propto\sigma^{2\alpha}\,.$

And this property extends to all location-scale models.
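
As a sanity check of the normal case, here is a quick simulation sketch comparing the average of the above estimator with the average of the usual sample standard deviation. The values of n, σ and the number of replications are arbitrary choices of mine.

```r
set.seed(1)      # arbitrary seed
n <- 5           # sample size
sigma <- 2       # true standard deviation
reps <- 1e5      # number of simulated samples

# One simulated sample per row
x <- matrix(rnorm(n * reps, sd = sigma), nrow = reps)

# Sum of squared deviations from the mean, sample by sample
ss <- rowSums((x - rowMeans(x))^2)

# Normalising constant Gamma((n-1)/2) / {Gamma(n/2) sqrt(2)}, on the log scale
c_n <- exp(lgamma((n - 1) / 2) - lgamma(n / 2)) / sqrt(2)

mean_unbiased <- mean(c_n * sqrt(ss))     # should be close to sigma
mean_sd <- mean(sqrt(ss / (n - 1)))       # usual sd(), biased downwards
print(c(unbiased = mean_unbiased, usual_sd = mean_sd, truth = sigma))
```

The matrix layout avoids an explicit loop: subtracting `rowMeans(x)` recycles along columns, which centres each row by its own mean.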

So how on Earth was I so convinced that there was no unbiased estimator of σ?! I think it stems from reading too quickly a result in, I think, Lehmann and Casella, a result due to Peter Bickel and Erich Lehmann that states that, for a convex family of distributions F, there exists an unbiased estimator of a functional q(F) (for a sample size n large enough) if and only if q(αF+(1−α)G) is a polynomial in 0≤α≤1. Because of this, I had this impression that only polynomials of the natural parameters of exponential families can be estimated by unbiased estimators… Note that Bickel’s and Lehmann’s theorem does not apply to the problem here because the collection of Gaussian distributions is not convex (a mixture of Gaussians is not a Gaussian).

This leaves open the question as to which transforms of the parameter(s) are unbiasedly estimable (or U-estimable) for a given parametric family, like the normal N(μ,σ²). I checked in Lehmann’s first edition earlier today and could not find an answer, besides the definition of U-estimability. Not only is the question interesting per se, but the answer could come to correct my long-standing impression that unbiasedness is a rare event, i.e., that the collection of transforms of the model parameter that are U-estimable is a very small subset of the whole collection of transforms.

## Bhattacharyya distance versus Kullback-Leibler divergence

Posted in Books, Kids, Statistics on January 10, 2015 by xi'an

Another question I picked up on Cross Validated during the Yule break is about the connection between the Bhattacharyya distance and the Kullback-Leibler divergence, i.e.,

$d_B(p,q)=-\log\left\{\int\sqrt{p(x)q(x)}\,\text{d}x\right\}$

and

$d_{KL}(p\|q)=\int\log\left\{{p(x)}\big/{q(x)}\right\}\,p(x)\,\text{d}x$

Although this Bhattacharyya distance sounds close to the Hellinger distance,

$d_H(p,q)=\left\{1-\int\sqrt{p(x)q(x)}\,\text{d}x\right\}^{1/2}$

the ordering I got by a simple Jensen inequality is

$d_{KL}(p\|q)\ge2d_B(p,q)\ge2d_H(p,q)^2\,.$

and I wonder how useful this ordering could be…
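
To get a feeling for how tight the ordering is, here is a numerical sketch on a pair of densities picked for illustration only, p being N(0,1) and q being N(1,1). The integrals are computed over a finite range outside of which both densities are negligible, which is an approximation of mine.

```r
# Illustrative pair of densities: p = N(0,1), q = N(1,1)
p <- function(x) dnorm(x, mean = 0, sd = 1)
q <- function(x) dnorm(x, mean = 1, sd = 1)

# Finite integration range, both tails beyond it being negligible
lo <- -12
hi <- 13

# Bhattacharyya coefficient, then Bhattacharyya and squared Hellinger distances
bc <- integrate(function(x) sqrt(p(x) * q(x)), lo, hi)$value
d_B <- -log(bc)
d_H2 <- 1 - bc

# Kullback-Leibler divergence of q from p, using log densities for stability
log_ratio <- function(x) dnorm(x, 0, 1, log = TRUE) - dnorm(x, 1, 1, log = TRUE)
d_KL <- integrate(function(x) p(x) * log_ratio(x), lo, hi)$value

print(c(KL = d_KL, twice_B = 2 * d_B, twice_H2 = 2 * d_H2))
```

In this Gaussian case the closed forms are available, the divergence being 1/2 and the Bhattacharyya distance 1/8, so the first inequality is far from tight while the second one is nearly an equality.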