## hitting a wall

Posted in Books, Kids, R, Statistics, University life with tags , , , , , on July 5, 2018 by xi'an

Once in a while, or a wee bit more frequently (!), it proves impossible to communicate with a contributor of a question on X validated. A recent instance was about simulating from a multivariate kernel density estimate where the kernel terms at x¹,x²,… are Gaussian kernels applied to the inverses of the norms |x-x¹|, |x-x²|,… rather than to the norms as in the usual formulation. The reason for using this type of kernel is unclear, as it certainly does not converge to an estimate of the density of the sample x¹,x²,…  as the sample size grows, since it excludes a neighbourhood of each point in the sample. Since the kernel term tends to a non-zero constant at infinity, the support of the density estimate is restricted to the hypercube [0,1]x…x[0,1], again with unclear motivations. No mention being made of the bandwidth adopted for this kernel. If one takes this exotic density as a given, the question is rather straightforward as the support is compact, the density bounded and a vanilla accept-reject can be implemented. As illustrated by the massive number of comments on that entry, it did not work as the contributor adopted a fairly bellicose attitude about suggestions from moderators on that site and could not see the point in our requests for clarification, despite plotting a version of the kernel that had its maximum [and not its minimum] at x¹… After a few attempts, including writing a complete answer, from which the above graph is taken (based on an initial understanding of the support being for (x-x¹), …), I gave up and deleted all my entries.On that question.

## are there a frequentist and a Bayesian likelihoods?

Posted in Statistics with tags , , , , , , , , , , on June 7, 2018 by xi'an

A question that came up on X validated and led me to spot rather poor entries in Wikipedia about both the likelihood function and Bayes’ Theorem. Where unnecessary and confusing distinctions are made between the frequentist and Bayesian versions of these notions. I have already discussed the later (Bayes’ theorem) a fair amount here. The discussion about the likelihood is quite bemusing, in that the likelihood function is the … function of the parameter equal to the density indexed by this parameter at the observed value.

“What we can find from a sample is the likelihood of any particular value of r, if we define the likelihood as a quantity proportional to the probability that, from a population having the particular value of r, a sample having the observed value of r, should be obtained.” R.A. Fisher, On the “probable error’’ of a coefficient of correlation deduced from a small sample. Metron 1, 1921, p.24

By mentioning an informal side to likelihood (rather than to likelihood function), and then stating that the likelihood is not a probability in the frequentist version but a probability in the Bayesian version, the W page makes a complete and unnecessary mess. Whoever is ready to rewrite this introduction is more than welcome! (Which reminded me of an earlier question also on X validated asking why a common reference measure was needed to define a likelihood function.)

This also led me to read a recent paper by Alexander Etz, whom I met at E.J. Wagenmakers‘ lab in Amsterdam a few years ago. Following Fisher, as Jeffreys complained about

“..likelihood, a convenient term introduced by Professor R.A. Fisher, though in his usage it is sometimes multiplied by a constant factor. This is the probability of the observations given the original information and the hypothesis under discussion.” H. Jeffreys, Theory of Probability, 1939, p.28

Alexander defines the likelihood up to a constant, which causes extra-confusion, for free!, as there is no foundational reason to introduce this degree of freedom rather than imposing an exact equality with the density of the data (albeit with an arbitrary choice of dominating measure, never neglect the dominating measure!). The paper also repeats the message that the likelihood is not a probability (density, missing in the paper). And provides intuitions about maximum likelihood, likelihood ratio and Wald tests. But does not venture into a separate definition of the likelihood, being satisfied with the fundamental notion to be plugged into the magical formula

posteriorprior×likelihood

## prior against truth!

Posted in Books, Kids, Statistics with tags , , , , , , , on June 4, 2018 by xi'an

A question from X validated had interesting ramifications, about what happens when the prior does not cover the true value of the parameter (assuming there ? In fact, not so much in that, from a decision theoretic perspective, the fact that that π(θ⁰)=0, or even that π(θ)=0 in a neighbourhood of θ⁰ does not matter [too much]. Indeed, the formal derivation of a Bayes estimator as minimising the posterior loss means that the resulting estimator may take values that were “impossible” from a prior perspective! Indeed, taking for example the posterior mean, the convex combination of all possible values of θ under π may well escape the support of π when this support is not convex. Of course, one could argue that estimators should further be restricted to be possible values of θ under π but that would reduce their decision theoretic efficiency.

An example is the brilliant minimaxity result by George Casella and Bill Strawderman from 1981: when estimating a Normal mean μ based on a single observation xwith the additional constraint that |μ|<ρ, and when ρ is small enough, ρ1.0567 quite specifically, the minimax estimator for this problem under squared error loss corresponds to a (least favourable) uniform prior on the pair {ρ,ρ}, meaning that π gives equal weight to ρ and ρ (and none to any other value of the mean μ). When ρ increases above this bound, the least favourable prior sees its support growing one point at a time, but remaining a finite set of possible values. However the posterior expectation, 𝔼[μ|x], can take any value on (ρ,ρ).

In an even broader suspension of belief (in the prior), it may be that the prior has such a restricted support that it cannot consistently estimate the (true value of the) parameter, but the associated estimator may remain admissible or minimax.

## best unbiased estimator of θ² for a Poisson model

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , on May 23, 2018 by xi'an

A mostly traditional question on X validated about the “best” [minimum variance] unbiased estimator of θ² from a Poisson P(θ) sample leads to the Rao-Blackwell solution

$\mathbb{E}[X_1X_2|\underbrace{\sum_{i=1}^n X_i}_S=s] = -\frac{s}{n^2}+\frac{s^2}{n^2}=\frac{s(s-1)}{n^2}$

and a similar estimator could be constructed for θ³, θ⁴, … With the interesting limitation that this procedure stops at the power equal to the number of observations (minus one?). But,  since the expectation of a power of the sufficient statistics S [with distribution P(nθ)] is a polynomial in θ, there is de facto no limitation. More interestingly, there is no unbiased estimator of negative powers of θ in this context, while this neat comparison on Wikipedia (borrowed from the great book of counter-examples by Romano and Siegel, 1986, selling for a mere \$180 on amazon!) shows why looking for an unbiased estimator of exp(-2θ) is particularly foolish: the only solution is (-1) to the power S [for a single observation]. (There is however a first way to circumvent the difficulty if having access to an arbitrary number of generations from the Poisson, since the Forsythe – von Neuman algorithm allows for an unbiased estimation of exp(-F(x)). And, as a second way, as remarked by Juho Kokkala below, a sample of at least two Poisson observations leads to a more coherent best unbiased estimator.)

## what is a large Kullback-Leibler divergence?

Posted in Books, Kids, pictures, Statistics with tags , , on May 2, 2018 by xi'an

A question that came up on X validated is about scaling a Kullback-Leibler divergence. A fairly interesting question in my opinion since this pseudo-distance is neither naturally nor universally scaled. Take for instance the divergence between two Gaussian

$\text{KL}(p, q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}$

which is scaled by the standard deviation of the second Normal. There is no absolute bound in this distance for which it can be seen as large. Bypassing the coding analogy from signal processing, which has never been clear to me, he only calibration I can think of is statistical, namely to figure out a value extreme for two samples from the same distribution. In the sense of the Kullback between the corresponding estimated distributions. The above is an illustration, providing the distribution of the Kullback-Leibler divergences from samples from a Gamma distribution, for sample sizes n=15 and n=150. The sample size obviously matters.

## ABC with no prior

Posted in Books, Kids, pictures with tags , , , , , , on April 30, 2018 by xi'an

“I’m trying to fit a complex model to some data that take a large amount of time to run. I’m also unable to write down a Likelihood function to this problem and so I turned to approximate Bayesian computation (ABC). Now, given the slowness of my simulations, I used Sequential ABC (…) In fact, contrary to the concept of Bayesian statistics (new knowledge updating old knowledge) I would like to remove all the influence of the priors from my estimates. “

A question from X validated where I have little to contribute as the originator of the problem had the uttermost difficulties to understand that ABC could not be run without a probability structure on the parameter space. Maybe a fiducialist in disguise?! To this purpose this person simulated from a collection of priors and took the best 5% across the priors, which is akin to either running a mixture prior or to use ABC for conducting prior choice, which reminds me of a paper of Toni et al. Not that it helps removing “all the influence of the priors”, of course…

An unrelated item of uninteresting trivia is that a question I posted in 2012 on behalf of my former student Gholamossein Gholami about the possibility to use EM to derive a Weibull maximum likelihood estimator (instead of sheer numerical optimisation) got over the 10⁴ views. But no answer so far!

## accelerating MCMC

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on April 11, 2018 by xi'an

As forecasted a rather long while ago (!), I wrote a short and incomplete survey on some approaches to accelerating MCMC. With the massive help of Victor Elvira (Lille), Nick Tawn (Warwick) and Changye Wu (Dauphine). Survey which current version just got arXived and which has now been accepted by WIREs Computational Statistics. The typology (and even the range of methods) adopted here is certainly mostly arbitrary, with suggestions for different divisions made by a very involved and helpful reviewer. While we achieved a quick conclusion to the review process, suggestions and comments are most welcome! Even if we cannot include every possible suggestion, just like those already made on X validated. (WIREs stands for Wiley Interdisciplinary Reviews and its dozen topics cover several fields, from computational stats to biology, to medicine, to engineering.)