## are there a frequentist and a Bayesian likelihoods?

Posted in Statistics with tags , , , , , , , , , , on June 7, 2018 by xi'an

A question that came up on X validated and led me to spot rather poor entries in Wikipedia about both the likelihood function and Bayes’ Theorem. Where unnecessary and confusing distinctions are made between the frequentist and Bayesian versions of these notions. I have already discussed the later (Bayes’ theorem) a fair amount here. The discussion about the likelihood is quite bemusing, in that the likelihood function is the … function of the parameter equal to the density indexed by this parameter at the observed value.

“What we can find from a sample is the likelihood of any particular value of r, if we define the likelihood as a quantity proportional to the probability that, from a population having the particular value of r, a sample having the observed value of r, should be obtained.” R.A. Fisher, On the “probable error’’ of a coefficient of correlation deduced from a small sample. Metron 1, 1921, p.24

By mentioning an informal side to likelihood (rather than to likelihood function), and then stating that the likelihood is not a probability in the frequentist version but a probability in the Bayesian version, the W page makes a complete and unnecessary mess. Whoever is ready to rewrite this introduction is more than welcome! (Which reminded me of an earlier question also on X validated asking why a common reference measure was needed to define a likelihood function.)

This also led me to read a recent paper by Alexander Etz, whom I met at E.J. Wagenmakers‘ lab in Amsterdam a few years ago. Following Fisher, as Jeffreys complained about

“..likelihood, a convenient term introduced by Professor R.A. Fisher, though in his usage it is sometimes multiplied by a constant factor. This is the probability of the observations given the original information and the hypothesis under discussion.” H. Jeffreys, Theory of Probability, 1939, p.28

Alexander defines the likelihood up to a constant, which causes extra-confusion, for free!, as there is no foundational reason to introduce this degree of freedom rather than imposing an exact equality with the density of the data (albeit with an arbitrary choice of dominating measure, never neglect the dominating measure!). The paper also repeats the message that the likelihood is not a probability (density, missing in the paper). And provides intuitions about maximum likelihood, likelihood ratio and Wald tests. But does not venture into a separate definition of the likelihood, being satisfied with the fundamental notion to be plugged into the magical formula

posteriorprior×likelihood

## fiducial inference

Posted in Books, Mountains, pictures, Running, Statistics, Travel with tags , , , , , , , , , , on October 30, 2017 by xi'an

In connection with my recent tale of the many ε’s, I received from Gunnar Taraldsen [from Tronheim, Norge] a paper [jointly written with Bo Lindqvist and just appeared on-line in JSPI] on conditional fiducial models.

“The role of the prior and the statistical model in Bayesian analysis is replaced by the use of the fiducial model x=R(θ,ε) in fiducial inference. The fiducial is obtained in this case without a prior distribution for the parameter.”

Reading this paper after addressing the X validated question made me understood better the fundamental wrongness of fiducial analysis! If I may herein object to Fisher himself… Indeed, when writing x=R(θ,ε), as the representation of the [observed] random variable x as a deterministic transform of a parameter θ and of an [unobserved] random factor ε, the two random variables x and ε are based on the same random preimage ω, i.e., x=x(ω) and ε=ε(ω). Observing x hence sets a massive constraint on the preimage ω and on the conditional distribution of ε=ε(ω). When the fiducial inference incorporates another level of randomness via an independent random variable ε’ and inverts x=R(θ,ε’) into θ=θ(x,ε’), assuming there is only one solution to the inversion, it modifies the nature of the underlying σ-algebra into something that is incompatible with the original model. Because of this sudden duplication of the random variates. While the inversion of this equation x=R(θ,ε’) gives an idea of the possible values of θ when ε varies according to its [prior] distribution, it does not account for the connection between x and ε. And does not turn the original parameter into a random variable with an implicit prior distribution.

As to conditional fiducial distributions, they are defined by inversion of x=R(θ,ε), under a certain constraint on θ, like C(θ)=0, which immediately raises a Pavlovian reaction in me, namely that since the curve C(θ)=0 has measure zero under the original fiducial distribution, how can this conditional solution be uniquely or at all defined. Or to avoid the Borel paradox mentioned in the paper. If I get the meaning of the authors in this section, the resulting fiducial distribution will actually depend on the choice of σ-algebra governing the projection.

“A further advantage of the fiducial approach in the case of a simple fiducial model is that independent samples are produced directly from independent sampling from [the fiducial distribution]. Bayesian simulations most often come as dependent samples from a Markov chain.”

This side argument in “favour” of the fiducial approach is most curious as it brings into the picture computational aspects that do not have any reason to be there. (The core of the paper is concerned with the unicity of the fiducial distribution in some univariate settings. Not with computational issues.)

## generating from a failure rate function [X’ed]

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , on July 4, 2015 by xi'an

While I now try to abstain from participating to the Cross Validated forum, as it proves too much of a time-consuming activity with little added value (in the sense that answers are much too often treated as disposable napkins by users who cannot be bothered to open a textbook and who usually do not exhibit any long-term impact of the provided answer, while clogging the forum with so many questions that the individual entries seem to get so little traffic, when compared say with the stackoverflow forum, to the point of making the analogy with disposable wipes more appropriate!), I came across a truly interesting question the other night. Truly interesting for me in that I had never considered the issue before.

The question is essentially wondering at how to simulate from a distribution defined by its failure rate function, which is connected with the density f of the distribution by

$\eta(t)=\frac{f(t)}{\int_t^\infty f(x)\,\text{d}x}=-\frac{\text{d}}{\text{d}t}\,\log \int_t^\infty f(x)\,\text{d}x$

From a purely probabilistic perspective, defining the distribution through f or through η is equivalent, as shown by the relation

$F(t)=1-\exp\left\{-\int_0^t\eta(x)\,\text{d}x\right\}$

but, from a simulation point of view, it may provide a different entry. Indeed, all that is needed is the ability to solve (in X) the equation

$\int_0^X\eta(x)\,\text{d}x=-\log(U)$

when U is a Uniform (0,1) variable. Which may help in that it does not require a derivation of f. Obviously, this also begs the question as to why would a distribution be defined by its failure rate function.

## probabilistic numerics

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , on April 27, 2015 by xi'an

I attended an highly unusual workshop while in Warwick last week. Unusual for me, obviously. It was about probabilistic numerics, i.e., the use of probabilistic or stochastic arguments in the numerical resolution of (possibly) deterministic problems. The notion in this approach is fairly Bayesian in that it makes use to prior information or belief about the quantity of interest, e.g., a function, to construct an usually Gaussian process prior and derive both an estimator that is identical to a numerical method (e.g., Runge-Kutta or trapezoidal integration) and uncertainty or variability around this estimator. While I did not grasp much more than the classy introduction talk by Philipp Hennig, this concept sounds fairly interesting, if only because of the Bayesian connection, and I wonder if we will soon see a probability numerics section at ISBA! More seriously, placing priors on functions or functionals is a highly formal perspective (as in Bayesian non-parametrics) and it makes me wonder how much of the data (evaluation of a function at a given set of points) and how much of the prior is reflected in the output [variability]. (Obviously, one could also ask a similar question for statistical analyses!)  For instance, issues of singularity arise among those stochastic process priors.

Another question that stemmed from this talk is whether or not more efficient numerical methods can derived that way, in addition to recovering the most classical ones. Somewhat, somehow, given the idealised nature of the prior, it feels like priors could be more easily compared or ranked than in classical statistical problems. Since the aim is to figure out the value of an integral or the solution to an ODE. (Or maybe not, since again almost the same could be said about estimating a normal mean.)

## stopping rule impact

Posted in Books, R, Statistics, University life with tags , , , on May 9, 2014 by xi'an

Here is a question from my friend Shravan Vasishth about the consequences of using a stopping rule:

Psycholinguists and psychologists often adopt the following type of data-gathering procedure: The experimenter gathers n data points, then checks for significance (p<0.05 or not). If it’s not significant, he gets more data (n more data points). Since time and money are limited, he might decide to stop anyway at sample size, say, some multiple of n.  One can play with different scenarios here. A typical n might be 10 or 15.

This approach would give us a distribution of t-values and p-values under repeated sampling. Theoretically, under the standard assumptions of frequentist methods, we expect a Type I error to be 0.05. This is the case in standard analyses (I also track the t-statistic, in order to compare it with my stopping rule code below).

Here’s a simulation showing what happens. I wanted to ask you whether this simulation makes sense. I assume here that the experimenter gathers 10 data points, then checks for significance (p<0.05 or not). If it’s not significant, he gets more data (10 more data points). Since time and money are limited, he might decide to stop anyway at sample size 60. This gives us p-values under repeated sampling. Theoretically, under the standard assumptions of frequentist methods, we expect a Type I error to be 0.05. This is the case in standard analyses:
##Standard:
pvals<-NULL
tstat_standard<-NULL
n<-10 # sample size
nsim<-1000 # number of simulations
stddev<-1 # standard dev
mn<-0 ## mean

for(i in 1:nsim){
samp<-rnorm(n,mean=mn,sd=stddev)
pvals[i]<-t.test(samp)$p.value tstat_standard[i]<-t.test(samp)$statistic}

## Type I error rate: about 5% as theory says:
table(pvals<0.05)[2]/nsim


But the situation quickly deteriorates as soon as adopt the strategy I outlined above:

pvals<-NULL
tstat<-NULL
## how many subjects can I run?
upper_bound<-n*6

for(i in 1:nsim){
## at the outset we have no significant result:
significant<-FALSE
## null hyp is going to be true,
## so any rejection is a mistake.
## take sample:
x<-rnorm(n,mean=mn,sd=stddev)
while(!significant & length(x)<upper_bound){
## if not significant:
if(t.test(x)$p.value>0.05){ ## get more data: x<-append(x,rnorm(n,mean=mn,sd=stddev)) ## otherwise stop: } else {significant<-TRUE}} ## will be either significant or not: pvals[i]<-t.test(x)$p.value
tstat[i]<-t.test(x)\$statistic}


Now let’s compare the distribution of the t-statistic in the standard case vs with the above stopping rule. We get fatter tails with the above stopping rule, as shown by the histogram below.

Is this a correct way to think about the stopping rule problem?

To which I replied the following:

By adopting a stopping rule on a random iid sequence, you favour values in the sequence that agree with your stopping condition, hence modify the distribution of the outcome. To take an extreme example, if you draw N(0,1) variates until the empirical average is between -2 and 2, the average thus produced cannot remain N(0,1/n) but have a different distribution.

The t-test statistic you build from your experiment is no longer distributed as a uniform variate because of the stopping rule: the sample(x1,…,x10m) (with random size 10m [resulting from increases in the sample size by adding 10 more observations at a time] is distributed from

$\prod_{i=1}^{10m} \phi(x_i) \times \prod_{j=1}^{m-1} \mathbb{I}_{t(x_1,\ldots,x_{10j})>.05} \times \mathbb{I}_{t(x_1,\ldots,x_{10m})<.05}$

if 10m<60 [assuming the maximal acceptable sample size is 60] and from

$\prod_{i=1}^{60} \phi(x_i) \times \prod_{j=1}^{5} \mathbb{I}_{t(x_1,\ldots,x_{10j})>.05}$

otherwise. The histogram at the top of this post is the empirical distribution of the average of those observations, clearly far from a normal distribution.