## why is the likelihood not a pdf?

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , on January 4, 2021 by xi'an

The return of an old debate on X validated. Can the likelihood be a pdf?! Even though there exist cases where a [version of the] likelihood function shows such a symmetry between the sufficient statistic and the parameter, as e.g. in the Normal mean model, that they are somewhat exchangeable w.r.t. the same measure, the question is somewhat meaningless for a number of reasons that we can all link to Ronald Fisher:

1. when defining the likelihood function, Fisher (in his 1912 undergraduate memoir!) warns against integrating it w.r.t. the parameter: “the integration with respect to m is illegitimate and has no definite meaning with respect to inverse probability”. The likelihood is “is a relative probability only, suitable to compare point with point, but incapable of being interpreted as a probability distribution over a region, or of giving any estimate of absolute probability.” And again in 1922: “[the likelihood] is not a differential element, and is incapable of being integrated: it is assigned to a particular point of the range of variation, not to a particular element of it”.
2. He introduced the term “likelihood” especially to avoid the confusion: “I perceive that the word probability is wrongly used in such a connection: probability is a ratio of frequencies, and about the frequencies of such values we can know nothing whatever (…) I suggest that we may speak without confusion of the likelihood of one value of p being thrice the likelihood of another (…) likelihood is not here used loosely as a synonym of probability, but simply to express the relative frequencies with which such values of the hypothetical quantity p would in fact yield the observed sample”.
3. Another point he makes repeatedly (both in 1912 and 1922) is the lack of invariance of the probability measure obtained by attaching a dθ to the likelihood function L(θ) and normalising it into a density: while the likelihood “is entirely unchanged by any [one-to-one] transformation”, this definition of a probability distribution is not. Fisher actually distanced himself from a Bayesian “uniform prior” throughout the 1920’s.

which sums up as the urge to never neglect the dominating measure!

## Mea Culpa

Posted in Statistics with tags , , , , , , , , , , , on April 10, 2020 by xi'an

[A quote from Jaynes about improper priors that I had missed in his book, Probability Theory.]

For many years, the present writer was caught in this error just as badly as anybody else, because Bayesian calculations with improper priors continued to give just the reasonable and clearly correct results that common sense demanded. So warnings about improper priors went unheeded; just that psychological phenomenon. Finally, it was the marginalization paradox that forced recognition that we had only been lucky in our choice of problems. If we wish to consider an improper prior, the only correct way of doing it is to approach it as a well-defined limit of a sequence of proper priors. If the correct limiting procedure should yield an improper posterior pdf for some parameter α, then probability theory is telling us that the prior information and data are too meager to permit any inferences about α. Then the only remedy is to seek more data or more prior information; probability theory does not guarantee in advance that it will lead us to a useful answer to every conceivable question.Generally, the posterior pdf is better behaved than the prior because of the extra information in the likelihood function, and the correct limiting procedure yields a useful posterior pdf that is analytically simpler than any from a proper prior. The most universally useful results of Bayesian analysis obtained in the past are of this type, because they tended to be rather simple problems, in which the data were indeed so much more informative than the prior information that an improper prior gave a reasonable approximation – good enough for all practical purposes – to the strictly correct results (the two results agreed typically to six or more significant figures).

In the future, however, we cannot expect this to continue because the field is turning to more complex problems in which the prior information is essential and the solution is found by computer. In these cases it would be quite wrong to think of passing to an improper prior. That would lead usually to computer crashes; and, even if a crash is avoided, the conclusions would still be, almost always, quantitatively wrong. But, since likelihood functions are bounded, the analytical solution with proper priors is always guaranteed to converge properly to finite results; therefore it is always possible to write a computer program in such a way (avoid underflow, etc.) that it cannot crash when given proper priors. So, even if the criticisms of improper priors on grounds of marginalization were unjustified,it remains true that in the future we shall be concerned necessarily with proper priors.

## my likelihood is dominating my prior [not!]

Posted in Kids, Statistics with tags , , , , , on August 29, 2019 by xi'an

An interesting misconception read on X validated today, with a confusion between the absolute value of the likelihood function and its variability. Which I have trouble explaining except possibly by the extrapolation from the discrete case and a confusion between the probability density of the data [scaled as a probability] and the likelihood function [scale-less]. I also had trouble convincing the originator of the question of the irrelevance of the scale of the likelihood per se, even when demonstrating that |$$𝚺|$$ could vanish from the posterior with no consequence whatsoever. It is only when I thought of the case when the likelihood is constant in $$𝜃$$ that I managed to make my case.

## are there a frequentist and a Bayesian likelihoods?

Posted in Statistics with tags , , , , , , , , , , on June 7, 2018 by xi'an

A question that came up on X validated and led me to spot rather poor entries in Wikipedia about both the likelihood function and Bayes’ Theorem. Where unnecessary and confusing distinctions are made between the frequentist and Bayesian versions of these notions. I have already discussed the later (Bayes’ theorem) a fair amount here. The discussion about the likelihood is quite bemusing, in that the likelihood function is the … function of the parameter equal to the density indexed by this parameter at the observed value.

“What we can find from a sample is the likelihood of any particular value of r, if we define the likelihood as a quantity proportional to the probability that, from a population having the particular value of r, a sample having the observed value of r, should be obtained.” R.A. Fisher, On the “probable error’’ of a coefficient of correlation deduced from a small sample. Metron 1, 1921, p.24

By mentioning an informal side to likelihood (rather than to likelihood function), and then stating that the likelihood is not a probability in the frequentist version but a probability in the Bayesian version, the W page makes a complete and unnecessary mess. Whoever is ready to rewrite this introduction is more than welcome! (Which reminded me of an earlier question also on X validated asking why a common reference measure was needed to define a likelihood function.)

This also led me to read a recent paper by Alexander Etz, whom I met at E.J. Wagenmakers‘ lab in Amsterdam a few years ago. Following Fisher, as Jeffreys complained about

“..likelihood, a convenient term introduced by Professor R.A. Fisher, though in his usage it is sometimes multiplied by a constant factor. This is the probability of the observations given the original information and the hypothesis under discussion.” H. Jeffreys, Theory of Probability, 1939, p.28

Alexander defines the likelihood up to a constant, which causes extra-confusion, for free!, as there is no foundational reason to introduce this degree of freedom rather than imposing an exact equality with the density of the data (albeit with an arbitrary choice of dominating measure, never neglect the dominating measure!). The paper also repeats the message that the likelihood is not a probability (density, missing in the paper). And provides intuitions about maximum likelihood, likelihood ratio and Wald tests. But does not venture into a separate definition of the likelihood, being satisfied with the fundamental notion to be plugged into the magical formula

posteriorprior×likelihood

## likelihood inflating sampling algorithm

Posted in Books, Statistics, University life with tags , , , , , , , , on May 24, 2016 by xi'an

My friends from Toronto Radu Craiu and Jeff Rosenthal have arXived a paper along with Reihaneh Entezari on MCMC scaling for large datasets, in the spirit of Scott et al.’s (2013) consensus Monte Carlo. They devised an likelihood inflated algorithm that brings a novel perspective to the problem of large datasets. This question relates to earlier approaches like consensus Monte Carlo, but also kernel and Weierstrass subsampling, already discussed on this blog, as well as current research I am conducting with my PhD student Changye Wu. The approach by Entezari et al. is somewhat similar to consensus Monte Carlo and the other solutions in that they consider an inflated (i.e., one taken to the right power) likelihood based on a subsample, with the full sample being recovered by importance sampling. Somewhat unsurprisingly this approach leads to a less dispersed estimator than consensus Monte Carlo (Theorem 1). And the paper only draws a comparison with that sub-sampling method, rather than covering other approaches to the problem, maybe because this is the most natural connection, one approach being the k-th power of the other approach.

“…we will show that [importance sampling] is unnecessary in many instances…” (p.6)

An obvious question that stems from the approach is the call for importance sampling, since the numerator of the importance sampler involves the full likelihood which is unavailable in most instances when sub-sampled MCMC is required. I may have missed the part of the paper where the above statement is discussed, but the only realistic example discussed therein is the Bayesian regression tree (BART) of Chipman et al. (1998). Which indeed constitutes a challenging if one-dimensional example, but also one that requires delicate tuning that leads to cancelling importance weights but which may prove delicate to extrapolate to other models.