## why is the likelihood not a pdf?

Posted in Books, pictures, Statistics, University life on January 4, 2021 by xi'an

The return of an old debate on X validated. Can the likelihood be a pdf?! Even though there exist cases where a [version of the] likelihood function shows such a symmetry between the sufficient statistic and the parameter, as e.g. in the Normal mean model, that they are somewhat exchangeable w.r.t. the same measure, the question is somewhat meaningless for a number of reasons that we can all link to Ronald Fisher:

1. when defining the likelihood function, Fisher (in his 1912 undergraduate memoir!) warns against integrating it w.r.t. the parameter: “the integration with respect to m is illegitimate and has no definite meaning with respect to inverse probability”. The likelihood “is a relative probability only, suitable to compare point with point, but incapable of being interpreted as a probability distribution over a region, or of giving any estimate of absolute probability.” And again in 1922: “[the likelihood] is not a differential element, and is incapable of being integrated: it is assigned to a particular point of the range of variation, not to a particular element of it”.
2. He introduced the term “likelihood” especially to avoid the confusion: “I perceive that the word probability is wrongly used in such a connection: probability is a ratio of frequencies, and about the frequencies of such values we can know nothing whatever (…) I suggest that we may speak without confusion of the likelihood of one value of p being thrice the likelihood of another (…) likelihood is not here used loosely as a synonym of probability, but simply to express the relative frequencies with which such values of the hypothetical quantity p would in fact yield the observed sample”.
3. Another point he makes repeatedly (both in 1912 and 1922) is the lack of invariance of the probability measure obtained by attaching a dθ to the likelihood function L(θ) and normalising it into a density: while the likelihood “is entirely unchanged by any [one-to-one] transformation”, this definition of a probability distribution is not. Fisher actually distanced himself from a Bayesian “uniform prior” throughout the 1920’s.

which sums up as the urge to never neglect the dominating measure!
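To make the invariance point in item 3 concrete, here is a minimal numerical sketch. It assumes a hypothetical Binomial experiment (3 successes in 10 trials, not taken from the original debate) and compares the mass given to the same set {θ < c} when the likelihood is normalised against dθ and against dφ, with φ = logit(θ). The function names are illustrative only.

```python
def likelihood(theta, x=3, n=10):
    """Binomial likelihood of theta for x successes in n trials (hypothetical data)."""
    return theta ** x * (1.0 - theta) ** (n - x)


def midpoint(f, lo, hi, m=20_000):
    """Midpoint-rule approximation of the integral of f over (lo, hi)."""
    h = (hi - lo) / m
    return h * sum(f(lo + (i + 0.5) * h) for i in range(m))


def mass_dtheta(c):
    """Mass of {theta < c} when L(theta) is normalised against d(theta)."""
    return midpoint(likelihood, 0.0, c) / midpoint(likelihood, 0.0, 1.0)


def mass_dphi(c):
    """Mass of the same set when L is normalised against d(phi), phi = logit(theta).

    Since d(phi) = d(theta) / (theta * (1 - theta)), the integral is rewritten in theta.
    """
    def integrand(t):
        return likelihood(t) / (t * (1.0 - t))

    return midpoint(integrand, 0.0, c) / midpoint(integrand, 0.0, 1.0)


if __name__ == "__main__":
    # Same set, same likelihood, two dominating measures: two different "probabilities".
    print(mass_dtheta(0.3), mass_dphi(0.3))
    # The likelihood ratio between two points involves no measure at all.
    print(likelihood(0.3) / likelihood(0.5))
```

With these numbers the two masses come out at roughly 0.43 and 0.54, while the likelihood ratio between two parameter values is the same whichever parameterisation is used to label them.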

## optimal choice among MCMC kernels

Posted in Statistics on March 14, 2019 by xi'an

Last week in Siem Reap, Florian Maire [who I discovered originates from a Norman town less than 10km from my hometown!] presented an arXived joint work with Pierre Vandekerkhove at the Data Science & Finance conference in Cambodia that considers the following problem: given a large collection of MCMC kernels, how to pick the best one and how to define what best means. Going by mixtures is a default exploration of the collection, as shown in Tierney (1994) for instance, since this improves on both kernels (esp. when each kernel is not irreducible on its own!). This paper considers a move to local weights in the mixture, weights that are not estimated from earlier simulations, contrary to what I first understood.
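As a rough illustration of the default mixture strategy (and not of the locally weighted algorithm of the paper), here is a minimal sketch with fixed weights. The 2-D target, the scales and all function names are assumptions made for the example: each random-walk Metropolis kernel moves a single coordinate, hence is not irreducible on its own, while the mixture explores both directions.

```python
import math
import random


def log_target(x):
    """Hypothetical peaked 2-D target: independent Normals, sd 1 in x[0], sd 0.1 in x[1]."""
    return -0.5 * x[0] ** 2 - 0.5 * (x[1] / 0.1) ** 2


def rw_kernel(coord, scale):
    """Random-walk Metropolis kernel that only ever moves coordinate `coord`."""
    def step(x, rng):
        y = list(x)
        y[coord] += rng.gauss(0.0, scale)
        log_ratio = log_target(y) - log_target(x)
        # Standard Metropolis acceptance, which keeps the target invariant.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            return y
        return list(x)

    return step


def mixture_chain(kernels, weights, x0, n_iter, seed=0):
    """Pick one kernel per iteration with fixed (state-independent) weights."""
    rng = random.Random(seed)
    x, path = list(x0), []
    for _ in range(n_iter):
        step = rng.choices(kernels, weights=weights)[0]
        x = step(x, rng)
        path.append(tuple(x))
    return path


if __name__ == "__main__":
    kernels = [rw_kernel(0, 1.0), rw_kernel(1, 0.1)]
    path = mixture_chain(kernels, [0.5, 0.5], (3.0, 0.5), 20_000)
    kept = path[2_000:]  # crude burn-in
    print(sum(p[0] for p in kept) / len(kept), sum(p[1] for p in kept) / len(kept))
```

Since the weights here do not depend on the current state, no extra correction is needed: a fixed-weight mixture of π-invariant kernels is itself π-invariant.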

As made clearer in the paper, the focus is on filamentary distributions that are concentrated near lower-dimension sets or manifolds, since the components of the kernel collections can then be restricted to directions of these manifolds… Including an interesting case of a 2-D highly peaked target where converging means mostly simulating in x¹ and covering the target means mostly simulating in x². Exhibiting a schizophrenic tension between the two goals. Making the weights locally dependent means a correction by a Metropolis step, with cost O(n). What of Rao-Blackwellisation of these mixture weights, from weight × transition to full mixture, as in our PMC paper? Unclear to me as well [during the talk] was the use in the mixture of basic Metropolis kernels, which are not absolutely continuous, because of the Dirac mass component. But this is clarified by Section 5 in the paper. A surprising result from the paper (Corollary 1) is that the use of local weights ω(i,x) that depend on the current value of the chain does not jeopardize the stationary measure π(.) of the mixture chain. Which may be due to the fact that all components of the mixture are already π-invariant. Or that the index of the kernel constitutes an auxiliary (if ancillary) variate. (Algorithm 1 in the paper reminds me of delayed acceptance. Making me wonder if computing time should be accounted for.) A final question I briefly discussed with Florian is the extension to weights that are automatically constructed from the simulations and the target.

Posted in Books, pictures, Statistics on January 28, 2019 by xi'an

An interesting paper came out on arXiv in early December, written by Michael Brand from Monash. It is about risk-adverse Bayes estimators, which are defined as avoiding the use of loss functions (although why one should avoid loss functions is not made very clear in the paper). Close to MAP estimates, they bypass the dependence of said MAPs on parameterisation by maximising instead π(θ|x)/√I(θ), which is invariant under reparameterisation if not under a change of dominating measure. This form of MAP estimate is called the Wallace-Freeman (1987) estimator [of which I never heard].
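The invariance claim can be checked numerically on a toy case. The sketch below assumes a hypothetical Binomial model (3 successes in 10 trials) with a flat prior on θ, neither of which comes from the paper, and compares the MAP and the maximiser of π(θ|x)/√I(θ) in the θ and φ = logit(θ) parameterisations, using a plain grid search.

```python
import math

X, N = 3, 10  # hypothetical data: 3 successes out of 10 trials


def sigmoid(phi):
    return 1.0 / (1.0 + math.exp(-phi))


def post_theta(t):
    """Unnormalised posterior density of theta under a flat prior on theta."""
    return t ** X * (1.0 - t) ** (N - X)


def fisher_theta(t):
    """Fisher information of the Binomial(N, theta) model."""
    return N / (t * (1.0 - t))


def post_phi(phi):
    """Posterior density of phi = logit(theta): density of theta times the Jacobian."""
    t = sigmoid(phi)
    return post_theta(t) * t * (1.0 - t)


def fisher_phi(phi):
    """Fisher information in phi: information in theta times the squared Jacobian."""
    t = sigmoid(phi)
    return fisher_theta(t) * (t * (1.0 - t)) ** 2


def argmax(f, grid):
    return max(grid, key=f)


THETA_GRID = [i / 10_000 for i in range(1, 10_000)]
PHI_GRID = [-8.0 + 16.0 * i / 100_000 for i in range(100_001)]

map_theta = argmax(post_theta, THETA_GRID)
map_phi = sigmoid(argmax(post_phi, PHI_GRID))
wf_theta = argmax(lambda t: post_theta(t) / math.sqrt(fisher_theta(t)), THETA_GRID)
wf_phi = sigmoid(argmax(lambda p: post_phi(p) / math.sqrt(fisher_phi(p)), PHI_GRID))

if __name__ == "__main__":
    print("MAP in theta vs MAP in phi:", map_theta, map_phi)
    print("WF in theta vs WF in phi:  ", wf_theta, wf_phi)
```

In this example the two MAP estimates disagree once mapped back to θ, while the two Wallace-Freeman maximisers agree up to grid precision, since the Jacobian of the density and the square root of the transformed information cancel.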

The formal definition of a risk-adverse estimator is still based on a loss function, in order to produce a proper version of the probability to be “wrong” in a continuous environment. The difference between estimator and true value θ, as expressed by the loss, is enlarged by a scale factor k pushed to infinity. Meaning that differences not in the immediate neighbourhood of zero are not relevant. In the case of a countable parameter space, this is essentially producing the MAP estimator. In the continuous case, for “well-defined” and “well-behaved” loss functions, estimators and densities, the resulting estimator(s) is a Wallace-Freeman estimator (of which there may be several). The conditions include an invariance to parameterisation, as in my own intrinsic losses of old!, for what the author calls likelihood-based loss functions, mentioning f-divergences. I did not get very deep into the study of the convergence proof, which seems to borrow more from real analysis à la Rudin than from functional analysis or measure theory, but keep returning to the apparent dependence of the notion on the dominating measure, which bothers me.

## O’Bayes in action

Posted in Books, Kids, Statistics, University life on November 7, 2017 by xi'an

My next-door colleague [at Dauphine] François Simenhaus shared a paradox [to be developed in an incoming test!] with Julien Stoehr and me last week, namely that, when selecting the largest number between a [observed] and b [unobserved], drawing a random boundary on a [meaning that a is chosen iff a is larger than this boundary] increases the probability to pick the largest number above ½…

When thinking about it in the wretched RER train [train that got immobilised for at least two hours just a few minutes after I went through!, good luck to the passengers travelling to the airport…] to De Gaulle airport, I lost the argument: writing Φ for the cdf of the random boundary, if a<b, the probability [for this random bound] to be larger than a and hence for selecting b is 1-Φ(a), while, if a>b, the probability [of winning] is Φ(a). Hence the only case when the probability is ½ is when a is the median of this random variable. But, when discussing the issue further with Julien, I exposed an interesting non-informative prior characterisation. Namely, if I assume a,b to be iid U(0,M) and set an improper prior 1/M on M, the conditional probability that b>a given a is ½. Furthermore, the posterior probability to pick the right [largest] number with François’s randomised rule is also ½, no matter what the distribution of the random boundary is. Now, the most surprising feature of this coffee room derivation is that these properties only hold for the prior 1/M. Any other power of M will induce an asymmetry between a and b. (The same properties hold when a,b are iid Exp(M).) Of course, this is not absolutely unexpected since 1/M is the invariant prior and since the “intuitive” symmetry only holds under this prior. Power to O’Bayes!
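The coffee room derivation can be replayed numerically. This is a sketch under the stated U(0,M) model, with an improper prior M^(-k) where the exponent k is a notation introduced here: given a, the posterior on M is proportional to M^(-(k+1)) on M>a, so that u=a/M has density k·u^(k-1) on (0,1), and P(b>a|a,M)=1-u. The function names are illustrative.

```python
def prob_b_larger(k, m=100_000):
    """P(b > a | a) for a, b iid U(0, M) under the improper prior M**(-k), k > 0.

    Integrates (1 - u) * k * u**(k - 1) over (0, 1) by the midpoint rule, where
    u = a / M; the answer does not depend on a. Intended for k >= 1, where the
    integrand is bounded.
    """
    h = 1.0 / m
    total = 0.0
    for i in range(m):
        u = (i + 0.5) * h
        total += (1.0 - u) * k * u ** (k - 1)
    return h * total


def win_prob(k, cdf_at_a):
    """Posterior probability that the randomised rule returns max(a, b).

    cdf_at_a is the boundary cdf evaluated at a: a is kept with that probability
    and wins when b < a, while b is selected otherwise and wins when b > a.
    """
    p = prob_b_larger(k)
    return (1.0 - p) * cdf_at_a + p * (1.0 - cdf_at_a)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, prob_b_larger(k), win_prob(k, 0.2), win_prob(k, 0.9))
```

The integral is available in closed form as 1/(k+1), which equals ½ only for k=1, and in that case the winning probability stays at ½ whatever value the boundary cdf takes at a.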

When discussing again the matter with François yesterday, I realised I had changed his wording of the puzzle. The original setting is one with two cards hiding the unknown numbers a and b and of a player picking one of the cards. If the player picks a card at random, there is indeed a probability of ½ of picking the largest number. If the decision to switch or not depends on an independent random draw being larger or smaller than the number on the observed card, the probability to get max(a,b) in the end hits 1 when this random draw falls into (a,b) and remains ½ outside (a,b). Randomisation pays.
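The two-card version is easy to simulate. The sketch below assumes a standard Normal random draw and fixed hidden numbers a and b, both arbitrary choices for the illustration; any distribution with positive mass on (a,b) would do. The player keeps the observed card when its value exceeds the draw and switches otherwise.

```python
import math
import random


def normal_cdf(x):
    """Standard Normal cdf, used here as an assumed distribution for the draw."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def exact_win_prob(a, b, cdf=normal_cdf):
    """Winning probability: 1 when the draw falls between the cards, 1/2 otherwise."""
    lo, hi = min(a, b), max(a, b)
    inside = cdf(hi) - cdf(lo)
    return inside + 0.5 * (1.0 - inside)


def simulate(a, b, n_games=100_000, seed=1):
    """Monte Carlo version of the game with a standard Normal random draw."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_games):
        # The player first picks one of the two cards at random.
        seen, hidden = (a, b) if rng.random() < 0.5 else (b, a)
        threshold = rng.gauss(0.0, 1.0)
        # Keep the observed card if it beats the random draw, switch otherwise.
        final = seen if seen > threshold else hidden
        wins += final == max(a, b)
    return wins / n_games


if __name__ == "__main__":
    print(exact_win_prob(0.0, 1.0), simulate(0.0, 1.0))
```

The exact value is ½ + ½·P(draw in (a,b)), so the gain over ½ is only as large as the chance of the draw separating the two cards.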

## Pitman medal for Kerrie Mengersen

Posted in pictures, Statistics, Travel, University life on December 20, 2016 by xi'an

My friend and co-author of many years, Kerrie Mengersen, just received the 2016 Pitman Medal, which is the prize of the Statistical Society of Australia. Congratulations to Kerrie for a well-deserved reward of her massive contributions to Australian, Bayesian, computational, modelling statistics, and to data science as a whole. (In case you wonder about the picture above, she has not yet lost the medal, but is instead looking for jaguars in the Amazon.)

This medal is named after EJG Pitman, Australian probabilist and statistician, whose name is attached to an estimator, a lemma, a measure of efficiency, a test, and a measure of comparison between estimators. His estimator is the best equivariant (or invariant) estimator, which can be expressed as a Bayes estimator under the relevant right Haar measure, despite having no Bayesian motivation to start with. His lemma is the Pitman-Koopman-Darmois lemma, which states that outside exponential families, sufficiency is essentially useless (except for exotic distributions like the Uniform distributions). Darmois published the result first in 1935, but in French in the Comptes Rendus de l’Académie des Sciences. And the measure of comparison is Pitman nearness or closeness, on which I wrote a paper with my friends Gene Hwang and Bill Strawderman, a paper that we thought was the final paper on the measure as it was pointing out several major deficiencies with this concept. But the literature continued to grow after that..!