## O’Bayes 19/3

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , on July 2, 2019 by xi'an

Nancy Reid gave the first talk of the [Canada] day, in an impressive comparison of all approaches in statistics that involve a distribution of sorts on the parameter, connected with the presentation she gave at BFF4 in Harvard two years ago, including safe Bayes options this time. This was related to several (most?) of the talks at the conference, given the level of worry (!) about the choice of a prior distribution. But the main assessment of the methods still seemed to be centred on a frequentist notion of calibration, meaning that epistemic interpretations of probabilities and hence most of Bayesian answers were disqualified from the start.

In connection with Nancy’s focus, Peter Hoff’s talk also concentrated on frequency valid confidence intervals in (linear) hierarchical models. Using prior information or structure to build better and shrinkage-like confidence intervals at a given confidence level. But not in the decision-theoretic way adopted by George Casella, Bill Strawderman and others in the 1980’s. And also making me wonder at the relevance of contemplating a fixed coverage as a natural goal. Above, a side result shown by Peter that I did not know and which may prove useful for Monte Carlo simulation.

Jaeyong Lee worked on a complex model for banded matrices that starts with a regular Wishart prior on the unrestricted space of matrices, computes the posterior and then projects this distribution onto the constrained subspace. (There is a rather consequent literature on this subject, including works by David Dunson in the past decade of which I was unaware.) This is a smart demarginalisation idea but I wonder a wee bit at the notion as the constrained space has measure zero for the larger model. This could explain for the resulting posterior not being a true posterior for the constrained model in the sense that there is no prior over the constrained space that could return such a posterior. Another form of marginalisation paradox. The crux of the paper is however about constructing a functional form of minimaxity. In his discussion of the paper, Guido Consonni provided a representation of the post-processed posterior (P³) that involves the Dickey-Savage ratio, sort of, making me more convinced of the connection.

As a lighter aside, one item of local information I should definitely have broadcasted more loudly and long enough in advance to the conference participants is that the University of Warwick is not located in ye olde town of Warwick, where there is no university, but on the outskirts of the city of Coventry, but not to be confused with the University of Coventry. Located in Coventry.

## posterior distribution missing the MLE

Posted in Books, Kids, pictures, Statistics with tags , , , , , , , on April 25, 2019 by xi'an

An X validated question as to why the MLE is not necessarily (well) covered by a posterior distribution. Even for a flat prior… Which in restrospect highlights the fact that the MLE (and the MAP) are invasive species in a Bayesian ecosystem. Since they do not account for the dominating measure. And hence do not fare well under reparameterisation. (As a very much to the side comment, I also managed to write an almost identical and simultaneous answer to the first answer to the question.)

## dominating measure

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , on March 21, 2019 by xi'an

Yet another question on X validated reminded me of a discussion I had once  with Jay Kadane when visiting Carnegie Mellon in Pittsburgh. Namely the fundamentally ill-posed nature of conjugate priors. Indeed, when considering the definition of a conjugate family as being a parameterised family Þ of distributions over the parameter space Θ stable under transform to the posterior distribution, this property is completely dependent (if there is such a notion as completely dependent!) on the dominating measure adopted on the parameter space Θ. Adopted is the word as there is no default, reference, natural, &tc. measure that promotes one specific measure on Θ as being the dominating measure. This is a well-known difficulty that also sticks out in most “objective Bayes” problems, as well as with maximum entropy priors. This means for instance that, while the Gamma distributions constitute a conjugate family for a Poisson likelihood, so do the truncated Gamma distributions. And so do the distributions which density (against a Lebesgue measure over an arbitrary subset of (0,∞)) is the product of a Gamma density by an arbitrary function of θ. I readily acknowledge that the standard conjugate priors as introduced in every Bayesian textbook are standard because they facilitate (to a certain extent) posterior computations. But, just like there exist an infinity of MaxEnt priors associated with an infinity of dominating measures, there exist an infinity of conjugate families, once more associated with an infinity of dominating measures. And the fundamental reason is that the sampling model (which induces the shape of the conjugate family) does not provide a measure on the parameter space Θ.

Posted in Books, pictures, Statistics with tags , , , , , , , , , , on January 28, 2019 by xi'an

An interesting paper came out on arXiv in early December, written by Michael Brand from Monash. It is about risk-adverse Bayes estimators, which are defined as avoiding the use of loss functions (although why avoiding loss functions is not made very clear in the paper). Close to MAP estimates, they bypass the dependence of said MAPs on parameterisation by maximising instead π(θ|x)/√I(θ), which is invariant by reparameterisation if not by a change of dominating measure. This form of MAP estimate is called the Wallace-Freeman (1987) estimator [of which I never heard].

The formal definition of a risk-adverse estimator is still based on a loss function in order to produce a proper version of the probability to be “wrong” in a continuous environment. The difference between estimator and true value θ, as expressed by the loss, is enlarged by a scale factor k pushed to infinity. Meaning that differences not in the immediate neighbourhood of zero are not relevant. In the case of a countable parameter space, this is essentially producing the MAP estimator. In the continuous case, for “well-defined” and “well-behaved” loss functions and estimators and density, including an invariance to parameterisation as in my own intrinsic losses of old!, which the author calls likelihood-based loss function,  mentioning f-divergences, the resulting estimator(s) is a Wallace-Freeman estimator (of which there may be several). I did not get very deep into the study of the convergence proof, which seems to borrow more from real analysis à la Rudin than from functional analysis or measure theory, but keep returning to the apparent dependence of the notion on the dominating measure, which bothers me.

## questioning the Bayesian choice

Posted in Books, Kids, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , on November 13, 2018 by xi'an

When I woke up (very early!) in Oaxaca, on Remembrance Day, I noticed a long question about The Bayesian Choice contents on X validated. The originator of the question (OQ) was puzzled about several statements in the book on maximum entropy  methods, from the nature of the moment constraints, outside standard moments, to the existence of a maximum entropy prior (as exemplified by quantiles), to the deeper issue of the ultimate arbitrariness of “the” maximum entropy prior since it is also determined by the choice of the dominating measure. (Never neglect the dominating measure!) A more challenging concept, hopefully making it home with the OQ.

## are there a frequentist and a Bayesian likelihoods?

Posted in Statistics with tags , , , , , , , , , , on June 7, 2018 by xi'an

A question that came up on X validated and led me to spot rather poor entries in Wikipedia about both the likelihood function and Bayes’ Theorem. Where unnecessary and confusing distinctions are made between the frequentist and Bayesian versions of these notions. I have already discussed the later (Bayes’ theorem) a fair amount here. The discussion about the likelihood is quite bemusing, in that the likelihood function is the … function of the parameter equal to the density indexed by this parameter at the observed value.

“What we can find from a sample is the likelihood of any particular value of r, if we define the likelihood as a quantity proportional to the probability that, from a population having the particular value of r, a sample having the observed value of r, should be obtained.” R.A. Fisher, On the “probable error’’ of a coefficient of correlation deduced from a small sample. Metron 1, 1921, p.24

By mentioning an informal side to likelihood (rather than to likelihood function), and then stating that the likelihood is not a probability in the frequentist version but a probability in the Bayesian version, the W page makes a complete and unnecessary mess. Whoever is ready to rewrite this introduction is more than welcome! (Which reminded me of an earlier question also on X validated asking why a common reference measure was needed to define a likelihood function.)

This also led me to read a recent paper by Alexander Etz, whom I met at E.J. Wagenmakers‘ lab in Amsterdam a few years ago. Following Fisher, as Jeffreys complained about

“..likelihood, a convenient term introduced by Professor R.A. Fisher, though in his usage it is sometimes multiplied by a constant factor. This is the probability of the observations given the original information and the hypothesis under discussion.” H. Jeffreys, Theory of Probability, 1939, p.28

Alexander defines the likelihood up to a constant, which causes extra-confusion, for free!, as there is no foundational reason to introduce this degree of freedom rather than imposing an exact equality with the density of the data (albeit with an arbitrary choice of dominating measure, never neglect the dominating measure!). The paper also repeats the message that the likelihood is not a probability (density, missing in the paper). And provides intuitions about maximum likelihood, likelihood ratio and Wald tests. But does not venture into a separate definition of the likelihood, being satisfied with the fundamental notion to be plugged into the magical formula

posteriorprior×likelihood

## MAP or mean?!

Posted in Statistics, Travel, University life with tags , , , on March 5, 2014 by xi'an

“A frequent matter of debate in Bayesian inversion is the question, which of the two principle point-estimators, the maximum-a-posteriori (MAP) or the conditional mean (CM) estimate is to be preferred.”

An interesting topic for this arXived paper by Burger and Lucka that I (also) read in the plane to Montréal, even though I do not share the concern that we should pick between those two estimators (only or at all), since what matters is the posterior distribution and the use one makes of it. I thus disagree there is any kind of a “debate concerning the choice of point estimates”. If Bayesian inference reduces to producing a point estimate, this is a regularisation technique and the Bayesian interpretation is both incidental and superfluous.

Maybe the most interesting result in the paper is that the MAP is expressed as a proper Bayes estimator! I was under the opposite impression, mostly because the folklore (and even The Bayesian Core)  have it that it corresponds to a 0-1 loss function does not hold for continuous parameter spaces and also because it seems to conflict with the results of Druihlet and Marin (BA, 2007), who point out that the MAP ultimately depends on the choice of the dominating measure. (Even though the Lebesgue measure is implicitly chosen as the default.) The authors of this arXived paper start with a distance based on the prior; called the Bregman distance. Which may be the quadratic or the entropy distance depending on the prior. Defining a loss function that is a mix of this Bregman distance and of the quadratic distance

$||K(\hat u-u)||^2+2D_\pi(\hat u,u)$

produces the MAP as the Bayes estimator. So where did the dominating measure go? In fact, nowhere: both the loss function and the resulting estimator are clearly dependent on the choice of the dominating measure… (The loss depends on the prior but this is not a drawback per se!)