Archive for loss function

the three i’s of poverty

Posted in Books, pictures, Statistics, Travel, University life on September 15, 2019 by xi'an

Today I made a “quick” (10h door to door!) round trip visit to Marseille (by train) to take part in the PhD thesis defense (committee) of Edwin Fourrier-Nicolaï, whose title was Poverty, inequality and redistribution: an econometric approach. While this was mainly a thesis in economics, meaning it defended some theory on inequalities based on East German data, there were Bayesian components in the thesis that justified (to some extent!) my presence in the jury. Especially around mixture estimation by Gibbs sampling. (On which I started working almost exactly 30 years ago, when I joined Paris 6 and met Gilles Celeux and Jean Diebolt.) One intriguing [for me] question stemmed from this defense, namely the notion of a Bayesian estimation of a three i’s of poverty (TIP) curve. The three i’s stand for incidence, intensity, and inequality: as introduced in Jenkins and Lambert (1997), this curve measures the average income loss from the poverty level for the 100p% lowest incomes, as p varies between 0 and 1. It thus depends on the distribution F of the incomes, and when using a mixture distribution its computation requires a numerical cdf inversion to determine the p-th income quantile. A related question is thus how to define a Bayesian estimate of the TIP curve. Using an average over the values of an MCMC sample does not sound absolutely satisfactory, since the upper bound in the integral varies with each realisation of the parameter. The use of another estimate would however require a specific loss function, an issue not discussed in the thesis.
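
For illustration, here is a minimal sketch (mine, not from the thesis, with made-up mixture parameters and poverty line) of how the TIP curve, namely the integral over q in (0,p) of the positive part of z − F⁻¹(q) with z the poverty line, can be computed for a two-component Gaussian mixture by numerical cdf inversion; a Bayesian treatment would repeat this computation for every MCMC draw of the mixture parameters.

import numpy as np
from scipy import stats
from scipy.optimize import brentq
from scipy.integrate import quad

# hypothetical two-component Gaussian mixture for incomes, and poverty line z
w, mu, sd = np.array([0.6, 0.4]), np.array([1200., 2500.]), np.array([400., 800.])
z = 1000.

def mix_cdf(x):
    # mixture cdf F(x)
    return float(np.sum(w * stats.norm.cdf(x, mu, sd)))

def quantile(q):
    # numerical inversion of F to get the q-th income quantile
    return brentq(lambda x: mix_cdf(x) - q, -1e4, 1e5)

def tip(p):
    # cumulated poverty gaps over the 100p% lowest incomes
    return quad(lambda q: max(z - quantile(q), 0.), 1e-6, p, limit=200)[0]

ps = np.linspace(0.01, 0.99, 25)
tip_curve = [tip(p) for p in ps]   # one such curve per parameter realisation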

admissible estimators that are not Bayes

Posted in Statistics on December 30, 2017 by xi'an

A question that popped up on X validated made me search a little while for point estimators that are both admissible (under a certain loss function) and not generalised Bayes (under the same loss function), before asking Larry Brown, Jim Berger, or Ed George. The answer came through Larry’s book on exponential families, with the two examples attached. (Following our 1989 collaboration with Roger Farrell at Cornell U, I knew about the existence of testing procedures that were both admissible and not Bayes.) The most surprising feature is that the associated loss function is strictly convex, as I would have thought that a less convex loss would have made it easier to find such counter-examples.

MAP as Bayes estimators

Posted in Books, Kids, Statistics on November 30, 2016 by xi'an

Robert Bassett and Julio Deride just arXived a paper discussing the position of MAPs within Bayesian decision theory. A point I have discussed extensively on the ‘Og!

“…we provide a counterexample to the commonly accepted notion of MAP estimators as a limit of Bayes estimators having 0-1 loss.”

The authors mention The Bayesian Choice as stating this property without further precautions, and I completely agree that I was careless in this regard! The difficulty is that the limit of the maximisers is not necessarily the maximiser of the limit. The paper includes an example to this effect, with a specific prior associated with a sampling distribution that does not depend on the parameter. The sufficient conditions proposed therein are that the posterior density is almost surely proper or quasiconcave.

This is a neat mathematical characterisation that cleans up this “folk theorem” about MAP estimators. And for which the authors are to be congratulated! However, I am not very excited by the limiting property, whether it holds or not, as I have difficulty conceiving of the use of a sequence of losses in a mildly realistic case. I rather prefer the alternate characterisation of MAP estimators by Burger and Lucka as proper Bayes estimators under another type of loss function, albeit a rather artificial one.
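
For a sense of what is at stake, here is a small numerical sketch (mine, not the authors’, on a made-up posterior) of the δ-ball Bayes estimators whose limit the MAP is supposed to be: under the loss 1{|θ−d|>δ}, the Bayes estimate maximises the posterior mass of [d−δ, d+δ]. In this regular example the maximisers do drift towards the MAP as δ shrinks; the point of the paper is that such convergence can fail without conditions of the quasiconcavity type.

import numpy as np
from scipy import stats

# made-up posterior density on a grid: a sharp mode at 0 plus a broad component at 4
theta = np.linspace(-5, 10, 5001)
dens = 0.7 * stats.norm.pdf(theta, 0, 0.3) + 0.3 * stats.norm.pdf(theta, 4, 2)
step = theta[1] - theta[0]
cdf = np.concatenate(([0.0], np.cumsum(0.5 * (dens[1:] + dens[:-1]) * step)))
cdf /= cdf[-1]                                   # normalised posterior cdf on the grid

def ball_bayes(delta):
    # maximiser d of P(|theta - d| <= delta | data), evaluated on the grid
    mass = np.interp(theta + delta, theta, cdf) - np.interp(theta - delta, theta, cdf)
    return theta[np.argmax(mass)]

print("MAP:", theta[np.argmax(dens)])
for delta in (2.0, 1.0, 0.5, 0.1, 0.01):
    print(delta, ball_bayes(delta))              # approaches the MAP as delta shrinks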

ISBA 2016 [#3]

Posted in pictures, Running, Statistics, Travel, University life, Wines on June 16, 2016 by xi'an

Among the sessions I attended yesterday, I really liked the one on robustness and model misspecification. Especially the talk by Steve McEachern on Bayesian inference based on insufficient statistics, with a striking graph of the degradation of the Bayes factor as the prior variance increases. I sadly had no time to grab a picture of the graph, which compared this poor performance against a stable rendering when using a proper summary statistic. It clearly relates to our work on ABC model choice, as well as to my worries about the Bayes factor, so this explains why I am quite excited about this notion of restricted inference. In this session, Chris Holmes also summarised his two recent papers on loss-based inference, which I discussed here in a few posts, including the Statistical Science discussion Judith and I wrote recently. I also went to the j-ISBA [section] session, which was sadly under-attended, maybe due to too many parallel sessions, maybe due to the lack of a unifying statistical theme.

likelihood-free Bayesian inference on the minimum clinically important difference

Posted in Books, Statistics, University life on January 20, 2015 by xi'an

Last week, Likelihood-free Bayesian inference on the minimum clinically important difference was arXived by Nick Syring and Ryan Martin and I read it over the weekend, slowly coming to the realisation that their [meaning of] “likelihood free” was not my [meaning of] “likelihood free”, namely that it has nothing to do with ABC! The idea therein is to create a likelihood out of a loss function, in the spirit of Bissiri, Holmes and Walker, the loss being inspired here by a clinical trial concept, the minimum clinically important difference (MCID), defined as

\theta^* = \arg\min_\theta\,\mathbb{P}(Y\ne\text{sign}(X-\theta))

which defines a loss function per se when considering the empirical version. In clinical trials, Y is a binary outcome and X a vector of explanatory variables. This model-free concept avoids setting a joint distribution on the pair (X,Y), since creating a distribution on a large vector of covariates is always an issue. As marginalia, the authors actually mention our MCMC book in connection with a logistic regression (Example 7.11) and for a while I thought we had mentioned MCID therein, only realising later it was a standard description of MCMC for logistic models.

The central and interesting part of the paper is obviously defining the likelihood-free posterior as

\pi_n(\theta) \propto \exp\{-n L_n(\theta) \}\pi(\theta)

The authors manage to obtain the rate necessary for the estimation to be asymptotically consistent, which seems [to me] to mean that a better representation of the likelihood-free posterior should be

\pi_n(\theta) \propto \exp\{-n^{-2/5} L_n(\theta) \}\pi(\theta)

(even though this rescaling does not appear verbatim in the paper). This is quite an interesting application of the concept developed by Bissiri, Holmes and Walker, even though it also illustrates the difficulty of defining a specific prior, given that the minimised target above can be transformed by an arbitrary increasing function. And the mathematical difficulty in finding a rate.
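
For concreteness, here is a minimal random-walk Metropolis sketch (mine, not the authors’ code, on simulated data) targeting the Gibbs posterior above, with the empirical loss L_n(θ) taken as the proportion of observations with y_i ≠ sign(x_i − θ) and a vague normal prior; the tempering factor lam is where any rescaling of n called for by the asymptotics would enter.

import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(2.0, 1.0, n)                                        # made-up covariate
y = np.where(rng.random(n) < 1 / (1 + np.exp(-(x - 2.0))), 1, -1)  # toy binary outcomes

def loss(theta):
    # empirical loss L_n(theta): misclassification rate of sign(x - theta)
    return np.mean(y != np.sign(x - theta))

def log_target(theta, lam=n):
    # log of exp{-lam L_n(theta)} pi(theta), with a N(0, 10^2) prior on theta
    return -lam * loss(theta) - 0.5 * (theta / 10.0) ** 2

theta, chain = 0.0, []
for _ in range(10_000):
    prop = theta + 0.3 * rng.normal()              # random-walk proposal
    if np.log(rng.random()) < log_target(prop) - log_target(theta):
        theta = prop
    chain.append(theta)                            # sample from the Gibbs posterior on the MCID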

label switching in Bayesian mixture models

Posted in Books, Statistics, University life on October 31, 2014 by xi'an

A referee of our paper on approximating evidence for mixture models with Jeong Eun Lee pointed out the recent paper by Carlos Rodríguez and Stephen Walker on label switching in Bayesian mixture models: deterministic relabelling strategies. Which appeared this year in JCGS and went beyond, below or above my radar.

Label switching is an issue with mixture estimation (and other latent variable models) because mixture models are ill-posed models where part of the parameter is not identifiable. Indeed, the density of a mixture being a sum of terms

\sum_{j=1}^k \omega_j f(y|\theta_j)

the parameter (vector) of the ω’s and of the θ’s is at best identifiable up to an arbitrary permutation of the components of the above sum. In other words, “component #1 of the mixture” is not a meaningful concept. And hence cannot be estimated.

This problem has been known for quite a while, much prior to EM and MCMC algorithms for mixtures, but it is only since mixtures have become truly estimable by Bayesian approaches that the debate has grown on this issue. In the very early days, Jean Diebolt and I proposed ordering the components in a unique way to give them a meaning. For instance, “component #1” would then be the component with the smallest mean or the smallest weight and so on… Later, in one of my favourite X papers, with Gilles Celeux and Merrilee Hurn, we exposed the convergence issues related to the non-identifiability of mixture models, namely that the posterior distributions were almost always multimodal, with a multiple of k! symmetric modes in the case of exchangeable priors, and therefore that Markov chains would have trouble visiting all those modes in a symmetric manner, despite the symmetry being guaranteed by the shape of the posterior. And we concluded with the slightly provocative statement that hardly any Markov chain inferring about mixture models had ever converged! In parallel, time-wise, Matthew Stephens had completed a thesis at Oxford on the same topic and proposed solutions for relabelling MCMC simulations in order to identify a single mode and hence produce meaningful estimators. Giving another meaning to the notion of “component #1”.

And then the topic began to attract more and more researchers, being both simple to describe and frustrating in its lack of a definitive answer, both from simulation and inference perspectives. Rodríguez and Walker’s paper provides a survey of the label switching strategies in the Bayesian processing of mixtures, but its innovative part is in deriving a relabelling strategy. Which consists of finding the optimal permutation (at each iteration of the Markov chain) by minimising a loss function inspired by k-means clustering. Which is connected with both Stephens’ and our [JASA, 2000] loss functions. The performances of this new version are shown to be roughly comparable with those of other relabelling strategies, in the case of Gaussian mixtures. (Making me wonder if the choice of the loss function is not favourable to Gaussian mixtures.) And somewhat faster than Stephens’ Kullback-Leibler loss approach.
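
As a pointer, the generic shape of such a relabelling step is easy to code (a sketch of mine, not Rodríguez and Walker’s algorithm): at each stored MCMC iteration, the drawn component means are matched to a pivot (say, the highest-posterior draw) by the permutation minimising a squared-distance cost, computed here with the Hungarian algorithm.

import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel(mu, omega, pivot):
    # mu, omega: (T, k) arrays of posterior draws of component means and weights
    # pivot: (k,) reference means, e.g. taken from the highest-posterior iteration
    mu_r, om_r = np.empty_like(mu), np.empty_like(omega)
    for t in range(mu.shape[0]):
        cost = (mu[t][:, None] - pivot[None, :]) ** 2   # k x k assignment costs
        _, cols = linear_sum_assignment(cost)           # optimal matching of draw to pivot slots
        perm = np.argsort(cols)                         # slot j receives draw component perm[j]
        mu_r[t], om_r[t] = mu[t][perm], omega[t][perm]
    return mu_r, om_r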

“Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups that make [all] ergodic averages to estimate characteristics of the components useless.”

One section of the paper puzzles me, although it does not impact the methodology or the conclusions. In Section 2.1 (p.27), the authors consider the quantity

p(z_i=j|{\mathbf y})

which is the marginal probability of allocating observation i to cluster or component j. Under an exchangeable prior, this quantity is uniformly equal to 1/k for all observations i and all components j, by virtue of the invariance of the posterior under permutation of the indices… So at best this can serve as a control variate. Later, in Section 2.2 (p.28), the sentence quoted above does signal a problem with those averages, but it seems to attribute it to MCMC behaviour rather than to the invariance of the posterior (or to the non-identifiability of the components per se). Lastly, the paper mentions that “given the allocations, the likelihood is invariant under permutations of the parameters and the allocations” (p.28), which is not correct, since eqn. (8)

f(y_i|\theta_{\sigma(z_i)}) =f(y_i|\theta_{\tau(z_i)})

does not hold when the two permutations σ and τ give different images of z_i.
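
For the record, the symmetry argument behind the 1/k value takes one line: since the posterior on the allocations is invariant under any permutation σ of the labels, p(z_i=j|y) = p(z_i=σ(j)|y), and averaging over all k! permutations gives

p(z_i=j\mid\mathbf{y}) = \frac{1}{k!}\sum_{\sigma} p(z_i=\sigma(j)\mid\mathbf{y}) = \frac{1}{k}\sum_{j'=1}^{k} p(z_i=j'\mid\mathbf{y}) = \frac{1}{k}

whatever the observation i and the component j.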

MAP or mean?!

Posted in Statistics, Travel, University life on March 5, 2014 by xi'an

“A frequent matter of debate in Bayesian inversion is the question, which of the two principle point-estimators, the maximum-a-posteriori (MAP) or the conditional mean (CM) estimate is to be preferred.”

An interesting topic for this arXived paper by Burger and Lucka that I (also) read in the plane to Montréal, even though I do not share the concern that we should pick between those two estimators (only or at all), since what matters is the posterior distribution and the use one makes of it. I thus disagree that there is any kind of “debate concerning the choice of point estimates”. If Bayesian inference reduces to producing a point estimate, this is a regularisation technique and the Bayesian interpretation is both incidental and superfluous.

Maybe the most interesting result in the paper is that the MAP is expressed as a proper Bayes estimator! I was under the opposite impression, mostly because the folklore (and even The Bayesian Core) has it that the MAP corresponds to a 0-1 loss function, an argument that does not hold for continuous parameter spaces, and also because it seems to conflict with the results of Druilhet and Marin (BA, 2007), who point out that the MAP ultimately depends on the choice of the dominating measure. (Even though the Lebesgue measure is implicitly chosen as the default.) The authors of this arXived paper start with a distance based on the prior, called the Bregman distance, which may be the quadratic or the entropy distance depending on the prior. Defining a loss function that is a mix of this Bregman distance and of the quadratic distance

||K(\hat u-u)||^2+2D_\pi(\hat u,u)

produces the MAP as the Bayes estimator. So where did the dominating measure go? In fact, nowhere: both the loss function and the resulting estimator are clearly dependent on the choice of the dominating measure… (The loss depends on the prior but this is not a drawback per se!)
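
As a reminder (this is the standard definition, not specific to the paper), the Bregman distance associated with a convex functional J, here essentially the negative log-prior, is

D_J(v,u) = J(v) - J(u) - \langle p,\, v-u\rangle\,,\qquad p\in\partial J(u)

which reduces to the usual quadratic distance when J is quadratic (Gaussian prior) and to an entropy-type distance when J is an entropy functional, consistent with the quadratic-or-entropy dichotomy mentioned above.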