## scaling the Gibbs posterior credible regions

Posted in Books, Statistics, University life on September 11, 2015 by xi'an

“The challenge in implementation of the Gibbs posterior is that it depends on an unspecified scale (or inverse temperature) parameter.”

A new paper by Nick Syring and Ryan Martin was arXived today on the same topic as the one I discussed last January. The setting is the same as with empirical likelihood, namely that the distribution of the data is not specified, while parameters of interest are defined via moments or, more generally, by minimising a loss function. A pseudo-likelihood can then be constructed as a substitute to the likelihood, in the spirit of Bissiri et al. (2013). It is called a “Gibbs posterior” distribution in this paper. So the “Gibbs” in the title has no link with the “Gibbs” in Gibbs sampler, since inference is conducted with respect to this pseudo-posterior. Somewhat logically (!), as n grows to infinity, the pseudo-posterior concentrates upon the pseudo-true value of θ minimising the expected loss, hence asymptotically resembles the M-estimator associated with this criterion. As I pointed out in the discussion of Bissiri et al. (2013), one major hurdle when turning a loss into a log-likelihood is that it is at best defined up to a scale factor ω. The authors choose ω so that the Gibbs posterior

$\exp\{-\omega n l_n(\theta,x) \}\pi(\theta)$

is well-calibrated, where l_n is the empirical average loss. So the Gibbs posterior is part of the matching prior collection. In practice the authors calibrate ω by an iterative stochastic optimisation process, with bootstrap on the side to evaluate coverage. They briefly consider empirical likelihood as an alternative, on a median regression example, where they show that their “Gibbs confidence intervals (…) are clearly the best” (p.12). Apart from the relevance of being “well-calibrated”, the asymptotic nature of the results, and the dependence on the parameterisation via the loss function, one may also question the possibility of using this approach in high-dimensional cases where all, or none, of the parameters are of interest.
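For intuition, here is a minimal toy sketch, with made-up data and a flat prior, of how such a calibration loop might look: the Gibbs posterior is computed on a grid under the absolute-error loss, and ω is adjusted until the bootstrap coverage of the resulting credible intervals comes close to nominal. This is only my own crude rendering of the idea, not the authors' algorithm; `gibbs_posterior`, `credible_interval` and `coverage` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_posterior(theta_grid, x, omega):
    # unnormalised Gibbs posterior exp(-omega * n * l_n(theta)) on a grid,
    # under a flat prior and the absolute-error loss
    n = len(x)
    l_n = np.abs(theta_grid[:, None] - x).mean(axis=1)
    w = np.exp(-omega * n * (l_n - l_n.min()))  # shift for numerical stability
    return w / w.sum()

def credible_interval(theta_grid, post, alpha=0.05):
    # equal-tailed credible interval read off the grid weights
    cdf = np.cumsum(post)
    return (theta_grid[np.searchsorted(cdf, alpha / 2)],
            theta_grid[np.searchsorted(cdf, 1 - alpha / 2)])

def coverage(omega, x, theta_grid, B=200):
    # bootstrap proportion of credible intervals containing the sample
    # median, used as a stand-in for the pseudo-true value
    target = np.median(x)
    hits = 0
    for _ in range(B):
        xb = rng.choice(x, size=len(x), replace=True)
        lo, hi = credible_interval(theta_grid,
                                   gibbs_posterior(theta_grid, xb, omega))
        hits += (lo <= target <= hi)
    return hits / B

x = rng.standard_normal(100)
grid = np.linspace(-1, 1, 400)
omega = 2.0
for _ in range(10):  # crude stochastic search on the scale omega
    omega *= 0.8 if coverage(omega, x, grid) < 0.95 else 1.05
```

The grid and the multiplicative update are of course the bluntest possible choices; the point is only that ω trades off concentration of the pseudo-posterior against frequentist coverage.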

## Bureau international des poids et mesures [bayésiennes?]

Posted in pictures, Statistics, Travel on June 19, 2015 by xi'an

The workshop at the BIPM on measurement uncertainty was certainly most exciting, first by its location in the Parc de Saint Cloud, in classical buildings overlooking the Seine river in a most bucolic manner… and second by its mostly Bayesian flavour. The recommendations that the workshop addressed are about revisions to the current GUM, which stands for the Guide to the Expression of Uncertainty in Measurement. The discussion centred on using a more Bayesian approach than in the earlier version, with the organisers of the workshop and leaders of the revision apparently most in favour of that move. “Knowledge-based pdfs” came into the discussion as an attractive notion since it rings a Bayesian bell, especially when associated with probability as a degree of belief and incorporating the notion of an a priori probability distribution. And propagation of errors. Or even more when mentioning the removal of frequentist validations. What I gathered from the talks is the perspective drifting away from central limit approximations to more realistic representations, calling for Monte Carlo computations. There is also a lot I did not get about conventions, codes and standards. Including a short debate about the different meanings of Monte Carlo, from simulation technique to calculation method (as for confidence intervals). And another discussion about moving the old formula for estimating a standard deviation from the Normal to the Student’s t case. A change that remains highly debatable, since the Student’s t assumption is as shaky as the Normal one. What became clear [to me] during the meeting is that a rather heated debate is currently taking place about the need for a revision, with some members of the six (?) organisations involved arguing against Bayesian or linearisation tools.

This became even clearer during our frequentist versus Bayesian session with a first talk so outrageously anti-Bayesian it was hilarious! Among other things, the notion that “fixing” the data was against the principles of physics (the speaker was a physicist), that the only randomness in a Bayesian coin tossing was coming from the prior, that the likelihood function was a subjective construct, that the definition of the posterior density was a generalisation of Bayes’ theorem [generalisation found in… Bayes’ 1763 paper then!], that objective Bayes methods were inconsistent [because Jeffreys’ prior produces an inadmissible estimator of μ²!], that the move to Bayesian principles in GUM would cost the New Zealand economy 5 billion dollars [hopefully a frequentist estimate!], &tc., &tc. The second pro-frequentist speaker was by comparison much much more reasonable, although he insisted on showing Bayesian credible intervals do not achieve a nominal frequentist coverage, using a sort of fiducial argument distinguishing x=X+ε from X=x+ε that I missed… A lack of achievement that is fine by my standards. Indeed, a frequentist confidence interval provides a coverage guarantee either for a fixed parameter (in which case the Bayesian approach achieves better coverage by constant updating) or a varying parameter (in which case the frequency of proper inclusion is of no real interest!). The first Bayesian speaker was Tony O’Hagan, who summarily tore the first talk to shreds. And also criticised GUM2 for using reference priors and maxent priors. I am afraid my talk was a bit too exploratory for the audience (since I got absolutely no questions!). In retrospect, I should have given an intro to reference priors.

An interesting specificity of a workshop on metrology and measurement is that they are hard sticklers for the schedule, starting and finishing right on time. When a talk finished early, we waited until the intended time for the next talk. Not even allowing for extra discussion. When the only overtime and Belgian speaker ran close to 10 minutes late, I was afraid he would (deservedly) get lynched! He escaped unscathed, but may (and should) not get invited again..!

## Bayesian programming [book review]

Posted in Books, Kids, pictures, Statistics, University life on March 3, 2014 by xi'an

“We now think the Bayesian Programming methodology and tools are reaching maturity. The goal of this book is to present them so that anyone is able to use them. We will, of course, continue to improve tools and develop new models. However, pursuing the idea that probability is an alternative to Boolean logic, we now have a new important research objective, which is to design specific hardware, inspired from biology, to build a Bayesian computer.” (p.xviii)

On the plane to and from Montpellier, I took an extended look at Bayesian Programming, a CRC Press book recently written by Pierre Bessière, Emmanuel Mazer, Juan-Manuel Ahuactzin, and Kamel Mekhnacha. (Very nice picture of a fishing net on the cover, by the way!) Despite the initial excitement at seeing a book whose final goal is to achieve a Bayesian computer, as demonstrated by the above quote, I soon found the book too arid to read due to its highly formalised presentation… The contents are clear indications that the approach is useful, as they illustrate the use of Bayesian programming in different decision-making settings, including a collection of Python codes, so the book answers the what; but it somehow misses the how, in that the construction of the priors and the derivation of the posteriors are not explained in a way one could replicate.

“A modeling methodology is not sufficient to run Bayesian programs. We also require an efficient Bayesian inference engine to automate the probabilistic calculus. This assumes we have a collection of inference algorithms adapted and tuned to more or less specific models and a software architecture to combine them in a coherent and unique tool.” (p.9)

For instance, all models therein are described via the curly brace formalism summarised by

which quickly turns into an unpalatable object, as in this example taken from the online PhD thesis of Gabriel Synnaeve (where he applied Bayesian programming principles to the real-time strategy game StarCraft and developed an AI, or bot, called BroodwarBotQ), a thesis that I found most interesting!

“Consequently, we have 21 × 16 = 336 bell-shaped distributions and we have 2 × 21 × 16 = 672 free parameters: 336 means and 336 standard deviations.” (p.51)

Now, getting back to the topic of the book, I can see connections with statistical problems and models, and not only via the application of Bayes’ theorem, when the purpose (or Question) is to take a decision, for instance in a robotic action. I still remain puzzled by the purpose of the book, since it starts with very low expectations on the reader, but hurries past notions like Kalman filters and Metropolis-Hastings algorithms in a few paragraphs. I do not get some of the details, like this notion of a discretised Gaussian distribution. (I eventually found the place where the 672 prior parameters are “learned”, in a phase called “identification”.)

“Thanks to conditional independence the curse of dimensionality has been broken! What has been shown to be true here for the required memory space is also true for the complexity of inferences. Conditional independence is the principal tool to keep the calculation tractable. Tractability of Bayesian inference computation is of course a major concern as it has been proved NP-hard (Cooper, 1990).” (p.74)

The final chapters (Chap. 14 on “Bayesian inference algorithms revisited”, Chap. 15 on “Bayesian learning revisited” and  Chap. 16 on “Frequently asked questions and frequently argued matters” [!]) are definitely those I found easiest to read and relate to. With mentions made of conjugate priors and of the EM algorithm as a (Bayes) classifier. The final chapter mentions BUGS, Hugin and… Stan! Plus a sequence of 23 PhD theses defended on Bayesian programming for robotics in the past 20 years. And explains the authors’ views on the difference between Bayesian programming and Bayesian networks (“any Bayesian network can be represented in the Bayesian programming formalism, but the opposite is not true”, p.316), between Bayesian programming and probabilistic programming (“we do not search to extend classical languages but rather to replace them by a new programming approach based on probability”, p.319), between Bayesian programming and Bayesian modelling (“Bayesian programming goes one step further”, p.317), with a further (self-)justification of why the book sticks to discrete variables, and further more philosophical sections referring to Jaynes and the principle of maximum entropy.

“The “objectivity” of the subjectivist approach then lies in the fact that two different subjects with same preliminary knowledge and same observations will inevitably reach the same conclusions.” (p.327)

Bayesian Programming thus provides a good snapshot of (or window on) what one can achieve in decision-making under uncertainty with Bayesian techniques. It shows a long-term reflection on those notions by Pierre Bessière, his colleagues and students. The topic is most likely too remote from my own interests for the above review to be complete. Therefore, if anyone is interested in reviewing this book any further for CHANCE, before I send the above to the journal, please contact me. (Usual provisions apply.)

## finite mixture models [book review]

Posted in Books, Kids, Statistics, University life on February 17, 2014 by xi'an

Here is a review of Finite Mixture Models (2000) by Geoff McLachlan & David Peel that I wrote aeons ago (circa 1999), supposedly for JASA, which lost first the files and second the will to publish it. As I was working with my student today, I mentioned the book to her and decided to publish it here, if only because I think the book deserved a positive review, even after all those years! (Since then, Sylvia Frühwirth-Schnatter has published Finite Mixture and Markov Switching Models (2006), which is closer to my perspective on the topic and which I would more naturally recommend.)

Mixture modeling, that is, the use of weighted sums of standard distributions as in

$\sum_{i=1}^k p_i f(\mathbf{y};\boldsymbol{\theta}_i)\,,$

is a widespread and increasingly used technique to overcome the rigidity of standard parametric distributions such as f(y;θ), while retaining a parametric nature, as exposed in the introduction of my JASA review to Böhning’s (1998) book on non-parametric mixture estimation (Robert, 2000). This review pointed out that, while there are many books available on the topic of mixture estimation, the unsurpassed reference remained the book by Titterington, Smith and Makov (1985)  [hereafter TSM]. I also suggested that a new edition of TSM would be quite timely, given the methodological and computational advances that took place in the past 15 years: while it remains unclear whether or not this new edition will ever take place, the book by McLachlan and Peel gives an enjoyable and fairly exhaustive update on the topic, incorporating the most recent advances on mixtures and some related models.
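As a concrete rendering of the weighted-sum form above, here is a short sketch evaluating a finite Gaussian mixture density on a grid; the two-component weights and parameters are purely illustrative, not taken from the book.

```python
import numpy as np

def mixture_density(y, weights, means, sds):
    # density of the finite Gaussian mixture sum_i p_i N(y; mu_i, sigma_i^2)
    y = np.asarray(y, dtype=float)
    comps = [p * np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
             for p, m, s in zip(weights, means, sds)]
    return np.sum(comps, axis=0)

# two-component example: 0.3 N(-1, 0.5^2) + 0.7 N(2, 1)
dens = mixture_density(np.linspace(-4, 6, 5), [0.3, 0.7], [-1, 2], [0.5, 1.0])
```

Since the weights sum to one, the result is itself a density: flexible in shape (skewed, multimodal) while remaining fully parametric, which is exactly the appeal noted above.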

Geoff McLachlan has been a major actor in the field for at least 25 years, through papers, software—the book concludes with a review of existing software—and books: McLachlan (1992), McLachlan and Basford (1988), and McLachlan and Krishnan (1997). I refer the reader to Lindsay (1989) for a review of the second book, which is a forerunner of, and has much in common with, the present book.

## from statistical evidence to evidence of causality

Posted in Books, Statistics on December 24, 2013 by xi'an

I took the opportunity of having to wait a long while at a local administration today (!) to read an arXived paper by Dawid, Musio and Fienberg on the (both philosophical and practical) difficulty of establishing the probabilities of the causes of effects. The first interesting thing about the paper is that it relates to the Médiator drug scandal that took place in France in the past years and is still under trial: thanks to the investigations of a local doctor, Irène Frachon, the drug was exposed as an aggravating factor for heart disease. Or maybe the cause. Frachon’s case-control study summarises into a 2×2 table with a corrected odds ratio of 17.1. From there, the authors expose the difficulties of drawing inference about causes of effects, i.e. causality, an aspect of inference that has always puzzled me. (And the paper led me to search for the distinction between an odds ratio and a risk ratio.)
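That distinction is easy to make concrete on a 2×2 table. The sketch below uses made-up counts (not Frachon's data); recall that in a case-control design only the odds ratio is directly estimable, since the sampling is conditional on the outcome.

```python
def odds_ratio(a, b, c, d):
    # 2x2 table: a = exposed cases, b = exposed controls,
    #            c = unexposed cases, d = unexposed controls
    return (a * d) / (b * c)

def risk_ratio(a, b, c, d):
    # P(case | exposed) / P(case | unexposed) -- only meaningful when the
    # table comes from cohort-type sampling, not a case-control design
    return (a / (a + b)) / (c / (c + d))

# made-up counts for illustration only
print(odds_ratio(40, 10, 30, 120))   # → 16.0
print(risk_ratio(40, 10, 30, 120))   # ≈ 4.0
```

With these hypothetical numbers the two measures differ by a factor of four: the odds ratio only approximates the risk ratio when the outcome is rare, which is one reason the distinction matters when reading a figure like 17.1.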

“And the conceptual and implementational difficulties that we discuss below, that beset even the simplest case of inference about causes of effects, will be hugely magnified when we wish to take additional account of such policy considerations.”

A third interesting notion in the paper is the inclusion of counterfactuals. My introduction to counterfactuals dates back to a run in the back-country roads around Ithaca, New York, when George told me about a discussion paper from Phil he was editing for JASA on that notion, with his philosopher neighbour Steven Schwartz as a discussant. (It was a great run, presumably in the late Spring. And the best introduction I could dream of!) Now, the paper starts from the counterfactual perspective to conclude that inference is close to impossible in this setting. Within my limited understanding, I would see that as a drawback of using counterfactuals, rather than of drawing inference about causes. If the corresponding statistical model is non-identifiable, because one of the two responses is always missing, the model seems inappropriate. I am also surprised at the notion of “sufficiency” used in the paper, since it sounds like the background information cancels the need to account for the treatment (e.g., aspirin) decision. The fourth point is the derivation of bounds on the probabilities of causation, despite everything! Quite an interesting read, then!

## drawing surface plots on the IR³ simplex

Posted in pictures, R, Statistics, University life on October 18, 2013 by xi'an

As a result of a corridor conversation in Warwick, I started looking at distributions on the IR³ simplex,

$\{(x_1,x_2,x_3)\in\mathbb{R}_+^3;\ x_1+x_2+x_3=1\},$

and wanted to plot the density in a nice way. As I could not find a proper package on CRAN, the closest being the BMAmevt (for Bayesian Model Averaging for Multivariate Extremes) R package developed by a former TSI Master’s student, Anne Sabourin, I ended up programming the thing myself. And producing the picture above. Here is the code, for all it is worth:

# setting the limits
par(mar=c(0,0,0,0),bg="black")
plot(c(0,1),col="white",axes=FALSE,xlab="",ylab="",
  xlim=c(-1,1)*1.1/sqrt(2),ylim=c(-.1,sqrt(3/2))*1.1)

# target density on the simplex; replace with your own
# (a Dirichlet(2,2,2) density is used here as a placeholder)
mydensity=function(x) 120*prod(x)

# density on a grid with NAs outside, as in image()
gride=matrix(NA,ncol=520,nrow=520)
ww3=ww2=seq(.01,.99,le=520)
for (i in 1:520){
  cur=ww2[i];op=1-cur
  for (j in 1:(521-i))
    gride[i,j]=mydensity(c(cur,ww3[j],op-ww3[j]))
}

# preparing the graph: rescale the log-density to [0,1]
# (dropping boundary points where the density vanishes)
subset=which(!is.na(gride)&gride>0)
logride=log(gride[subset])
grida=(logride-min(logride))/diff(range(logride))
grolor=terrain.colors(250)[1+trunc(as.vector(grida)*250)]
# recover (row,column) indices from the linear, column-major index
iis=((subset-1)%%520)+1
jis=((subset-1)%/%520)+1

# plotting the value of the (log-)density
# at each point of the grid
points(x=(ww3[jis]-ww2[iis])/sqrt(2),
  y=(1-ww3[jis]-ww2[iis])/sqrt(2/3),
  pch=20,col=grolor,cex=.3)


## a general framework for updating belief functions [reply from the authors]

Posted in Statistics, University life on July 16, 2013 by xi'an

Thanks to Christian for the comments and feedback on our paper “A General Framework for Updating Belief Distributions”. We agree with Christian that starting with a summary statistic, or statistics, is an anchor for inference or learning, providing direction and guidance for models, avoiding the alternative vague notion of attempting to model a complete data set. The latter idea has dominated the Bayesian methodology for decades, but with the advent of large and complex data sets, this is becoming increasingly challenging, if not impossible.

However, in order to work with statistics of interest, we need to find a framework in which this direct approach can be supported by a learning strategy when the formal use of Bayes’ theorem is not applicable. We achieve this in the paper for a general class of loss functions, which connect observations with a target of interest. A point raised by Christian is how arbitrary these loss functions are. We do not see this at all; for if a target has been properly identified, then the most primitive construct between observations informing about a target and the target itself would come in the form of a loss function. One should always be able to assess the loss of taking the value $\theta$ as an action in the presence of an observation x. The question to be discussed is whether loss functions are objective, as in the case of the median loss,

$l(\theta,x)=|\theta-x|$

or subjective, as in the choice between loss functions for estimating the location of a distribution (mean, median, or mode)? Our work is situated in the former position.
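The sense in which the median loss targets the median can be checked numerically: minimising the empirical absolute-error risk recovers the sample median, while the squared-error risk recovers the mean. A small illustrative sketch of our own (skewed data, so the two targets differ):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(size=1001)      # skewed data: mean and median differ

grid = np.linspace(0, 5, 2001)
abs_risk = np.abs(grid[:, None] - x).mean(axis=1)     # empirical |theta - x| risk
sq_risk = ((grid[:, None] - x) ** 2).mean(axis=1)     # empirical squared risk

theta_abs = grid[abs_risk.argmin()]   # minimiser of the absolute loss: the median
theta_sq = grid[sq_risk.argmin()]     # minimiser of the squared loss: the mean
```

In this sense the loss, not a subjective preference, pins down the target once the statistic of interest has been identified.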

Previous work on loss functions, mostly in the classical literature, has spent a lot of space working out what are optimal loss functions for targets of interest. We are not really dealing with novel targets and so we can draw on the classic literature here. The work can be thought of as the Bayesian version of the M-estimator and associated ideas. In this respect we are dealing with two loss functions for updating belief distributions, one for the data, which we have just discussed, and one for the prior information, which, due to coherence principles, must be the Kullback-Leibler divergence. This raises the thorny issue of how to calibrate the two loss functions. We discuss this in the paper.

To then deal with the statistic problem, mentioned at the start of this discussion, we have found a nice way to proceed by using the loss function $l(\theta,x)=-\log f(x|\theta)$. How this loss function, combined with the use of the exponential family, can be used to estimate functionals of the type

$I=\int g(x)\,f_0(x)\, dx$

is provided in the Walker talk at Bayes 250 in London, titled “The Misspecified Bayesian”, since the “model” $f(x|\theta)$ is designed to be misspecified, a tool to estimate and learn about I only. The basic idea is to evaluate I by ensuring that we learn about the $\theta_0$ for which

$I=\int g(x)\,f(x|\theta_0)\,\text{d}x.$
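A minimal numerical illustration of this mechanism (a toy example of our own, not the one from the talk): take g(x) = x² and a Gaussian working model, whose sufficient statistics include x². The maximum likelihood fit then matches the empirical moments, so the plug-in value of I under the deliberately misspecified model coincides with the empirical estimate and consistently learns I.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=10_000)   # "true" f0, with E[X^2] = 8

# misspecified working model: a Gaussian, fitted by maximum likelihood
mu_hat = x.mean()
sigma2_hat = x.var()                  # MLE of the variance (divides by n)

# the functional I = int x^2 f(x | theta_hat) dx under the fitted Gaussian
I_model = mu_hat**2 + sigma2_hat

# since (x, x^2) are sufficient statistics of the Gaussian family, the MLE
# matches the empirical moments, so I_model equals the empirical estimate
I_empirical = np.mean(x**2)
```

The Gaussian is plainly false for exponential data, yet the functional of interest is recovered: the misspecified model is only a vehicle for learning about I.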

This is the story of the background; we would now like to pick up in more detail on three important points that you raise in your post:

1. The arbitrariness in selecting the loss function.
2. The relative weighting of loss-to-data vs. loss-to-prior.
3. The selection of the loss in the M-free case.

In the absence of complete knowledge of the data generating mechanism, i.e. outside of M-closed,

1. We believe the statistician should weigh up the relative arbitrariness of selecting a loss function targeting the statistic of interest versus the arbitrariness of selecting a misspecified model, known not to be true, for the complete data-generating mechanism. There is a wealth of literature on how to select optimal loss functions that target specific statistics; e.g., Huber (2009) provides a comprehensive overview of how this should be done. As far as we are aware, we know of no formal procedures (that do not rely on loss functions) to select a false sampling distribution $f(x|\theta)$ for the whole of x; see Key, Pericchi and Smith (1999).
2. The relative weighting of loss-to-data vs. loss-to-prior. This is an interesting open problem. Our framework shows, in the absence of M-closed or the use of the self-information loss, that the analyst must select this weighting. In our paper we suggest some default procedures. We have nowhere claimed these were “correct”. You raise concerns regarding parameterisation and we agree with you that care is needed, but many of these issues equally hold for existing “Objective” or “Default” Bayes procedures, such as unit-information priors.
3. The selection of the loss in M-free. You say “….there is no optimal choice for the substitute to the loss function…”. We disagree. Our approach is to select an established loss function that directly targets the statistic of interest, and to elicit prior beliefs directly on the unknown value of this statistic. There is no notion here of a pseudo-likelihood or of what this converges to.

Thank you again to Christian for his critical observations!