## Bayesian programming [book review]

Posted in Books, Kids, pictures, Statistics, University life with tags , , , , , , , , , , on March 3, 2014 by xi'an

“We now think the Bayesian Programming methodology and tools are reaching maturity. The goal of this book is to present them so that anyone is able to use them. We will, of course, continue to improve tools and develop new models. However, pursuing the idea that probability is an alternative to Boolean logic, we now have a new important research objective, which is to design specific hsrdware, inspired from biology, to build a Bayesian computer.”(p.xviii)

On the plane to and from Montpellier, I took an extended look at Bayesian Programming a CRC Press book recently written by Pierre Bessière, Emmanuel Mazer, Juan-Manuel Ahuactzin, and Kamel Mekhnacha. (Very nice picture of a fishing net on the cover, by the way!) Despite the initial excitement at seeing a book which final goal was to achieve a Bayesian computer, as demonstrated by the above quote, I however soon found the book too arid to read due to its highly formalised presentation… The contents are clear indications that the approach is useful as they illustrate the use of Bayesian programming in different decision-making settings, including a collection of Python codes, so it brings an answer to the what but it somehow misses the how in that the construction of the priors and the derivation of the posteriors is not explained in a way one could replicate.

“A modeling methodology is not sufficient to run Bayesian programs. We also require an efficient Bayesian inference engine to automate the probabilistic calculus. This assumes we have a collection of inference algorithms adapted and tuned to more or less specific models and a software architecture to combine them in a coherent and unique tool.” (p.9)

For instance, all models therein are described via the curly brace formalism summarised by

which quickly turns into an unpalatable object, as in this example taken from the online PhD thesis of Gabriel Synnaeve (where he applied Bayesian programming principles to a MMORPG called StarCraft and developed an AI (or bot) able to play BroodwarBotQ)

thesis that I found most interesting!

“Consequently, we have 21 × 16 = 336 bell-shaped distributions and we have 2 × 21 × 16 = 772 free parameters: 336 means and 336 standard deviations.¨(p.51)

Now, getting back to the topic of the book, I can see connections with statistical problems and models, and not only via the application of Bayes’ theorem, when the purpose (or Question) is to take a decision, for instance in a robotic action. I still remain puzzled by the purpose of the book, since it starts with very low expectations on the reader, but hurries past notions like Kalman filters and Metropolis-Hastings algorithms in a few paragraphs. I do not get some of the details, like this notion of a discretised Gaussian distribution (I eventually found the place where the 772 prior parameters are “learned” in a phase called “identification”.)

“Thanks to conditional independence the curse of dimensionality has been broken! What has been shown to be true here for the required memory space is also true for the complexity of inferences. Conditional independence is the principal tool to keep the calculation tractable. Tractability of Bayesian inference computation is of course a major concern as it has been proved NP-hard (Cooper, 1990).”(p.74)

The final chapters (Chap. 14 on “Bayesian inference algorithms revisited”, Chap. 15 on “Bayesian learning revisited” and  Chap. 16 on “Frequently asked questions and frequently argued matters” [!]) are definitely those I found easiest to read and relate to. With mentions made of conjugate priors and of the EM algorithm as a (Bayes) classifier. The final chapter mentions BUGS, Hugin and… Stan! Plus a sequence of 23 PhD theses defended on Bayesian programming for robotics in the past 20 years. And explains the authors’ views on the difference between Bayesian programming and Bayesian networks (“any Bayesian network can be represented in the Bayesian programming formalism, but the opposite is not true”, p.316), between Bayesian programming and probabilistic programming (“we do not search to extend classical languages but rather to replace them by a new programming approach based on probability”, p.319), between Bayesian programming and Bayesian modelling (“Bayesian programming goes one step further”, p.317), with a further (self-)justification of why the book sticks to discrete variables, and further more philosophical sections referring to Jaynes and the principle of maximum entropy.

“The “objectivity” of the subjectivist approach then lies in the fact that two different subjects with same preliminary knowledge and same observations will inevitably reach the same conclusions.”(p.327)

Bayesian Programming thus provides a good snapshot of (or window on) what one can achieve in uncertain environment decision-making with Bayesian techniques. It shows a long-term reflection on those notions by Pierre Bessière, his colleagues and students. The topic is most likely too remote from my own interests for the above review to be complete. Therefore, if anyone is interested in reviewing any further this book for CHANCE, before I send the above to the journal, please contact me. (Usual provisions apply.)

## finite mixture models [book review]

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , on February 17, 2014 by xi'an

Here is a review of Finite Mixture Models (2000) by Geoff McLachlan & David Peel that I wrote aeons ago (circa 1999), supposedly for JASA, which lost first the files and second the will to publish it. As I was working with my student today, I mentioned the book to her and decided to publish it here, if only because I think the book deserved a positive review, even after all those years! (Since then, Sylvia Frühwirth-Schnatter published Finite Mixture and Markov Switching Models (2004), which is closer to my perspective on the topic and that I would more naturally recommend.)

Mixture modeling, that is, the use of weighted sums of standard distributions as in

$\sum_{i=1}^k p_i f({\mathbf y};{\mathbf \theta}_i)\,,$

is a widespread and increasingly used technique to overcome the rigidity of standard parametric distributions such as f(y;θ), while retaining a parametric nature, as exposed in the introduction of my JASA review to Böhning’s (1998) book on non-parametric mixture estimation (Robert, 2000). This review pointed out that, while there are many books available on the topic of mixture estimation, the unsurpassed reference remained the book by Titterington, Smith and Makov (1985)  [hereafter TSM]. I also suggested that a new edition of TSM would be quite timely, given the methodological and computational advances that took place in the past 15 years: while it remains unclear whether or not this new edition will ever take place, the book by McLachlan and Peel gives an enjoyable and fairly exhaustive update on the topic, incorporating the most recent advances on mixtures and some related models.

Geoff McLachlan has been a major actor in the field for at least 25 years, through papers, software—the book concludes with a review of existing software—and books: McLachlan (1992), McLachlan and Basford (1988), and McLachlan and Krishnan (1997). I refer the reader to Lindsay (1989) for a review of the second book, which is a forerunner of, and has much in common with, the present book. Continue reading

## from statistical evidence to evidence of causality

Posted in Books, Statistics with tags , , , , , , , , , on December 24, 2013 by xi'an

I took the opportunity of having to wait at a local administration a long while today (!) to read an arXived paper by Dawid, Musio and Fienberg on the−both philosophical and practical−difficulty to establish the probabilities of the causes of effects. The first interesting thing about the paper is that it relates to the Médiator drug scandal that took place in France in the past year and still is under trial: thanks to the investigations of a local doctor, Irène Frachon, the drug was exposed as an aggravating factor for heart disease. Or maybe the cause. The case-control study of Frachon summarises into a 2×2 table with a corrected odds ratio of 17.1. From there, the authors expose the difficulties of drawing inference about causes of effects, i.e. causality, an aspect of inference that has always puzzled me. (And the paper led me to search for the distinction between odds ratio and risk ratio.)

“And the conceptual and implementational difficulties that we discuss below, that beset even the simplest case of inference about causes of effects, will be hugely magnified when we wish to take additional account of such policy considerations.”

A third interesting notion in the paper is the inclusion of counterfactuals. My introduction to counterfactuals dates back to a run in the back-country roads around Ithaca, New York, when George told me about a discussion paper from Phil he was editing for JASA on that notion with his philosopher neighbour Steven Schwartz as a discussant. (It was a great run, presumably in the late Spring. And the best introduction I could dream of!) Now, the paper starts from the counterfactual perspective to conclude that inference is close to impossible in this setting. Within my limited understanding, I would see that as a drawback of using counterfactuals, rather than of drawing inference about causes. If the corresponding statistical model is nonindentifiable, because one of the two responses is always missing, the model seems inappropriate. I am also surprised at the notion of “sufficiency” used in the paper, since it sounds like the background information cancels the need to account for the treatment (e.g., aspirin) decision.  The fourth point is the derivation of bounds on the probabilities of causation, despite everything! Quite an interesting read thus!

## drawing surface plots on the IR³ simplex

Posted in pictures, R, Statistics, University life with tags , , , , , , on October 18, 2013 by xi'an

As a result of a corridor conversation in Warwick, I started looking at distributions on the IR³ simplex,

$\{(x_1,x_2,x_3)\in\mathbb{R}_+^3;\ x_1+x_2+x_3=1\},$

and wanted to plot the density in a nice way. As I could not find a proper package on CRAN, the closer being the BMAmevt (for Bayesian Model Averaging for Multivariate Extremes) R package developed by a former TSI Master student, Anne Sabourin, I ended up programming the thing myself. And producing the picture above. Here is the code, for all it is worth:

# setting the limits
par(mar=c(0,0,0,0),bg="black")
plot(c(0,1),col="white",axes=F,xlab="",ylab="",
xlim=c(-1,1)*1.1/sqrt(2),ylim=c(-.1,sqrt(3/2))*1.1)

# density on a grid with NAs outside, as in image()
gride=matrix(NA,ncol=520,nrow=520)
ww3=ww2=seq(.01,.99,le=520)
for (i in 1:520){
cur=ww2[i];op=1-cur
for (j in 1:(521-i))
gride[i,j]=mydensity(c(cur,ww3[j],op-ww3[j]))
}

# preparing the graph
subset=(1:length(gride))[!is.na(gride)]
logride=log(gride[subset])
grida=(logride-min(logride))/diff(range(logride))
grolor=terrain.colors(250)[1+trunc(as.vector(grida)*250)]
iis=(subset%%520)+520*(subset==520)
jis=(subset%/%520)+1

# plotting the value of the (log-)density
# at each point of the grid
points(x=(ww3[jis]-ww2[iis])/sqrt(2),
y=(1-ww3[jis]-ww2[iis])/sqrt(2/3),
pch=20,col=grolor,cex=.3)


## a general framework for updating belief functions [reply from the authors]

Posted in Statistics, University life with tags , , , , , , on July 16, 2013 by xi'an

Thanks to Christian for the comments and feedback on our paper “A General Framework for Updating Belief Distributions“. We agree with Christian that starting with a summary statistic, or statistics, is an anchor for inference or learning, providing direction and guidance for models, avoiding the alternative vague notion of attempting to model a complete data set. The latter idea has dominated the Bayesian methodology for decades, but with the advent of large and complex data sets, this is becoming increasingly challenging, if not impossible.

However, in order to do work with statistics of interest, we need to find a framework in which this direct approach can be supported by a learning strategy when the formal use of Bayes theorem is not applicable. We achieve this in the paper for a general class of loss functions, which connect observations with a target of interest. A point raised by Christian is how arbitrary these loss functions are. We do not see this at all; for if a target has been properly identified then the most primitive construct between observations informing about a target and the target would come in the form of a loss function. One should always be able to assess the loss of ascertaining a value of $\theta$ as an action and providing the loss in the presence of observation x. The question to be discussed is whether loss functions are objective, as in the case of the median loss,

$l(\theta,x)=|\theta-x|$

or subjective, in the case of the choice between loss functions for estimating a location of a distribution; mean, median or mode? But our work is situated in the former position.

Previous work on loss functions, mostly in the classical literature, has spent a lot of space working out what are optimal loss functions for targets of interest. We are not really dealing with novel targets and so we can draw on the classic literature here. The work can be thought of as the Bayesian version of the M-estimator and associated ideas. In this respect we are dealing with two loss functions for updating belief distributions, one for the data, which we have just discussed, and one for the prior information, which, due to coherence principles, must be the Kullback-Leibler divergence. This raises the thorny issue of how to calibrate the two loss functions. We discuss this in the paper.

To then deal with the statistic problem, mentioned at the start of this discussion, we have found a nice way to proceed by using the loss function $l(\theta,x)=-\log f(x|\theta)$. How this loss function, combined with the use of the exponential family, can be used to estimate functionals of the type

$I=\int g(x)\,f_0(x)\, dx$

is provided in the Walker talk at Bayes 250 in London, titled “The Misspecified Bayesian”, since the “model” $f(x|\theta)$ is designed to be misspecified, a tool to estimate and learn about I only. The basic idea is to evaluate I by ensuring that we learn about the $\theta_0$ for which

$I=\int g(x)\,f(x|\theta_0)\, dx.$

This is the story of the background, we would now like to pick up in more detail on three important points that you raise in your post:

1. The arbitrariness in selecting the loss function.
2. The relative weighting of loss-to-data vs. loss-to-prior.
3. The selection of the loss in the M-free case.

In the absence of complete knowledge of the data generating mechanism, i.e. outside of M-closed,

1. We believe the statistician should weigh up the relative arbitrariness in selecting a loss function targeting the statistic of interest versus the arbitrariness of selecting a misspecified model, known not to be true, for the complete data generating mechanism. There is a wealth of literature on how to select optimal loss functions that target specific statistics, e.g. Hüber (2009) provides a comprehensive overview of how this should be done. As far as we are aware, we know of no formal procedures (that do not rely on loss functions) to select a false sampling distribution $f(x|\theta)$ for the whole of x; see Key, Pericchi and Smith (1999).
2. The relative weighting of loss-to-data vs. loss-to-prior. This is an interesting open problem. Our framework shows in the absence of M-closed or the use of self-information loss that the analyst must select this weighting. In our paper we suggest some default procedures. We have nowhere claimed these were “correct”. You raise concerns regards parameterisation and we agree with you that care is needed, but many of these issues equally hold for existing “Objective” or “Default” Bayes procedures, such as unit-information priors.
3. The selection of the loss in M-free. You say “….there is no optimal choice for the substitute to the loss function…”. We disagree. Our approach is to select an established loss function that directly targets the statistic of interest, and elicit prior beliefs directly on the unknown value of this statistic. There is no notion here of a a pseudo-likelihood or where this converges to.

Thank you again to Christian for his critical observations!

## a general framework for updating belief functions

Posted in Books, Statistics, University life with tags , , , , , , , , , on July 15, 2013 by xi'an

Pier Giovanni Bissiri, Chris Holmes and Stephen Walker have recently arXived the paper related to Sephen’s talk in London for Bayes 250. When I heard the talk (of which some slides are included below), my interest was aroused by the facts that (a) the approach they investigated could start from a statistics, rather than from a full model, with obvious implications for ABC, & (b) the starting point could be the dual to the prior x likelihood pair, namely the loss function. I thus read the paper with this in mind. (And rather quickly, which may mean I skipped important aspects. For instance, I did not get into Section 4 to any depth. Disclaimer: I wasn’t nor is a referee for this paper!)

The core idea is to stick to a Bayesian (hardcore?) line when missing the full model, i.e. the likelihood of the data, but wishing to infer about a well-defined parameter like the median of the observations. This parameter is model-free in that some degree of prior information is available in the form of a prior distribution. (This is thus the dual of frequentist inference: instead of a likelihood w/o a prior, they have a prior w/o a likelihood!) The approach in the paper is to define a “posterior” by using a functional type of loss function that balances fidelity to prior and fidelity to data. The prior part (of the loss) ends up with a Kullback-Leibler loss, while the data part (of the loss) is an expected loss wrt to l(THETASoEUR,x), ending up with the definition of a “posterior” that is

$\exp\{ -l(\theta,x)\} \pi(\theta)$

the loss thus playing the role of the log-likelihood.

I like very much the problematic developed in the paper, as I think it is connected with the real world and the complex modelling issues we face nowadays. I also like the insistence on coherence like the updating principle when switching former posterior for new prior (a point sorely missed in this book!) The distinction between M-closed M-open, and M-free scenarios is worth mentioning, if only as an entry to the Bayesian processing of pseudo-likelihood and proxy models. I am however not entirely convinced by the solution presented therein, in that it involves a rather large degree of arbitrariness. In other words, while I agree on using the loss function as a pivot for defining the pseudo-posterior, I am reluctant to put the same faith in the loss as in the log-likelihood (maybe a frequentist atavistic gene somewhere…) In particular, I think some of the choices are either hard or impossible to make and remain unprincipled (despite a call to the LP on page 7).  I also consider the M-open case as remaining unsolved as finding a convergent assessment about the pseudo-true parameter brings little information about the real parameter and the lack of fit of the superimposed model. Given my great expectations, I ended up being disappointed by the M-free case: there is no optimal choice for the substitute to the loss function that sounds very much like a pseudo-likelihood (or log thereof). (I thought the talk was more conclusive about this, I presumably missed a slide there!) Another great expectation was to read about the proper scaling of the loss function (since L and wL are difficult to separate, except for monetary losses). The authors propose a “correct” scaling based on balancing both faithfulness for a single observation, but this is not a completely tight argument (dependence on parametrisation and prior, notion of a single observation, &tc.)

The illustration section contains two examples, one of which is a full-size or at least challenging  genetic data analysis. The loss function is based on a logistic  pseudo-likelihood and it provides results where the Bayes factor is in agreement with a likelihood ratio test using Cox’ proportional hazard model. The issue about keeping the baseline function as unkown reminded me of the Robbins-Wasserman paradox Jamie discussed in Varanasi. The second example offers a nice feature of putting uncertainties onto box-plots, although I cannot trust very much the 95%  of the credibles sets. (And I do not understand why a unique loss would come to be associated with the median parameter, see p.25.)

Watch out: Tomorrow’s post contains a reply from the authors!

## a partial review of BISP8 [guest post]

Posted in Statistics, Travel, University life with tags , , , , , , , on June 17, 2013 by xi'an

Chris Drovandi (QUT) sent me his impression on BISP8 that just took place in Milano, Italia (BISP stands for Bayesian inference in stochastic processes):

Here is a review of some of the talks at BISP8. For the other talks I do not have sufficient background to give the talks the justice that they deserve. It was a very enjoyable small workshop with many talks in my areas of interest.

In the first session Vanja Dukic presented bayesian inference of SEIR epidemic DE models and state space models of google flu trends data. In the case of the state space models a particle learning algorithm was developed. The author considered both fixed and random effects for the data in each US state. In the second session, Murali Haran presented a likelihood-free approach for inferring the parameters of a spatio-temporal epidemic model. The speaker used a Gaussian process emulator of the model based on model simulations from a regulator grid of parameter values. The emulator approach is suggested to be less intensive in terms of the number of model simulations compared with abc but is only suitable for low dimensional inference problems (even less so than abc).

In the first session of day 2 Ana Palacios combined the gompertz model with Markov processes to create flexible and realistic stochastic growth models. The resulting model has a difficult likelihood and inference was performed by completing the likelihood creating simple Gibbs moves and by ABC.

There were 3 talks in a row on inference for SDEs. The first, by Simon Särkkä, avoids evaluating an intractable transition density by proposing from another diffusion model and computing importance weights using the girsanov theorem. Next, Samuel Kou used a population MCMC type approach where each chain had a different Euler discretisation. This helps improve mixing for the chain with the finest grid. Moves between chains are complicated by the different dimension for each chain. The author used a filling approach to overcome this. A very interesting aspect of the talk was using information from all chains to extrapolate various posterior quantiles to delta_t is 0 (no discretisation implying the correct posterior). I assume the extrapolation may not work as well for the extreme quantiles. The third talk, by Andrew Golightly, proposed an auxiliary approach to improve PMCMC for these models. This talk was the most technical (for me) so need more time to digest. Following my talk (based on some work here.  And some current work.) was an applied talk using smc2 methodology.

On the final day Alexandros Beskos investigated the use of SMC for Bayesian inference for a high dimensional (static) parameter. SMC is advocated here due to the ease of adaptation relative to MCMC when there is no structure in the model. The base of the approach I believe was that of Chopin (2002).