## from statistical evidence to evidence of causality

Posted in Books, Statistics on December 24, 2013 by xi'an

I took the opportunity of having to wait a long while at a local administration today (!) to read an arXived paper by Dawid, Musio and Fienberg on the (both philosophical and practical) difficulty of establishing the probabilities of the causes of effects. The first interesting thing about the paper is that it relates to the Médiator drug scandal that took place in France in the past year and is still under trial: thanks to the investigations of a local doctor, Irène Frachon, the drug was exposed as an aggravating factor for heart disease. Or maybe the cause. The case-control study of Frachon summarises into a 2×2 table with a corrected odds ratio of 17.1. From there, the authors expose the difficulties of drawing inference about causes of effects, i.e. causality, an aspect of inference that has always puzzled me. (And the paper led me to look up the distinction between the odds ratio and the risk ratio, illustrated below.)
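As a quick illustration of that distinction, with made-up counts (not Frachon's data): the odds ratio compares the odds of disease across exposure groups, while the risk ratio compares proportions, and only the former is identifiable from a case-control design.

# hypothetical 2x2 table (made-up counts, not the Médiator data)
tab=matrix(c(40,10,60,90),nrow=2,
  dimnames=list(c("exposed","unexposed"),c("cases","controls")))
(odds.ratio=(tab[1,1]/tab[1,2])/(tab[2,1]/tab[2,2]))         # here 6
(risk.ratio=(tab[1,1]/sum(tab[1,]))/(tab[2,1]/sum(tab[2,]))) # here 4
# the two only agree approximately when the outcome is rare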

“And the conceptual and implementational difficulties that we discuss below, that beset even the simplest case of inference about causes of effects, will be hugely magnified when we wish to take additional account of such policy considerations.”

A third interesting notion in the paper is the inclusion of counterfactuals. My introduction to counterfactuals dates back to a run on the back-country roads around Ithaca, New York, when George told me about a discussion paper from Phil on that notion that he was editing for JASA, with his philosopher neighbour Steven Schwartz as a discussant. (It was a great run, presumably in the late Spring. And the best introduction I could dream of!) Now, the paper starts from the counterfactual perspective to conclude that inference is close to impossible in this setting. Within my limited understanding, I would see that as a drawback of using counterfactuals, rather than of drawing inference about causes. If the corresponding statistical model is nonidentifiable, because one of the two responses is always missing, the model seems inappropriate. I am also surprised at the notion of "sufficiency" used in the paper, since it sounds like the background information cancels the need to account for the treatment (e.g., aspirin) decision. The fourth point is the derivation of bounds on the probabilities of causation, despite everything! Quite an interesting read thus!
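For the record, the best-known bound of this kind (not necessarily the sharpest one derived in the paper) links the probability of causation PC to the risk ratio RR,

$\text{PC} \geq 1-\frac{1}{\text{RR}},$

so that, treating the corrected odds ratio of 17.1 as an approximation to the risk ratio (reasonable for a rare outcome), the lower bound already exceeds 94%.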

## drawing surface plots on the IR³ simplex

Posted in pictures, R, Statistics, University life on October 18, 2013 by xi'an

As a result of a corridor conversation in Warwick, I started looking at distributions on the IR³ simplex,

$\{(x_1,x_2,x_3)\in\mathbb{R}_+^3;\ x_1+x_2+x_3=1\},$

and wanted to plot the density in a nice way. As I could not find a proper package on CRAN, the closest being the BMAmevt (for Bayesian Model Averaging for Multivariate Extremes) R package developed by a former TSI Master student, Anne Sabourin, I ended up programming the thing myself. And producing the picture above. Here is the code, for all it is worth:

# setting the limits
par(mar=c(0,0,0,0),bg="black")
plot(c(0,1),col="white",axes=FALSE,xlab="",ylab="",
  xlim=c(-1,1)*1.1/sqrt(2),ylim=c(-.1,sqrt(3/2))*1.1)

# density on a grid, with NAs outside the simplex, as in image()
# (mydensity() is the user-supplied target density on the simplex)
gride=matrix(NA,ncol=520,nrow=520)
ww3=ww2=seq(.01,.99,le=520)
for (i in 1:520){
  cur=ww2[i];op=1-cur
  for (j in 1:(521-i))
    gride[i,j]=mydensity(c(cur,ww3[j],op-ww3[j]))
}

# preparing the graph
subset=(1:length(gride))[!is.na(gride)]
logride=log(gride[subset])
grida=(logride-min(logride))/diff(range(logride))
grolor=terrain.colors(250)[1+trunc(as.vector(grida)*249)]
# recover row (iis) and column (jis) indices from the linear indices
iis=(subset-1)%%520+1
jis=(subset-1)%/%520+1

# plotting the value of the (log-)density
# at each point of the grid
points(x=(ww3[jis]-ww2[iis])/sqrt(2),
  y=(1-ww3[jis]-ww2[iis])/sqrt(2/3),
  pch=20,col=grolor,cex=.3)
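Note that mydensity() is not defined in the above: it stands for whichever density one wants to draw. A minimal stand-in, assuming an (unnormalised) Dirichlet target with purely illustrative (2,3,4) parameters:

# possible stand-in for mydensity(): unnormalised Dirichlet(2,3,4) density,
# taking a length-3 weight vector on the simplex as argument
mydensity=function(w){
  if (any(w<=0)) return(NA)
  exp(sum((c(2,3,4)-1)*log(w)))
}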


## a general framework for updating belief functions [reply from the authors]

Posted in Statistics, University life on July 16, 2013 by xi'an

Here is the reply by Chris and Steve about my comments from yesterday:

Thanks to Christian for the comments and feedback on our paper "A General Framework for Updating Belief Distributions". We agree with Christian that starting with a summary statistic, or statistics, is an anchor for inference or learning, providing direction and guidance for models, and avoiding the alternative, vague notion of attempting to model a complete data set. The latter idea has dominated Bayesian methodology for decades, but with the advent of large and complex data sets, this is becoming increasingly challenging, if not impossible.

However, in order to work with statistics of interest, we need to find a framework in which this direct approach can be supported by a learning strategy when the formal use of Bayes' theorem is not applicable. We achieve this in the paper for a general class of loss functions, which connect observations with a target of interest. A point raised by Christian is how arbitrary these loss functions are. We do not see this at all; for if a target has been properly identified, then the most primitive construct connecting observations informing about a target with the target itself would come in the form of a loss function. One should always be able to assess the loss of asserting a value of $\theta$ as an action in the presence of an observation x. The question to be discussed is whether loss functions are objective, as in the case of the median loss,

$l(\theta,x)=|\theta-x|$

or subjective, as in the choice between loss functions for estimating the location of a distribution: mean, median, or mode? Our work is situated in the former position.
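As a quick illustration that this loss indeed targets the median (a check on simulated data, nothing more):

# the absolute-error loss is minimised at the sample median
x=rnorm(101)
optimize(function(t) sum(abs(t-x)),interval=range(x))$minimum
median(x)  # both values should essentially coincide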

Previous work on loss functions, mostly in the classical literature, has spent a lot of space working out optimal loss functions for targets of interest. We are not really dealing with novel targets and so we can draw on the classic literature here. The work can be thought of as the Bayesian version of the M-estimator and associated ideas. In this respect we are dealing with two loss functions for updating belief distributions: one for the data, which we have just discussed, and one for the prior information, which, due to coherence principles, must be the Kullback-Leibler divergence. This raises the thorny issue of how to calibrate the two loss functions. We discuss this in the paper.

To then deal with the statistic problem mentioned at the start of this discussion, we have found a nice way to proceed, by using the loss function $l(\theta,x)=-\log f(x|\theta)$. How this loss function, combined with the use of the exponential family, can be used to estimate functionals of the type

$I=\int g(x)\,f_0(x)\, dx$

is explained in the Walker talk at Bayes 250 in London, titled "The Misspecified Bayesian": the "model" $f(x|\theta)$ is deliberately misspecified, a tool to estimate and learn about I only. The basic idea is to evaluate I by ensuring that we learn about the $\theta_0$ for which

$I=\int g(x)\,f(x|\theta_0)\, dx.$
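A sketch of why this works, assuming for illustration a natural exponential family $f(x|\theta)=\exp\{\theta g(x)-A(\theta)\}h(x)$ built on the integrand g: minimising the Kullback-Leibler divergence from $f_0$ forces the moments to match,

$\theta_0=\arg\min_\theta \text{KL}(f_0\|f(\cdot|\theta)) \Longrightarrow \int g(x)\,f_0(x)\,dx = A'(\theta_0) = \int g(x)\,f(x|\theta_0)\,dx,$

so that learning about $\theta_0$ amounts to learning about I, even though $f(x|\theta)$ is knowingly false.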

So much for the background; we would now like to pick up in more detail on three important points that you raise in your post:

1. The arbitrariness in selecting the loss function.
2. The relative weighting of loss-to-data vs. loss-to-prior.
3. The selection of the loss in the M-free case.

In the absence of complete knowledge of the data generating mechanism, i.e. outside of M-closed,

1. We believe the statistician should weigh up the relative arbitrariness in selecting a loss function targeting the statistic of interest versus the arbitrariness of selecting a misspecified model, known not to be true, for the complete data generating mechanism. There is a wealth of literature on how to select optimal loss functions that target specific statistics; e.g., Huber (2009) provides a comprehensive overview of how this should be done (see the sketch after this list). As far as we are aware, we know of no formal procedures (that do not rely on loss functions) to select a false sampling distribution $f(x|\theta)$ for the whole of x; see Key, Pericchi and Smith (1999).
2. The relative weighting of loss-to-data vs. loss-to-prior. This is an interesting open problem. Our framework shows in the absence of M-closed or the use of self-information loss that the analyst must select this weighting. In our paper we suggest some default procedures. We have nowhere claimed these were “correct”. You raise concerns regards parameterisation and we agree with you that care is needed, but many of these issues equally hold for existing “Objective” or “Default” Bayes procedures, such as unit-information priors.
3. The selection of the loss in M-free. You say "….there is no optimal choice for the substitute to the loss function…". We disagree. Our approach is to select an established loss function that directly targets the statistic of interest, and to elicit prior beliefs directly on the unknown value of this statistic. There is no notion here of a pseudo-likelihood or of what it converges to.
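For concreteness, a canonical example of a loss targeting a location statistic, as invoked in point 1 above, is Huber's; a minimal sketch, where the tuning constant k=1.345 is the conventional default rather than a value from the paper:

# Huber loss: quadratic for small errors, linear in the tails, so that
# the implied location estimate is robust to outliers
huber=function(u,k=1.345) ifelse(abs(u)<=k, u^2/2, k*(abs(u)-k/2))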

Thank you again to Christian for his critical observations!

## a general framework for updating belief functions

Posted in Books, Statistics, University life on July 15, 2013 by xi'an

Pier Giovanni Bissiri, Chris Holmes and Stephen Walker have recently arXived the paper related to Stephen's talk in London for Bayes 250. When I heard the talk (of which some slides are included below), my interest was aroused by the facts that (a) the approach they investigated could start from a statistic, rather than from a full model, with obvious implications for ABC, & (b) the starting point could be the dual to the prior x likelihood pair, namely the loss function. I thus read the paper with this in mind. (And rather quickly, which may mean I skipped important aspects. For instance, I did not get into Section 4 to any depth. Disclaimer: I was not, nor am I, a referee for this paper!)

The core idea is to stick to a Bayesian (hardcore?) line when missing the full model, i.e. the likelihood of the data, but wishing to infer about a well-defined parameter like the median of the observations. This parameter is model-free in that some degree of prior information is available in the form of a prior distribution. (This is thus the dual of frequentist inference: instead of a likelihood w/o a prior, they have a prior w/o a likelihood!) The approach in the paper is to define a "posterior" by using a functional type of loss function that balances fidelity to prior and fidelity to data. The prior part (of the loss) ends up with a Kullback-Leibler loss, while the data part (of the loss) is an expected loss with respect to $l(\theta,x)$, ending up with the definition of a "posterior" that is proportional to

$\exp\{ -l(\theta,x)\} \pi(\theta)$

the loss thus playing the role of the log-likelihood.
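As a toy illustration (my own sketch, not the authors' code), take the median as target with absolute loss $l(\theta,x)=\sum_i|\theta-x_i|$ and a unit weight on the loss, which is precisely the arbitrary calibration discussed below:

# toy pseudo-posterior for a median: exp{-sum_i |theta-x_i|} x prior,
# evaluated on a grid, with an arbitrary unit weight on the loss
set.seed(42)
x=rnorm(25,mean=2)
theta=seq(0,4,le=500)
logpost=-sapply(theta,function(t) sum(abs(t-x)))+dnorm(theta,0,10,log=TRUE)
post=exp(logpost-max(logpost))
post=post/(sum(post)*(theta[2]-theta[1]))  # normalise on the grid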

I like very much the problematic developed in the paper, as I think it is connected with the real world and the complex modelling issues we face nowadays. I also like the insistence on coherence, like the updating principle when switching former posterior for new prior (a point sorely missed in this book!). The distinction between M-closed, M-open, and M-free scenarios is worth mentioning, if only as an entry to the Bayesian processing of pseudo-likelihoods and proxy models. I am however not entirely convinced by the solution presented therein, in that it involves a rather large degree of arbitrariness. In other words, while I agree on using the loss function as a pivot for defining the pseudo-posterior, I am reluctant to put the same faith in the loss as in the log-likelihood (maybe a frequentist atavistic gene somewhere…). In particular, I think some of the choices are either hard or impossible to make and remain unprincipled (despite a call to the LP on page 7). I also consider the M-open case as remaining unsolved, as finding a convergent assessment about the pseudo-true parameter brings little information about the real parameter and the lack of fit of the superimposed model. Given my great expectations, I ended up being disappointed by the M-free case: there is no optimal choice for the substitute to the loss function, which sounds very much like a pseudo-likelihood (or the log thereof). (I thought the talk was more conclusive about this; I presumably missed a slide there!) Another great expectation was to read about the proper scaling of the loss function (since L and wL are difficult to separate, except for monetary losses). The authors propose a "correct" scaling based on balancing both losses for a single observation, but this is not a completely tight argument (dependence on parametrisation and prior, notion of a single observation, &tc.)

The illustration section contains two examples, one of which is a full-size, or at least challenging, genetic data analysis. The loss function is based on a logistic pseudo-likelihood and it provides results where the Bayes factor is in agreement with a likelihood ratio test using Cox's proportional hazards model. The issue about keeping the baseline function unknown reminded me of the Robbins-Wasserman paradox Jamie discussed in Varanasi. The second example offers a nice feature of putting uncertainties onto box-plots, although I cannot put much trust in the 95% coverage of the credible sets. (And I do not understand why a unique loss would come to be associated with the median parameter, see p.25.)

Watch out: Tomorrow’s post contains a reply from the authors!

## a partial review of BISP8 [guest post]

Posted in Statistics, Travel, University life on June 17, 2013 by xi'an

Chris Drovandi (QUT) sent me his impressions of BISP8, which just took place in Milano, Italia (BISP stands for Bayesian inference in stochastic processes):

Here is a review of some of the talks at BISP8. For the other talks, I do not have sufficient background to do them the justice they deserve. It was a very enjoyable small workshop with many talks in my areas of interest.

In the first session, Vanja Dukic presented Bayesian inference for SEIR epidemic differential equation models and state-space models of Google Flu Trends data. In the case of the state-space models, a particle learning algorithm was developed. The author considered both fixed and random effects for the data in each US state. In the second session, Murali Haran presented a likelihood-free approach for inferring the parameters of a spatio-temporal epidemic model. The speaker used a Gaussian process emulator of the model, based on model simulations over a regular grid of parameter values. The emulator approach is suggested to be less intensive in terms of the number of model simulations than ABC, but is only suitable for low-dimensional inference problems (even more so than ABC).

In the first session of day 2, Ana Palacios combined the Gompertz model with Markov processes to create flexible and realistic stochastic growth models. The resulting model has a difficult likelihood, and inference was performed both by completing the likelihood, creating simple Gibbs moves, and by ABC.

There were three talks in a row on inference for SDEs. The first, by Simon Särkkä, avoids evaluating an intractable transition density by proposing from another diffusion model and computing importance weights using the Girsanov theorem. Next, Samuel Kou used a population MCMC type approach where each chain had a different Euler discretisation. This helps improve mixing for the chain with the finest grid. Moves between chains are complicated by the different dimension of each chain; the author used a filling approach to overcome this. A very interesting aspect of the talk was using information from all chains to extrapolate various posterior quantiles to a zero discretisation step (no discretisation, implying the correct posterior). I assume the extrapolation may not work as well for the extreme quantiles. The third talk, by Andrew Golightly, proposed an auxiliary approach to improve PMCMC for these models. This talk was the most technical (for me), so I need more time to digest it. Following my talk (based on some work here, and some current work) was an applied talk using SMC² methodology.
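The extrapolation idea can be mimicked in a few lines (with invented numbers, obviously not the speaker's results):

# invented posterior medians computed at four Euler step sizes
dt=c(.2,.1,.05,.025)
q50=c(1.32,1.21,1.16,1.13)
lm(q50~dt)$coef[1]  # linear extrapolation to a zero step size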

On the final day, Alexandros Beskos investigated the use of SMC for Bayesian inference on a high-dimensional (static) parameter. SMC is advocated here due to the ease of adaptation relative to MCMC when there is no structure in the model. The basis of the approach, I believe, was that of Chopin (2002).

## Bayes 250 in London

Posted in Books, pictures, Statistics, Travel, University life on March 20, 2013 by xi'an

The two-day Bayes 250 Conference at the Royal Statistical Society is now officially announced, with the complete programme and the registration form available on the RSS website. A mix of eighteen junior and senior speakers covers the thematic and geographical spectra of UK Bayesian statistics. (It would be difficult not to acknowledge the top position of the United Kingdom in the list of contributions to Bayesian statistics!) Plus an interview of Dennis Lindley (pictured above, in one of the rare pictures of Dennis available on the Web) by Tony O'Hagan! Thanks to Chris Holmes for organising this exciting meeting celebrating the 1763 publication of the Essay (with me "tagging along" as a co-organiser).

Here is a blurb I wrote as a presentation (pardon my French!):

2013 marks the 250th anniversary of the publication, in December 1763, of "An Essay towards solving a Problem in the Doctrine of Chances" in the Philosophical Transactions of the Royal Society of London, based on notes by Thomas Bayes and edited by Richard Price, who submitted the Essay after Bayes' death.

This publication is acknowledged as the birth certificate of what is now called Bayesian statistics, and the Royal Statistical Society decided to celebrate this important milestone in the story of statistics (and not only UK statistics) by organising a conference on Bayesian statistics. The conference will take place at the RSS Headquarters in Errol Street and will run from June 19, late morning, to June 20, early afternoon. Everyone interested is welcome to present their work during the poster session on the afternoon of June 19.

The Royal Statistical Society is looking forward to your participation in this event and hopes you will enjoy the variety of presentations in the programme.

## who’s afraid of the big B wolf?

Posted in Books, Statistics, University life on March 13, 2013 by xi'an

Aris Spanos just published a paper entitled "Who should be afraid of the Jeffreys-Lindley paradox?" in the journal Philosophy of Science. This piece is a continuation of the debate about frequentist versus likelihoodist versus Bayesian (should it be Bayesianist?! or Laplacist?!) testing approaches, exposed in Mayo and Spanos' Error and Inference, and discussed in several posts of the 'Og. I started reading the paper in conjunction with a paper I am currently writing for a special volume in honour of Dennis Lindley, a paper that I will discuss later on the 'Og…

"…the postdata severity evaluation (…) addresses the key problem with Fisherian p-values in the sense that the severity evaluation provides the "magnitude" of the warranted discrepancy from the null by taking into account the generic capacity of the test (that includes n) in question as it relates to the observed data" (p. 88)

First, the antagonistic style of the paper reminds me of Spanos' previous works in that it relies on repeated value judgements (such as "Bayesian charge", "blatant misinterpretation", "Bayesian allegations that have undermined the credibility of frequentist statistics", "both approaches are far from immune to fallacious interpretations", "only crude rules of thumbs", &tc.) and rhetorical sleights of hand. (See, e.g., "In contrast, the severity account ensures learning from data by employing trustworthy evidence (…), the reliability of evidence being calibrated in terms of the relevant error probabilities" [my stress].) Connectedly, Spanos often resorts to an unusual [at least for statisticians] vocabulary that amounts to newspeak. Here are some illustrations: "summoning the generic capacity of the test", "substantively significant", "custom tailoring the generic capacity of the test", "the fallacy of acceptance", "the relevance of the generic capacity of the particular test"; yes, the term "generic capacity" occurs there with a truly high frequency.