## Why is it necessary to sample from the posterior distribution if we already KNOW the posterior distribution?

Posted in Statistics with tags , , , , , , , , on October 27, 2017 by xi'an

I found this question on X validated somewhat hilarious, the more because of the shouted KNOW! And the confused impression that because one can write down π(θ|x) up to a constant, one KNOWS this distribution… It is actually one of the paradoxes of simulation that, from a mathematical perspective, once π(θ|x) is available as a function of (θ,x), all other quantities related with this distribution are mathematically perfectly and uniquely defined. From a numerical perspective, this does not help. Actually, when starting my MCMC course at ENSAE a few days later, I had the same question from a student who thought facing a density function like

f(x) ∞ exp{-||x||²-||x||⁴-||x||⁶}

was enough to immediately produce simulations from this distribution. (I also used this example to show the degeneracy of accept-reject as the dimension d of x increases, using for instance a Gamma proposal on y=||x||. The acceptance probability plunges to zero with d, with 9 acceptances out of 10⁷ for d=20.)

## efficient acquisition rules for ABC

Posted in pictures, Statistics, University life with tags , , , , , , , , on June 5, 2017 by xi'an

A few weeks ago, Marko Järvenpää, Michael Gutmann, Aki Vehtari and Pekka Marttinen arXived a paper on sampling design for ABC that reminded me of presentations Michael gave at NIPS 2014 and in Banff last February. The main notion is that, when the simulation from the model is hugely expensive, random sampling does not make sense.

“While probabilistic modelling has been used to accelerate ABC inference, and strategies have been proposed for selecting which parameter to simulate next, little work has focused on trying to quantify the amount of uncertainty in the estimator of the ABC posterior density itself.”

The above question  is obviously interesting, if already considered in the literature for it seems to focus on the Monte Carlo error in ABC, addressed for instance in Fearnhead and Prangle (2012), Li and Fearnhead (2016) and our paper with David Frazier, Gael Martin, and Judith Rousseau. With corresponding conditions on the tolerance and the number of simulations to relegate Monte Carlo error to a secondary level. And the additional remark that the (error free) ABC distribution itself is not the ultimate quantity of interest. Or the equivalent (?) one that ABC is actually an exact Bayesian method on a completed space.

The paper initially confused me for a section on the very general formulation of ABC posterior approximation and error in this approximation. And simulation design for minimising this error. It confused me as it sounded too vague but only for a while as the remaining sections appear to be independent. The operational concept of the paper is to assume that the discrepancy between observed and simulated data, when perceived as a random function of the parameter θ, is a Gaussian process [over the parameter space]. This modelling allows for a prediction of the discrepancy at a new value of θ, which can be chosen as maximising the variance of the likelihood approximation. Or more precisely of the acceptance probability. While the authors report improved estimation of the exact posterior, I find no intuition as to why this should be the case when focussing on the discrepancy, especially because small discrepancies are associated with parameters approximately generated from the posterior.

## automated ABC summary combination

Posted in Books, pictures, Statistics, University life with tags , , , , , , , on March 16, 2017 by xi'an

Jonathan Harrison and Ruth Baker (Oxford University) arXived this morning a paper on the optimal combination of summaries for ABC in the sense of deriving the proper weights in an Euclidean distance involving all the available summaries. The idea is to find the weights that lead to the maximal distance between prior and posterior, in a way reminiscent of Bernardo’s (1979) maximal information principle. Plus a sparsity penalty à la Lasso. The associated algorithm is sequential in that the weights are updated at each iteration. The paper does not get into theoretical justifications but considers instead several examples with limited numbers of both parameters and summary statistics. Which may highlight the limitations of the approach in that handling (and eliminating) a large number of parameters may prove impossible this way, when compared with optimisation methods like random forests. Or summary-free distances between empirical distributions like the Wasserstein distance.

## MAP as Bayes estimators

Posted in Books, Kids, Statistics with tags , , , , on November 30, 2016 by xi'an

Robert Bassett and Julio Deride just arXived a paper discussing the position of MAPs within Bayesian decision theory. A point I have discussed extensively on the ‘Og!

“…we provide a counterexample to the commonly accepted notion of MAP estimators as a limit of Bayes estimators having 0-1 loss.”

The authors mention The Bayesian Choice stating this property without further precautions and I completely agree to being careless in this regard! The difficulty stands with the limit of the maximisers being not necessarily the maximiser of the limit. The paper includes an example to this effect, with a prior as above,  associated with a sampling distribution that does not depend on the parameter. The sufficient conditions proposed therein are that the posterior density is almost surely proper or quasiconcave.

This is a neat mathematical characterisation that cleans this “folk theorem” about MAP estimators. And for which the authors are to be congratulated! However, I am not very excited by the limiting property, whether it holds or not, as I have difficulties conceiving the use of a sequence of losses in a mildly realistic case. I rather prefer the alternate characterisation of MAP estimators by Burger and Lucka as proper Bayes estimators under another type of loss function, albeit a rather artificial one.

## drawing surface plots on the IR³ simplex

Posted in pictures, R, Statistics, University life with tags , , , , , , on October 18, 2013 by xi'an

As a result of a corridor conversation in Warwick, I started looking at distributions on the IR³ simplex,

$\{(x_1,x_2,x_3)\in\mathbb{R}_+^3;\ x_1+x_2+x_3=1\},$

and wanted to plot the density in a nice way. As I could not find a proper package on CRAN, the closer being the BMAmevt (for Bayesian Model Averaging for Multivariate Extremes) R package developed by a former TSI Master student, Anne Sabourin, I ended up programming the thing myself. And producing the picture above. Here is the code, for all it is worth:

# setting the limits
par(mar=c(0,0,0,0),bg="black")
plot(c(0,1),col="white",axes=F,xlab="",ylab="",
xlim=c(-1,1)*1.1/sqrt(2),ylim=c(-.1,sqrt(3/2))*1.1)

# density on a grid with NAs outside, as in image()
gride=matrix(NA,ncol=520,nrow=520)
ww3=ww2=seq(.01,.99,le=520)
for (i in 1:520){
cur=ww2[i];op=1-cur
for (j in 1:(521-i))
gride[i,j]=mydensity(c(cur,ww3[j],op-ww3[j]))
}

# preparing the graph
subset=(1:length(gride))[!is.na(gride)]
logride=log(gride[subset])
grida=(logride-min(logride))/diff(range(logride))
grolor=terrain.colors(250)[1+trunc(as.vector(grida)*250)]
iis=(subset%%520)+520*(subset==520)
jis=(subset%/%520)+1

# plotting the value of the (log-)density
# at each point of the grid
points(x=(ww3[jis]-ww2[iis])/sqrt(2),
y=(1-ww3[jis]-ww2[iis])/sqrt(2/3),
pch=20,col=grolor,cex=.3)


## a general framework for updating belief functions

Posted in Books, Statistics, University life with tags , , , , , , , , , on July 15, 2013 by xi'an

Pier Giovanni Bissiri, Chris Holmes and Stephen Walker have recently arXived the paper related to Sephen’s talk in London for Bayes 250. When I heard the talk (of which some slides are included below), my interest was aroused by the facts that (a) the approach they investigated could start from a statistics, rather than from a full model, with obvious implications for ABC, & (b) the starting point could be the dual to the prior x likelihood pair, namely the loss function. I thus read the paper with this in mind. (And rather quickly, which may mean I skipped important aspects. For instance, I did not get into Section 4 to any depth. Disclaimer: I wasn’t nor is a referee for this paper!)

The core idea is to stick to a Bayesian (hardcore?) line when missing the full model, i.e. the likelihood of the data, but wishing to infer about a well-defined parameter like the median of the observations. This parameter is model-free in that some degree of prior information is available in the form of a prior distribution. (This is thus the dual of frequentist inference: instead of a likelihood w/o a prior, they have a prior w/o a likelihood!) The approach in the paper is to define a “posterior” by using a functional type of loss function that balances fidelity to prior and fidelity to data. The prior part (of the loss) ends up with a Kullback-Leibler loss, while the data part (of the loss) is an expected loss wrt to l(THETASoEUR,x), ending up with the definition of a “posterior” that is

$\exp\{ -l(\theta,x)\} \pi(\theta)$

the loss thus playing the role of the log-likelihood.

I like very much the problematic developed in the paper, as I think it is connected with the real world and the complex modelling issues we face nowadays. I also like the insistence on coherence like the updating principle when switching former posterior for new prior (a point sorely missed in this book!) The distinction between M-closed M-open, and M-free scenarios is worth mentioning, if only as an entry to the Bayesian processing of pseudo-likelihood and proxy models. I am however not entirely convinced by the solution presented therein, in that it involves a rather large degree of arbitrariness. In other words, while I agree on using the loss function as a pivot for defining the pseudo-posterior, I am reluctant to put the same faith in the loss as in the log-likelihood (maybe a frequentist atavistic gene somewhere…) In particular, I think some of the choices are either hard or impossible to make and remain unprincipled (despite a call to the LP on page 7).  I also consider the M-open case as remaining unsolved as finding a convergent assessment about the pseudo-true parameter brings little information about the real parameter and the lack of fit of the superimposed model. Given my great expectations, I ended up being disappointed by the M-free case: there is no optimal choice for the substitute to the loss function that sounds very much like a pseudo-likelihood (or log thereof). (I thought the talk was more conclusive about this, I presumably missed a slide there!) Another great expectation was to read about the proper scaling of the loss function (since L and wL are difficult to separate, except for monetary losses). The authors propose a “correct” scaling based on balancing both faithfulness for a single observation, but this is not a completely tight argument (dependence on parametrisation and prior, notion of a single observation, &tc.)

The illustration section contains two examples, one of which is a full-size or at least challenging  genetic data analysis. The loss function is based on a logistic  pseudo-likelihood and it provides results where the Bayes factor is in agreement with a likelihood ratio test using Cox’ proportional hazard model. The issue about keeping the baseline function as unkown reminded me of the Robbins-Wasserman paradox Jamie discussed in Varanasi. The second example offers a nice feature of putting uncertainties onto box-plots, although I cannot trust very much the 95%  of the credibles sets. (And I do not understand why a unique loss would come to be associated with the median parameter, see p.25.)

Watch out: Tomorrow’s post contains a reply from the authors!

## ABC in 1984

Posted in Statistics with tags , , , , on November 9, 2009 by xi'an

“Bayesian statistics and Monte Carlo methods are ideally suited to the task of passing many models over one dataset” D. Rubin, Annals of Statistics, 1984

Jean-Louis Foulley sent me a 1984 paper by Don Rubin that details in no uncertain terms the accept-reject algorithm at the core of the ABC algorithm! Namely,

Generate $\theta\sim\pi(\theta)$;
Generate $x\sim f(x|\theta)$;
Accept $\theta$ if $x=x_0$

Obviously, ABC goes further by replacing the acceptance step with the tolerance condition

$d(x,x_0) < \epsilon$

but this early occurence is worth noticing nonetheless. It is also interesting to see that Don Rubin does not promote this simulation method in situations where the likelihood is not available but rather as an intuitive way to understanding posterior distributions from a frequentist perspective, because $\theta$‘s from the posterior are those that could have generated the observed data. (The issue of the zero probability of the exact equality between simulated and observed data is not dealt with in the paper, maybe because the notion of a “match” between simulated and observed data is not clearly defined.) Apart from this historical connection, I recommend the entire paper as providing a very compelling argument for practical Bayesianism!