**A**fter a rather extended shelf-life, our paper *expectation propagation as a way of life: a framework for Bayesian inference on partitioned data *which was started when Andrew visited Paris in… 2014!, and to which I only marginally contributed, has now appeared in JMLR! Which happens to be my very first paper in this journal.

## Archive for expectation-propagation

## Expectation Propagation as a Way of Life on-line

Posted in pictures, Statistics, University life with tags Andrew Gelman, Bayesian computation, Bayesian inference, big data, distributed Bayesian inference, Edward Hopper, expectation-propagation, gas, gas station, JMLR, Journal of Machine-Learning, New York, The Museum of Modern Art, way of life on March 18, 2020 by xi'an## 7 years later…

Posted in Statistics with tags Andrew Gelman, Bayesian computation, divide-and-conquer strategy, expectation-propagation, JMLR, Journal of Machine-Learning, Paris on February 20, 2020 by xi'an## likelihood-free approximate Gibbs sampling

Posted in Books, Statistics with tags ABC, ABC-Gibbs, ABC-within-Gibbs, curse of dimensionality, expectation-propagation, Gibbs sampling, local regression, neural network, summary statistics on June 19, 2019 by xi'an

“Low-dimensional regression-based models are constructed for each of these conditional distributions using synthetic (simulated) parameter value and summary statistic pairs, which then permit approximate Gibbs update steps (…) synthetic datasets are not generated during each sampler iteration, thereby providing efficiencies for expensive simulator models, and only require sufficient synthetic datasets to adequately construct the full conditional models (…) Construction of the approximate conditional distributions can exploit known structures of the high-dimensional posterior, where available, to considerably reduce computational overheads”

**G**uilherme Souza Rodrigues, David Nott, and Scott Sisson have just arXived a paper on approximate Gibbs sampling. Since this comes a few days after we posted our own version, here are some of the differences I could spot in the paper:

- Further references to earlier occurrences of Gibbs versions of ABC, esp. in cases when the likelihood function factorises into components and allows for summaries with lower dimensions. And even to ESP.
- More an ABC version of Gibbs sampling that a Gibbs version of ABC in that approximations to the conditionals are first constructed and then used with no further corrections.
- Inherently related to regression post-processing à la Beaumont et al. (2002) in that the regression model is the start to designing an approximate full conditional, conditional on the “other” parameters and on the overall summary statistic. The construction of the approximation is far from automated. And may involve neural networks or other machine learning estimates.
- As a consequence of the above, a preliminary ABC step to design the collection of approximate full conditionals using a single and all-purpose multidimensional summary statistic.
- Once the approximations constructed, no further pseudo-data is generated.
- Drawing from the approximate full conditionals is done exactly, possibly via a bootstrapped version.
- Handling a highly complex g-and-k dynamic model with 13,140 unknown parameters, requiring a ten days simulation.

“In certain circumstances it can be seen that the likelihood-free approximate Gibbs sampler will exactly target the true partial posterior (…) In this case, then Algorithms 2 and 3 will be exact.”

Convergence and coherence are handled in the paper by setting the algorithm(s) as noisy Monte Carlo versions, à la Alquier et al., although the issue of incompatibility between the full conditionals is acknowledged, with the main reference being the finite state space analysis of Chen and Ip (2015). It thus remains unclear whether or not the Gibbs samplers that are implemented there do converge and if they do what is the significance of the stationary distribution.

## X divergence for approximate inference

Posted in Statistics with tags adaptive importance sampling, divergence, expectation-propagation, Kullback-Leibler divergence, Pima Indians, population Monte Carlo, variational Bayes methods, Wasserstein distance on March 14, 2017 by xi'an**D**ieng et al. arXived this morning a new version of their paper on using the Χ divergence for variational inference. The Χ divergence essentially is the expectation of the squared ratio of the target distribution over the approximation, under the approximation. It is somewhat related to Expectation Propagation (EP), which aims at the Kullback-Leibler divergence between the target distribution and the approximation, under the target. And to variational Bayes, which is the same thing just the opposite way! The authors also point a link to our [adaptive] population Monte Carlo paper of 2008. (I wonder at a possible version through Wasserstein distance.)

Some of the arguments in favour of this new version of variational Bayes approximations is that (a) the support of the approximation over-estimates the posterior support; (b) it produces over-dispersed versions; (c) it relates to a well-defined and global objective function; (d) it allows for a sandwich inequality on the model evidence; (e) the function of the [approximation] parameter to be minimised is under the approximation, rather than under the target. The latest allows for a gradient-based optimisation. While one of the applications is on a Bayesian probit model applied to the Pima Indian women dataset [and will thus make James and Nicolas cringe!], the experimental assessment shows lower error rates for this and other benchmarks. Which in my opinion does not tell so much about the original Bayesian approach.

## an attempt at EP-ABC from scratch, nothing more… [except for a few bugs]

Posted in R, Statistics, University life with tags ABC, computer experiment, expectation-propagation, Gaussian approximation, implementation, Monte Carlo Statistical Methods, R on October 19, 2016 by xi'an**F**ollowing a request from one of the reviewers of our chapter Likelihood-free model choice, I tried to run EP-ABC on a toy problem and to compare it with the outcome of a random forest ABC. Literally starting from scratch, namely from the description found in Simon and Nicolas’ JASA paper. To run my test, I chose as my modelled data an exponential Exp(λ) sample of size 20, with a Gaussian N(0,1) prior on the log parameter (here replicated 100 times):

n=20 trueda=matrix(rexp(100*n),ncol=n) #prior values er0=0 Q0=1

Then I applied the formulas found in the paper for approximating the evidence, first defining the log normalising constant for an unnormalised Gaussian density as on the top of p.318

Psi=function(er,Q){ -.5*(log(Q/2/pi)-Q*er*er)}

[which exhibited a typo in the paper, as Simon Barthelmé figured out after emails exchanges that the right formulation was the inverse]

Psi=function(er,Q){ -.5*(log(Q/2/pi)-er*er/Q)}

and iterating the site updates as described in Sections 2.2, 3.1 and Algorithms 1 and 3:

ep=function(dat,M=1e4,eps,Nep=10){ n=length(dat) #global and ith natural parameters for Gaussian approximations #initialisation: Qi=rep(0,n) Q=Q0+sum(Qi) eri=rep(0,n) er=er0+sum(eri) #constants Ci in (2.6) normz=rep(1,n) for (t in 1:Nep){ for (i in sample(1:n)){ #site i update Qm=Q-Qi[i] #m for minus i erm=er-eri[i] #ABC sample thetas=rnorm(M,me=erm/Qm,sd=1/sqrt(Qm)) dati=rexp(M,rate=exp(thetas)) #update by formula (3.3) Mh=sum((abs(dati-dat[i])<eps)) mu=mean(thetas[abs(dati-dat[i])<eps]) sig=var(thetas[abs(dati-dat[i])<eps]) if (Mh>1e3){ #enough proposals accepted Qi[i]=1/sig-Qm eri[i]=mu/sig-erm Q=Qm+Qi[i] er=erm+eri[i] #update of Ci as in formula (2.7) #with correction bottom of p.319 normz[i]=log(Mh/M/2/eps)- Psi(er,Q)+Psi(erm,Qm) }}} #approximation of the log evidence as on p.318 return(sum(normz)+Psi(er,Q)-Psi(er0,Q0)) }

except that I made an R beginner’s mistake (!) when calling the normal simulation to be

thetas=rnorm(M,me=erm/Qm)/sqrt(Qm)

to be compared with the genuine evidence under a conjugate Exp(1) prior *[values of the evidence should differ but should not differ that much]*

ev=function(dat){ n=length(dat) lgamma(n+1)-(n+1)*log(1+sum(dat)) }

After correcting for those bugs and too small sample sizes, thanks to Simon kind-hearted help!, running this code results in minor discrepancies between both quantities:

evid=ovid=rep(0,100) M=1e4 #number of ABC samples propep=.1 #tolerance factor for (t in 1:100){ #tolerance scaled by data epsi=propep*sd(trueda[t,]) evid[t]=ep(trueda[t,],M,epsi) ovid[t]=ev(trueda[t,]) } plot(evid,ovid,cex=.6,pch=19)

as shown by the comparison below between the evidence and the EP-ABC approximations to the evidence, called *epabence* (the dashed line is the diagonal):

Obviously, this short (if lengthy at my scale) experiment is not intended to conclude that EP-ABC work or does not work. (Simon also provided additional comparison with the genuine evidence, that is under the right prior, and with the zero Monte-Carlo version of the EP-ABC evidence, achieving a high concentration over the diagonal.) I just conclude that the method does require a certain amount of calibration to become stable. Calibrations steps that are clearly spelled out in the paper (adaptive choice of M, stabilisation by quasi-random samples, stabilisation by stochastic optimisation steps and importance sampling), but also imply that EP-ABC is also “far from routine use because it make take days to tune on real problems” (pp.315-316). Thinking about the reasons for such discrepancy over the past days, one possible explanation I see for the impact of the calibration parameters is that the validation of EP if any occurs at a numerical rather than Monte Carlo level, hence that the Monte Carlo error must be really small to avoid interfering with the numerical aspects.