## likelihood-free approximate Gibbs sampling

Posted in Books, Statistics with tags , , , , , , , , on June 19, 2019 by xi'an

“Low-dimensional regression-based models are constructed for each of these conditional distributions using synthetic (simulated) parameter value and summary statistic pairs, which then permit approximate Gibbs update steps (…) synthetic datasets are not generated during each sampler iteration, thereby providing efficiencies for expensive simulator models, and only require sufficient synthetic datasets to adequately construct the full conditional models (…) Construction of the approximate conditional distributions can exploit known structures of the high-dimensional posterior, where available, to considerably reduce computational overheads”

Guilherme Souza Rodrigues, David Nott, and Scott Sisson have just arXived a paper on approximate Gibbs sampling. Since this comes a few days after we posted our own version, here are some of the differences I could spot in the paper:

1. Further references to earlier occurrences of Gibbs versions of ABC, esp. in cases when the likelihood function factorises into components and allows for summaries with lower dimensions. And even to ESP.
2. More an ABC version of Gibbs sampling that a Gibbs version of ABC in that approximations to the conditionals are first constructed and then used with no further corrections.
3. Inherently related to regression post-processing à la Beaumont et al.  (2002) in that the regression model is the start to designing an approximate full conditional, conditional on the “other” parameters and on the overall summary statistic. The construction of the approximation is far from automated. And may involve neural networks or other machine learning estimates.
4. As a consequence of the above, a preliminary ABC step to design the collection of approximate full conditionals using a single and all-purpose multidimensional summary statistic.
5. Once the approximations constructed, no further pseudo-data is generated.
6. Drawing from the approximate full conditionals is done exactly, possibly via a bootstrapped version.
7. Handling a highly complex g-and-k dynamic model with 13,140 unknown parameters, requiring a ten days simulation.

“In certain circumstances it can be seen that the likelihood-free approximate Gibbs sampler will exactly target the true partial posterior (…) In this case, then Algorithms 2 and 3 will be exact.”

Convergence and coherence are handled in the paper by setting the algorithm(s) as noisy Monte Carlo versions, à la Alquier et al., although the issue of incompatibility between the full conditionals is acknowledged, with the main reference being the finite state space analysis of Chen and Ip (2015). It thus remains unclear whether or not the Gibbs samplers that are implemented there do converge and if they do what is the significance of the stationary distribution.

## X divergence for approximate inference

Posted in Statistics with tags , , , , , , , on March 14, 2017 by xi'an Dieng et al. arXived this morning a new version of their paper on using the Χ divergence for variational inference. The Χ divergence essentially is the expectation of the squared ratio of the target distribution over the approximation, under the approximation. It is somewhat related to Expectation Propagation (EP), which aims at the Kullback-Leibler divergence between the target distribution and the approximation, under the target. And to variational Bayes, which is the same thing just the opposite way! The authors also point a link to our [adaptive] population Monte Carlo paper of 2008. (I wonder at a possible version through Wasserstein distance.)

Some of the arguments in favour of this new version of variational Bayes approximations is that (a) the support of the approximation over-estimates the posterior support; (b) it produces over-dispersed versions; (c) it relates to a well-defined and global objective function; (d) it allows for a sandwich inequality on the model evidence; (e) the function of the [approximation] parameter to be minimised is under the approximation, rather than under the target. The latest allows for a gradient-based optimisation. While one of the applications is on a Bayesian probit model applied to the Pima Indian women dataset [and will thus make James and Nicolas cringe!], the experimental assessment shows lower error rates for this and other benchmarks. Which in my opinion does not tell so much about the original Bayesian approach.

## an attempt at EP-ABC from scratch, nothing more… [except for a few bugs]

Posted in R, Statistics, University life with tags , , , , , , on October 19, 2016 by xi'an

Following a request from one of the reviewers of our chapter Likelihood-free model choice, I tried to run EP-ABC on a toy problem and to compare it with the outcome of a random forest ABC. Literally starting from scratch, namely from the description found in Simon and Nicolas’ JASA paper.  To run my test, I chose as my modelled data an exponential Exp(λ) sample of size 20, with a Gaussian N(0,1) prior on the log parameter (here replicated 100 times):

```n=20
trueda=matrix(rexp(100*n),ncol=n)
#prior values
er0=0
Q0=1
```

Then I applied the formulas found in the paper for approximating the evidence, first defining the log normalising constant for an unnormalised Gaussian density as on the top of p.318

```Psi=function(er,Q){
-.5*(log(Q/2/pi)-Q*er*er)}
```

[which exhibited a typo in the paper, as Simon Barthelmé figured out after emails exchanges that the right formulation was the inverse]

```Psi=function(er,Q){
-.5*(log(Q/2/pi)-er*er/Q)}
```

and iterating the site updates as described in Sections 2.2, 3.1 and Algorithms 1 and 3:

```ep=function(dat,M=1e4,eps,Nep=10){
n=length(dat)
#global and ith natural parameters for Gaussian approximations
#initialisation:
Qi=rep(0,n)
Q=Q0+sum(Qi)
eri=rep(0,n)
er=er0+sum(eri)
#constants Ci in (2.6)
normz=rep(1,n)
for (t in 1:Nep){
for (i in sample(1:n)){
#site i update
Qm=Q-Qi[i] #m for minus i
erm=er-eri[i]
#ABC sample
thetas=rnorm(M,me=erm/Qm,sd=1/sqrt(Qm))
dati=rexp(M,rate=exp(thetas))
#update by formula (3.3)
Mh=sum((abs(dati-dat[i])<eps))
mu=mean(thetas[abs(dati-dat[i])<eps])
sig=var(thetas[abs(dati-dat[i])<eps])
if (Mh>1e3){
#enough proposals accepted
Qi[i]=1/sig-Qm
eri[i]=mu/sig-erm
Q=Qm+Qi[i]
er=erm+eri[i]
#update of Ci as in formula (2.7)
#with correction bottom of p.319
normz[i]=log(Mh/M/2/eps)- Psi(er,Q)+Psi(erm,Qm)
}}}
#approximation of the log evidence as on p.318
return(sum(normz)+Psi(er,Q)-Psi(er0,Q0)) }
```

except that I made an R beginner’s mistake (!) when calling the normal simulation to be

```thetas=rnorm(M,me=erm/Qm)/sqrt(Qm)
```

to be compared with the genuine evidence under a conjugate Exp(1) prior [values of the evidence should differ but should not differ that much]

```ev=function(dat){
n=length(dat)
lgamma(n+1)-(n+1)*log(1+sum(dat))
}
```

After correcting for those bugs and too small sample sizes, thanks to Simon kind-hearted help!, running this code results in minor discrepancies between both quantities:

```evid=ovid=rep(0,100)
M=1e4 #number of ABC samples
propep=.1 #tolerance factor
for (t in 1:100){
#tolerance scaled by data
epsi=propep*sd(trueda[t,])
evid[t]=ep(trueda[t,],M,epsi)
ovid[t]=ev(trueda[t,])
}
plot(evid,ovid,cex=.6,pch=19)
```

as shown by the comparison below between the evidence and the EP-ABC approximations to the evidence, called epabence (the dashed line is the diagonal): Obviously, this short (if lengthy at my scale) experiment is not intended to conclude that EP-ABC work or does not work. (Simon also provided additional comparison with the genuine evidence, that is under the right prior, and with the zero Monte-Carlo version of the EP-ABC evidence, achieving a high concentration over the diagonal.) I just conclude that the method does require a certain amount of calibration to become stable. Calibrations steps that are clearly spelled out in the paper (adaptive choice of M, stabilisation by quasi-random samples, stabilisation by stochastic optimisation steps and importance sampling), but also imply that EP-ABC is also “far from routine use because it make take days to tune on real problems” (pp.315-316). Thinking about the reasons for such discrepancy over the past days, one possible explanation I see for the impact of the calibration parameters is that the validation of EP if any occurs at a numerical rather than Monte Carlo level, hence that the Monte Carlo error must be really small to avoid interfering with the numerical aspects.

## advanced computational methods for complex models in Biology [talk]

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , on September 29, 2016 by xi'an Here are the slides of the presentation I gave at the EPSRC Advanced Computational methods for complex models in Biology at University College London, last week. Introducing random forests as proper summaries for both model choice and parameter estimation (with considerable overlap with earlier slides, obviously!). The other talks of that highly interesting day on computational Biology were mostly about ancestral graphs, using Wright-Fisher diffusions for coalescents, plus a comparison of expectation-propagation and ABC on a genealogy model by Mark Beaumont and the decision theoretic approach to HMM order estimation by Chris Holmes. In addition, it gave me the opportunity to come back to the Department of Statistics at UCL more than twenty years after my previous visit, at a time when my friend Costas Goutis was still there. And to realise it had moved from its historical premises years ago. (I wonder what happened to the two staircases built to reduce frictions between Fisher and Pearson if I remember correctly…)

## approximate Bayesian inference

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , on March 23, 2016 by xi'an

Maybe it is just a coincidence, but both most recent issues of Bayesian Analysis have an article featuring approximate Bayesian inference. One is by Daniel Graham and co-authors on Approximate Bayesian Inference for Doubly Robust Estimation, while the other one is by Chris Drovandi and co-authors from QUT on Exact and Approximate Bayesian Inference for Low Integer-Valued Time Series Models with Intractable Likelihoods. The first paper has little connection with ABC. Even though it (a) uses a lot of three letter acronyms [which does not help with speed reading] and (b) relies on moment based and propensity score models. Instead, it relies on Bayesian bootstrap, which suddenly seems to me to be rather connected with empirical likelihood! Except the weights are estimated via a Dirichlet prior instead of being optimised. The approximation lies in using the bootstrap to derive a posterior predictive. I did not spot any assessment or control of the approximation effect in the paper. “Note that we are always using the full data so avoiding the need to choose a summary statistic” (p.326)

The second paper connects pMCMC with ABC. Plus pseudo-marginals on the side! And even simplified reversible jump MCMC!!! I am far from certain I got every point of the paper, though, especially the notion of dimension reduction associated with this version of reversible jump MCMC. It may mean that latent variables are integrated out in approximate (marginalised) likelihoods [as explicated in Andrieu and Roberts (2009)].

“The difference with the common ABC approach is that we match on observations one-at-a-time” (p.328)

The model that the authors study is an integer value time-series, like the INAR(p) model. Which integer support allows for a non-zero probability of exact matching between simulated and observed data. One-at-a-time as indicated in the above quote. And integer valued tolerances like ε=1 otherwise. In the case auxiliary variables are necessary, the authors resort to the alive particle filter of Jasra et al. (2013), which main point is to produce an unbiased estimate of the (possibly approximate) likelihood, to be exploited by pseudo-marginal techniques. However, unbiasedness sounds less compelling when moving to approximate methods, as illustrated by the subsequent suggestion to use a more stable estimate of the log-likelihood. In fact, when the tolerance ε is positive, the pMCMC acceptance probability looks quite close to an ABC-MCMC probability when relying on several pseudo-data simulations. Which is unbiased for the “right” approximate target. A fact that may actually holds for all ABC algorithms. One quite interesting aspect of the paper is its reflection about the advantage of pseudo-marginal techniques for RJMCMC algorithms since they allow for trans-dimension moves to be simplified, as they consider marginals on the space of interest. Up to this day, I had not realised Andrieu and Roberts (2009) had a section on this aspect… I am still unclear about the derivation of the posterior probabilities of the models under comparison, unless it is a byproduct of the RJMCMC algorithm. A last point is that, for some of the Markov models used in the paper, the pseudo observations can be produced as a random one-time move away from the current true observation, which makes life much easier for ABC and explain why exact simulations can sometimes be produced. (A side note: the authors mention on p.326 that EP is only applicable when the posterior is from an exponential family, while my understanding is that it uses an exponential family to approximate the true posterior.)

## at CIRM [#3]

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , on March 4, 2016 by xi'an Simon Barthelmé gave his mini-course on EP, with loads of details on the implementation of the method. Focussing on the EP-ABC and MCMC-EP versions today. Leaving open the difficulty of assessing to which limit EP is converging. But mentioning the potential for asynchronous EP (on which I would like to hear more). Ironically using several times a logistic regression example, if not on the Pima Indians benchmark! He also talked about approximate EP solutions that relate to consensus MCMC. With a connection to Mark Beaumont’s talk at NIPS [at the time as mine!] on the comparison with ABC. While we saw several talks on EP during this week, I am still agnostic about the potential of the approach. It certainly produces a fast proxy to the true posterior and hence can be exploited ad nauseam in inference methods based on pseudo-models like indirect inference. In conjunction with other quick and dirty approximations when available. As in ABC, it would be most useful to know how far from the (ideal) posterior distribution does the approximation stands. Machine learning approaches presumably allow for an evaluation of the predictive performances, but less so for the modelling accuracy, even with new sampling steps. [But I know nothing, I know!]

Dennis Prangle presented some on-going research on high dimension [data] ABC. Raising the question of what is the true meaning of dimension in ABC algorithms. Or of sample size. Because the inference relies on the event d(s(y),s(y’))≤ξ or on the likelihood l(θ|x). Both one-dimensional. Mentioning Iain Murray’s talk at NIPS [that I also missed]. Re-expressing as well the perspective that ABC can be seen as a missing or estimated normalising constant problem as in Bornn et al. (2015) I discussed earlier. The central idea is to use SMC to simulate a particle cloud evolving as the target tolerance ξ decreases. Which supposes a latent variable structure lurking in the background. Judith Rousseau gave her talk on non-parametric mixtures and the possibility to learn parametrically about the component weights. Starting with a rather “magic” result by Allman et al. (2009) that three repeated observations per individual, all terms in a mixture are identifiable. Maybe related to that simpler fact that mixtures of Bernoullis are not identifiable while mixtures of Binomial are identifiable, even when n=2. As “shown” in this plot made for X validated. Actually truly related because Allman et al. (2009) prove identifiability through a finite dimensional model. (I am surprised I missed this most interesting paper!) With the side condition that a mixture of p components made of r Bernoulli products is identifiable when p ≥ 2[log² r] +1, when log² is base 2-logarithm. And [x] the upper rounding. I also find most relevant this distinction between the weights and the remainder of the mixture as weights behave quite differently, hardly parameters in a sense.

## expectation-propagation from Les Houches

Posted in Books, Mountains, pictures, Statistics, University life with tags , , , , , , , , , , on February 3, 2016 by xi'an  As CHANCE book editor, I received the other day from Oxford University Press acts from an École de Physique des Houches on Statistical Physics, Optimisation, Inference, and Message-Passing Algorithms that took place there in September 30 – October 11, 2013.  While it is mostly unrelated with Statistics, and since Igor Caron already reviewed the book a year and more ago, I skimmed through the few chapters connected to my interest, from Devavrat Shah’s chapter on graphical models and belief propagation, to Andrea Montanari‘s denoising and sparse regression, including LASSO, and only read in some detail Manfred Opper’s expectation propagation chapter. This paper made me realise (or re-realise as I had presumably forgotten an earlier explanation!) that expectation propagation can be seen as a sort of variational approximation that produces by a sequence of iterations the distribution within a certain parametric (exponential) family that is the closest to the distribution of interest. By writing the Kullback-Leibler divergence the opposite way from the usual variational approximation, the solution equates the expectation of the natural sufficient statistic under both models… Another interesting aspect of this chapter is the connection with estimating normalising constants. (I noticed a slight typo on p.269 in the final form of the Kullback approximation q() to p(). 