## an attempt at EP-ABC from scratch, nothing more… [except for a few bugs]

Posted in R, Statistics, University life on October 19, 2016 by xi'an

Following a request from one of the reviewers of our chapter Likelihood-free model choice, I tried to run EP-ABC on a toy problem and to compare it with the outcome of a random forest ABC. Literally starting from scratch, namely from the description found in Simon and Nicolas’ JASA paper.  To run my test, I chose as my modelled data an exponential Exp(λ) sample of size 20, with a Gaussian N(0,1) prior on the log parameter (here replicated 100 times):

```
n=20
trueda=matrix(rexp(100*n),ncol=n)
#prior values
er0=0
Q0=1
```

Then I applied the formulas found in the paper for approximating the evidence, first defining the log normalising constant of an unnormalised Gaussian density as at the top of p.318:

```
Psi=function(er,Q){
-.5*(log(Q/2/pi)-Q*er*er)}
```

[which exhibited a typo in the paper, as Simon Barthelmé figured out after email exchanges that the right formulation was the inverse]

```
Psi=function(er,Q){
-.5*(log(Q/2/pi)-er*er/Q)}
```

and iterating the site updates as described in Sections 2.2, 3.1 and Algorithms 1 and 3:

```
ep=function(dat,M=1e4,eps,Nep=10){
n=length(dat)
#global and ith natural parameters for Gaussian approximations
#initialisation:
Qi=rep(0,n)
Q=Q0+sum(Qi)
eri=rep(0,n)
er=er0+sum(eri)
#constants Ci in (2.6)
normz=rep(1,n)
for (t in 1:Nep){
for (i in sample(1:n)){
#site i update
Qm=Q-Qi[i] #m for minus i
erm=er-eri[i]
#ABC sample
thetas=rnorm(M,mean=erm/Qm,sd=1/sqrt(Qm))
dati=rexp(M,rate=exp(thetas))
#update by formula (3.3)
keep=(abs(dati-dat[i])<eps)
Mh=sum(keep)
mu=mean(thetas[keep])
sig=var(thetas[keep])
if (Mh>1e3){
#enough proposals accepted
Qi[i]=1/sig-Qm
eri[i]=mu/sig-erm
Q=Qm+Qi[i]
er=erm+eri[i]
#update of Ci as in formula (2.7)
#with correction bottom of p.319
normz[i]=log(Mh/M/2/eps)-Psi(er,Q)+Psi(erm,Qm)
}}}
#approximation of the log evidence as on p.318
return(sum(normz)+Psi(er,Q)-Psi(er0,Q0)) }
```

except that I had initially made an R beginner's mistake (!) by writing the normal simulation as

```
thetas=rnorm(M,me=erm/Qm)/sqrt(Qm)
```

The resulting approximation is to be compared with the genuine evidence under a conjugate Exp(1) prior [the values of the evidence should differ, but should not differ that much]:

```
ev=function(dat){
n=length(dat)
lgamma(n+1)-(n+1)*log(1+sum(dat))
}
```
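As a further check of mine (not in the paper), this closed-form evidence can be recovered by brute-force Monte Carlo, averaging the exponential likelihood over draws from the Exp(1) prior:

```
#brute-force Monte Carlo check of ev(): average the likelihood
#lambda^n*exp(-lambda*sum(dat)) over draws from the Exp(1) prior
ev=function(dat){
n=length(dat)
lgamma(n+1)-(n+1)*log(1+sum(dat))}
set.seed(1)
dat=rexp(20)
lamb=rexp(1e6)
mcev=log(mean(exp(20*log(lamb)-lamb*sum(dat))))
#mcev should sit close to ev(dat)
```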

After correcting those bugs and too-small sample sizes, thanks to Simon's kind-hearted help!, running this code results in only minor discrepancies between both quantities:

```
evid=ovid=rep(0,100)
M=1e4 #number of ABC samples
propep=.1 #tolerance factor
for (t in 1:100){
#tolerance scaled by data
epsi=propep*sd(trueda[t,])
evid[t]=ep(trueda[t,],M,epsi)
ovid[t]=ev(trueda[t,])
}
plot(evid,ovid,cex=.6,pch=19)
```

as shown by the comparison below between the evidence and the EP-ABC approximations to the evidence, called epabence (the dashed line is the diagonal):

Obviously, this short (if lengthy at my scale) experiment is not intended to conclude that EP-ABC works or does not work. (Simon also provided additional comparisons with the genuine evidence, that is, under the right prior, and with the zero-Monte-Carlo version of the EP-ABC evidence, achieving a high concentration over the diagonal.) I just conclude that the method does require a certain amount of calibration to become stable. Calibration steps that are clearly spelled out in the paper (adaptive choice of M, stabilisation by quasi-random samples, stabilisation by stochastic optimisation steps and importance sampling), but that also imply that EP-ABC is “far from routine use because it may take days to tune on real problems” (pp.315-316). Thinking about the reasons for such a discrepancy over the past days, one possible explanation I see for the impact of the calibration parameters is that the validation of EP, if any, occurs at a numerical rather than Monte Carlo level, hence that the Monte Carlo error must be really small to avoid interfering with the numerical aspects.

## scalable Langevin exact algorithm

Posted in Books, Statistics, Travel, University life on October 18, 2016 by xi'an

“By employing a modification to existing naïve subsampling techniques we can obtain an algorithm which is still exact but has sub-linear iterative cost as a function of data size.”

A few weeks ago Murray Pollock, Paul Fearnhead, Adam Johansen and Gareth Roberts (all from Warwick except for Paul) arXived a paper The Scalable Langevin Exact Algorithm: Bayesian Inference for Big Data. (This was also the topic of Murray’s talk last year at JSM in Seattle.) One major advance found in the paper is the derivation of an “exact” algorithm that is sub-linear in the data size. As discussed in the introduction, the current approaches to large data problems either suffer from being approximate (like divide-and-conquer methods) or do not achieve significant reduction in the computing time, being of order O(n). The authors mention Teh and Welling (2011) and their stochastic gradient approximation to the Langevin diffusion, where the gradient is based on a subsample, without the Metropolis correction that would ensure an exact target but at a cost of order O(n). (Which makes the technique rather difficult to recommend.)
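For illustration, here is a toy sketch of mine (not the paper's algorithm) of such an unadjusted stochastic gradient Langevin update, targeting the posterior mean of a Gaussian sample under a N(0,1) prior, with the gradient estimated on a random subsample:

```
#toy stochastic gradient Langevin sketch (mine, not from the paper):
#posterior of a Gaussian mean under a N(0,1) prior, gradient estimated
#on a random subsample of m out of n observations, no Metropolis correction
set.seed(1)
n=1e4
m=100
dat=rnorm(n,mean=2)
h=1e-4 #step size, kept fixed for simplicity
theta=0
keep=numeric(5e3)
for (t in 1:5e3){
sub=sample(n,m)
grad=-theta+(n/m)*sum(dat[sub]-theta) #prior part plus rescaled subsample part
theta=theta+h/2*grad+sqrt(h)*rnorm(1)
keep[t]=theta}
#the chain hovers around the posterior mean, close to 2
```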

A novel [for me] notion at the core of this paper is the concept of quasi-stationary distribution, which is the limiting distribution of a Markov chain X[t] conditional on a Markov stopping time [being larger than t]. The approach is based on diffusions with appropriate stationary distributions like the Langevin diffusion. (Actually, as in most papers I have read and remember, the current paper only considers the Langevin diffusion.) In order to avoid the issues with unadjusted and Metropolis-adjusted Langevin schemes, a killed Brownian motion is created, which means a Brownian motion conditioned on being alive till time T, where the instantaneous killing rate is a given function of the chain, Φ(X[t]), related to the stationary measure ν of the Langevin diffusion. Under appropriate conditions, the density of this killed Brownian motion converges [in T] to √ν. Which immediately hints at creating a new Langevin diffusion targeting ν² instead of ν. And killing it with the proper rate, which can be done by thinning a Poisson process. Simulating the original path can be done by path-space rejection sampling, following the technique set out by Gareth Roberts and co-authors more than ten years ago. Based on finite-dimensional realisations of the path on [0,T]. And including the killing part can be done by importance sampling and checking that the simulated killing time is larger than the current (exponentially simulated) time.
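The thinning step can be sketched as follows (a toy sketch of mine under an assumed bounded killing rate Φ≤Λ, not the paper's exact construction): candidate events arrive from a homogeneous Poisson(Λ) process, each is accepted with probability Φ(X[t])/Λ, and the first accepted event gives the killing time:

```
#toy thinning of a Poisson process to kill a Brownian motion (my own sketch
#with an arbitrary bounded rate, not the paper's exact construction)
set.seed(42)
Lam=2 #assumed upper bound on the killing rate
Phi=function(x) x^2/(1+x^2) #arbitrary killing rate, bounded by 1<=Lam
Tend=10 #time horizon
killtime=function(){
t=0
x=0
repeat{
e=rexp(1,Lam) #gap until the next candidate killing event
x=x+rnorm(1,sd=sqrt(e)) #Brownian increment over the gap
t=t+e
if (t>Tend) return(Tend) #path survived the whole horizon
if (runif(1)<Phi(x)/Lam) return(t)}} #candidate accepted: kill here
kt=killtime()
```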

One practical difficulty in the implementation of this neat principle is the derivation of the normalising constant, whose evaluation degrades with the time horizon T. The solution adopted in the paper is a sequential Monte Carlo method, using another discretisation of the time interval [0,T] (relying on the original one would get too costly?). As for subsampling, since the survival probability for the Brownian motion is based on an unbiased estimator, subsampling does not hurt if conducted in a random manner. Although this increases the variance in principle, the use of a control variate computed just once helps in reducing the complexity to O(1).

This is a tough paper and I have not gone through the effort of trying to implement it, but this is an original and innovative construct I would like to monitor in further detail on a toy example, maybe next week while in Warwick. Or at least to discuss it with the authors.

## tractable Bayesian variable selection: beyond normality

Posted in R, Statistics, University life on October 17, 2016 by xi'an

David Rossell and Francisco Rubio (both from Warwick) arXived a month ago a paper on non-normal variable selection. They use two-piece error models that preserve manageable inference and allow for simple computational algorithms, but also characterise the behaviour of the resulting variable selection process under model misspecification. Interestingly, they show that the existence of asymmetries or heavy tails leads to power losses when using the Normal model. The two-piece error distribution is made of two halves of location-scale transforms of the same reference density on the two sides of the common location parameter. In this paper, the reference density is either Gaussian or Laplace (i.e., double exponential). In both cases the (log-)likelihood has a nice compact expression (although it does not allow for a useful sufficient statistic): one gives the L² version (Gaussian) and the other the L¹ version (Laplace). Which is the main reason for using this formalism based on only two families of parametric distributions, I presume. (As mentioned in an earlier post, I do not consider those distributions as mixtures, because the component of a given observation can always be identified. And because, as shown in the current paper, maximum likelihood estimates can be easily derived.) The prior construction follows the non-local prior principles of Johnson and Rossell (2010, 2012) also discussed in earlier posts. The construction is very detailed and hence highlights how many calibration steps are needed in the process.
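For concreteness, here is a minimal sketch (mine, with arbitrary scale values) of such a two-piece density built from a Gaussian reference:

```
#minimal two-piece density sketch (mine): the same reference density f is
#used on both sides of mu, with scales sig1 (left) and sig2 (right);
#the factor 2/(sig1+sig2) makes the density integrate to one
dtwopiece=function(x,mu=0,sig1=1,sig2=2,f=dnorm){
2/(sig1+sig2)*ifelse(x<mu,f((x-mu)/sig1),f((x-mu)/sig2))}
#check that it is a proper density: the integral should return one
integrate(dtwopiece,-Inf,Inf)$value
```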

“Bayes factor rates are the same as when the correct model is assumed [but] model misspecification often causes a decrease in the power to detect truly active variables.”

When there are too many models to compare at once, the authors propose a random walk on the finite set of models (which does not require advanced measure-theoretic tools like reversible jump MCMC). One interesting aspect is that moving away from the normal to another member of this small family is driven by the density of the data under the marginal densities, which means moving only to interesting alternatives. But also sticking to the normal only for adequate datasets. In a sense this is not extremely surprising given that the marginal likelihoods (model-wise) are available. It is also interesting that, on real datasets, one of the four models is heavily favoured against the others, be it Normal (6.3) or Laplace (6.4). And that the four-model framework returns almost identical values when compared with a single (most likely) model. Although not immensely surprising when acknowledging that the frequency of the most likely model is 0.998 and 0.998, respectively.

“Our framework represents a middle-ground to add flexibility in a parsimonious manner that remains analytically and computationally tractable, facilitating applications where either p is large or n is too moderate to fit more flexible models accurately.”

Overall, I find the experiment quite conclusive and do not object [much] to this choice of parametric family, in that it is always more general and generic than the sempiternal Gaussian model, which we picked in our Bayesian Essentials, following tradition. In a sense, it would be natural to pick the most general possible parametric family that allows for fast computations, if this notion makes any sense…

## Norman sunrise [jatp]

Posted in pictures, Running, Travel on October 16, 2016 by xi'an

## Nature highlights

Posted in Books, Kids, pictures, Statistics on October 16, 2016 by xi'an

## Only in Britain…

Posted in Kids, pictures, Travel, University life on October 15, 2016 by xi'an

A recent announcement on the University of Warwick official website:

Today the Minister for the Constitution, Chris Skidmore, presented WMG, at the University of Warwick, with a Royal Warrant signed by Her Majesty the Queen, officially conferring her Majesty’s recognition with the title of the Regius Professor of Manufacturing (Engineering).

The title of Regius Professorship is a rare and prestigious award given by Her Majesty the Queen to recognise exceptionally high quality research at an institution. The University of Warwick was one of 12 universities honoured to mark Her Majesty's 90th Birthday. Prior to this, only 14 had been granted since the reign of Queen Victoria. It is believed that the first Regius Professorship was conferred on Aberdeen University in 1497 by King James IV.

## Hastings without Metropolis

Posted in Books, Kids, Travel on October 14, 2016 by xi'an

Today marks the 950th anniversary of the battle of Hastings, when Guillaume, Duke of Normandy and pretender to the throne of England following the death of the childless King Edward the Confessor in January 1066, defeated Harold Godwinson, recently elected King and even more recently the victor of a battle against another pretender, Harald Hardrada of Norway. (I had always thought that Guillaume had fought the battle the day his fleet of hundreds of Viking drakkars landed, but he arrived two weeks earlier and had time to build a fort in Hastings.) One of the consequences of this victory would be significant changes in the structure and vocabulary of the English language. [One may wonder why I am mentioning this anniversary, but being “born and raised” in the heart of Guillaume’s Norman kingdom prompted some long-lasting interest in the Norman invasion.]