## Archive for the University life Category

Posted in Books, Statistics, University life with tags , , , , , , , , , , on October 27, 2016 by xi'an

In the March 2016 issue of JASA that currently sits on my desk, there is a paper by Liang, Jim, Song and Liu on the adaptive exchange algorithm, which aims at handling posteriors for sampling distributions with intractable normalising constants. The concept behind the algorithm is the exchange principle initiated by Jesper Møller and co-authors in 2006, where an auxiliary pseudo-observation is simulated for the missing constants to vanish in a Metropolis-Hastings ratio. (The name exchangeable was introduced in a subsequent paper by Iain Murray, Zoubin Ghahramani and David MacKay, also in 2006.)

The crux of the method is to run an iteration as [where y denotes the observation]

1. Proposing a new value θ’ of the parameter from a proposal q(θ’|θ);
2. Generate a pseudo-observation z~ƒ(z|θ’);
3. Accept with probability

$\dfrac{\pi(\theta')f(y|\theta')}{\pi(\theta)f(y|\theta)}\dfrac{q(\theta|\theta')f(z|\theta)}{q(\theta'|\theta)f(z|\theta')}$

which has the appeal to cancel all normalising constants. And the repeal of requiring an exact simulation from the very distribution with the missing constant, ƒ(.|θ). Which means that in practice a finite number of MCMC steps will be used and will bias the outcome. The algorithm is unusual in that it replaces the exact proposal q(θ’|θ) with an unbiased random version q(θ’|θ)ƒ(z|θ’), z being just an augmentation of the proposal. (The current JASA paper by Liang et al. seems to confuse augment and argument, see p.378.)

To avoid the difficulty in simulating from ƒ(.|θ), the authors draw pseudo-observations from sampling distributions with a finite number m of parameter values under the [unrealistic] assumption (A⁰) that this collection of values provides an almost complete cover of the posterior support. One of the tricks stands with an auxiliary [time-heterogeneous] chain of pseudo-observations generated by single Metropolis steps from one of these m fixed targets. These pseudo-observations are then used in the main (or target) chain to define the above exchange probability. The auxiliary chain is Markov but time-heterogeneous since the probabilities of accepting a move are evolving with time according to a simulated annealing schedule. Which produces a convergent estimate of the m normalising constants. The main chain is not Markov in that it depends on the whole history of the auxiliary chain [see Step 5, p.380]. Even jointly the collection of both chains is not Markov. The paper prefers to consider the process as an adaptive Markov chain. I did not check the rather intricate in details, so cannot judge of the validity of the overall algorithm; I simply note that one condition (A², p.383) is incredibly strong in that it assumes the Markov transition kernel to be Doeblin uniformly on any compact set of the calibration parameters. However, the major difficulty with this approach seems to be in its delicate calibration. From providing a reference set of m parameter values scanning the posterior support to picking transition kernels on both the parameter and the sample spaces, to properly cooling the annealing schedule [always a fun part!], there seems to be [from my armchair expert’s perspective, of course!] a wide range of opportunities for missing the target or running into zero acceptance problems. Both examples analysed in the paper, the auto-logistic and the auto-normal models, are actually of limited complexity in that they depend on a few parameters, 2 and 4 resp., and enjoy sufficient statistics, of dimensions 2 and 4 as well. Hence simulating (pseudo-)realisations of those sufficient statistics should be less challenging than the original approach replicating an entire vector of thousands of dimensions.

## a Venezia [ESOBE 2016]

Posted in pictures, Running, Statistics, Travel, University life, Wines with tags , , , , , , on October 26, 2016 by xi'an

Tomorrow I am off to Venezia for three days, attending the ESOBE 2016 workshop, where ESOBE stands for European Seminar on Bayesian Econometrics. This year it is indeed taking place in Venezia, Università Ca’ Foscari, in this beautiful building on the Gran Canale, and I have been invited to give a talk. Excited to get back to this unique place, hoping the high water will not be too high to prevent getting around (at random as usual).

## positions in French universities [deadline]

Posted in Kids, Statistics, University life with tags , , , , , , , on October 21, 2016 by xi'an

For ‘Og’s readers interested in lecturer or professor positions in French universities next academic year, including a lecturer position at Paris-Dauphine in applied and computational statistics!, you need to apply for a qualification label by a national committee which strict deadline is next Tuesday, October 25, at 4pm (Paris/CET time).  (The whole procedure is exclusively in French!)

## Suffrage Science awards in maths and computing

Posted in pictures, Statistics, University life with tags , , , , , , , , , , , on October 21, 2016 by xi'an

On October 11, at Bletchley Park, the Suffrage Science awards in mathematics and computer sciences were awarded for the first time to 12 senior female researchers. Among whom three statisticians, Professor Christl Donnelly from Imperial College London, my colleague at Warwick, Jane Hutton, and my friend and co-author, Sylvia Richardson, from MRC, Cambridge University. This initiative was started by the Medical Research Council in 2011 by Suffrage Science awards for life sciences, followed in 2013 by one for engineering and physics, and this year for maths and computing. The name of the award aims to connect with the Suffragette movement of the late 19th and early 20th Centuries, which were particularly active in Britain. One peculiar aspect of this award is that the recipients are given pieces of jewellery, created for each field, pieces that they will themselves give two years later to a new recipient of their choice, and so on in an infinite regress! (Which suggests a related puzzle, namely to figure out how many years it should take until all female scientists have received the award. But since the number increases as the square of the number of years, this is not going to happen unless the field proves particularly hostile to women scientists!) This jewellery award also relates to the history of the Suffragette movement since the WPSU commissioned their own jewellery awards. A clever additional touch was that the awards were delivered on Ada Lovelace Day, October 11.

## David Cox gets the first International Prize in Statistics

Posted in pictures, Statistics, University life with tags , , , , , , , , on October 20, 2016 by xi'an

Just received an email from the IMS that Sir David Cox (Nuffield College, Oxford) has been awarded the International Prize in Statistics. As discussed earlier on the ‘Og, this prize is intended to act as the equivalent of a Nobel prize for statistics. While I still have reservations about the concept. I have none whatsoever about the nomination as David would have been my suggestion from the start. Congratulations to him for the Prize and more significantly for his massive contributions to statistics, with foundational, methodological and societal impacts! [As Peter Diggle, President of the Royal Statistical Society just pointed out, it is quite fitting that it happens on European Statistics day!]

## an attempt at EP-ABC from scratch, nothing more… [except for a few bugs]

Posted in R, Statistics, University life with tags , , , , , , on October 19, 2016 by xi'an

Following a request from one of the reviewers of our chapter Likelihood-free model choice, I tried to run EP-ABC on a toy problem and to compare it with the outcome of a random forest ABC. Literally starting from scratch, namely from the description found in Simon and Nicolas’ JASA paper.  To run my test, I chose as my modelled data an exponential Exp(λ) sample of size 20, with a Gaussian N(0,1) prior on the log parameter (here replicated 100 times):

n=20
trueda=matrix(rexp(100*n),ncol=n)
#prior values
er0=0
Q0=1


Then I applied the formulas found in the paper for approximating the evidence, first defining the log normalising constant for an unnormalised Gaussian density as on the top of p.318

Psi=function(er,Q){
-.5*(log(Q/2/pi)-Q*er*er)}


[which exhibited a typo in the paper, as Simon Barthelmé figured out after emails exchanges that the right formulation was the inverse]

Psi=function(er,Q){
-.5*(log(Q/2/pi)-er*er/Q)}


and iterating the site updates as described in Sections 2.2, 3.1 and Algorithms 1 and 3:

ep=function(dat,M=1e4,eps,Nep=10){
n=length(dat)
#global and ith natural parameters for Gaussian approximations
#initialisation:
Qi=rep(0,n)
Q=Q0+sum(Qi)
eri=rep(0,n)
er=er0+sum(eri)
#constants Ci in (2.6)
normz=rep(1,n)
for (t in 1:Nep){
for (i in sample(1:n)){
#site i update
Qm=Q-Qi[i] #m for minus i
erm=er-eri[i]
#ABC sample
thetas=rnorm(M,me=erm/Qm,sd=1/sqrt(Qm))
dati=rexp(M,rate=exp(thetas))
#update by formula (3.3)
Mh=sum((abs(dati-dat[i])<eps))
mu=mean(thetas[abs(dati-dat[i])<eps])
sig=var(thetas[abs(dati-dat[i])<eps])
if (Mh>1e3){
#enough proposals accepted
Qi[i]=1/sig-Qm
eri[i]=mu/sig-erm
Q=Qm+Qi[i]
er=erm+eri[i]
#update of Ci as in formula (2.7)
#with correction bottom of p.319
normz[i]=log(Mh/M/2/eps)- Psi(er,Q)+Psi(erm,Qm)
}}}
#approximation of the log evidence as on p.318
return(sum(normz)+Psi(er,Q)-Psi(er0,Q0)) }


except that I made an R beginner’s mistake (!) when calling the normal simulation to be

thetas=rnorm(M,me=erm/Qm)/sqrt(Qm)


to be compared with the genuine evidence under a conjugate Exp(1) prior [values of the evidence should differ but should not differ that much]

ev=function(dat){
n=length(dat)
lgamma(n+1)-(n+1)*log(1+sum(dat))
}


After correcting for those bugs and too small sample sizes, thanks to Simon kind-hearted help!, running this code results in minor discrepancies between both quantities:

evid=ovid=rep(0,100)
M=1e4 #number of ABC samples
propep=.1 #tolerance factor
for (t in 1:100){
#tolerance scaled by data
epsi=propep*sd(trueda[t,])
evid[t]=ep(trueda[t,],M,epsi)
ovid[t]=ev(trueda[t,])
}
plot(evid,ovid,cex=.6,pch=19)


as shown by the comparison below between the evidence and the EP-ABC approximations to the evidence, called epabence (the dashed line is the diagonal):

Obviously, this short (if lengthy at my scale) experiment is not intended to conclude that EP-ABC work or does not work. (Simon also provided additional comparison with the genuine evidence, that is under the right prior, and with the zero Monte-Carlo version of the EP-ABC evidence, achieving a high concentration over the diagonal.) I just conclude that the method does require a certain amount of calibration to become stable. Calibrations steps that are clearly spelled out in the paper (adaptive choice of M, stabilisation by quasi-random samples, stabilisation by stochastic optimisation steps and importance sampling), but also imply that EP-ABC is also “far from routine use because it make take days to tune on real problems” (pp.315-316). Thinking about the reasons for such discrepancy over the past days, one possible explanation I see for the impact of the calibration parameters is that the validation of EP if any occurs at a numerical rather than Monte Carlo level, hence that the Monte Carlo error must be really small to avoid interfering with the numerical aspects.

## scalable Langevin exact algorithm

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , , , on October 18, 2016 by xi'an

“By employing a modification to existing naïve subsampling techniques we can obtain an algorithm which is still exact but has sub-linear iterative cost as a function of data size.”

A few weeks ago Murray Pollock, Paul Fearnhead, Adam Johansen and Gareth Roberts (all from Warwick except for Paul) arXived a paper The Scalable Langevin Exact Algorithm: Bayesian Inference for Big Data. (This was also the topic of Murray’s talk last year at JSM in Seattle.) One major advance found in the paper is the derivation of an “exact” algorithm that is sub-linear in the data size. As discussed in the introduction, the current approaches to large data problems either suffer from being approximate (like divide-and-conquer methods) or do not achieve significant reduction in the computing time, being of order O(n). The authors mention Teh and Welling (2011) sand their tochastic gradient approximation to the Langevin diffusion, when the gradient is based on a subsample. Without the Metropolis correction that would ensure an exact target but at a cost of order O(n). (Which makes the technique rather difficult to recommend.)

A novel [for me] notion at the core of this paper is the concept of quasi-stationary distribution, which is the limiting distribution of a Markov chain X[t] conditional on a Markov stopping time [being larger than t]. The approach is based on diffusions with appropriate stationary distributions like the Langevin diffusion. (Actually, as in most papers I have read and remember, the current paper only considers the Langevin diffusion.) In order to avoid the issues with unadjusted and Metropolis-adjusted Langevin schemes, a killed Brownian motion is created, which means a Brownian motion conditional of being alive till time T when the instantaneous killing rate is a given function of the chain, Φ(X[t]), related with the stationary measure of the Langevin diffusion ν. Under appropriate conditions, the density of this killed Brownian motion converges [in T] to √ν. Which immediately hints at creating a new Langevin diffusion targeting ν² instead of ν. And killing it with the proper rate, which can be done by thinning a Poisson process. Simulating the original path can be done by path-space rejection sampling, following the technique set by Gareth Roberts and co-authors more than ten years ago. Based on finite dimensional realisations of the path on [0,T]. And including the killing part can be done by importance sampling and checking that the simulated killing time is larger than the current (exponentially simulated) time.

One practical difficulty in the implementation of this neat principle is the derivation of the normalising constant, which evaluation degrades with the time horizon T. The solution adopted in the paper is through a sequential Monte Carlo method, using another discretisation of the time interval [0,T] (relying on the original one would get too costly?). As for subsampling, since the survival probability for the Brownian motion is based on an unbiased estimator, subsampling does not hurt if conducted in a random manner. Although this increases the variance on principle, the use of a control variate computed just once helps in reducing the complexity to O(1).

This is a tough paper and I have not gone through the effort of trying to implement it, but this is an original and innovative construct I would like to monitor in further details on a toy example, maybe next week while in Warwick. Or at least to discuss it with the authors.