**J**ulie Josse contacted me for advertising a postdoc position at École Polytechnique, in Palaiseau, south of Paris. *“The fellowship is focusing on missing data. Interested graduates should apply as early as possible since the position will be filled when a suitable candidate is found. The Centre for Applied Mathematics (CMAP) is looking for highly motivated individuals able to develop a general multiple imputation method for multivariate continuous and categorical variables and its implementation in the free R software. The successful candidate will be part of research group in the statistical team on missing values. The postdoc will also have excellent opportunities to collaborate with researcher in public health with partners on the analysis of a large register from the Paris Hospital (APHP) to model the decisions and events when severe trauma patients are handled by emergency doctors. Candidates should contact Julie Josse at polytechnique.edu.”*

## Archive for R

## postdoc on missing data at École Polytechnique

Posted in Kids, pictures, R, Statistics, Travel, University life with tags École Polytechnique, Bayesian inference, CMAP, France, generalized SVD, latent variable models, matrix completion, missing data, Palaiseau, Paris, postdoctoral position, R on November 18, 2016 by xi'an## copy code at your own peril

Posted in Books, Kids, R, Statistics, University life with tags Bayesian inference, cross validated, exported code, normal model, R, xkcd on November 14, 2016 by xi'an**I** have come several times upon cases of scientists [I mean, real, recognised, publishing, senior scientists!] from other fields blindly copying MCMC code from a paper or website, and expecting the program to operate on their own problem… One illustration is from last week, when I read a X Validated question [from 2013] about an attempt of that kind, on a rather standard Normal posterior, but using an R code where the posterior function was not even defined. (I foolishly replied, despite having no expectation of a reply from the person asking the question.)

## Le Monde puzzle [#940]

Posted in Kids, Statistics, Travel, University life with tags Le Monde, mathematical puzzle, R, sudoku on November 11, 2016 by xi'an**A** sudoku-like Le Monde mathematical puzzle:

On a 3×3 grid, all integers from 1 to 9 are present. Considering all differences between adjacent entries, the value of the grid is the minimum difference. What is the maximum possible value?

In a completely uninspired approach considering random permutations on {1,..,9}, the grid value can be computed as

neigh=c(1,2,4,5,7,8,1,4,2,5,3,6)

nigh=c(2,3,5,6,8,9,4,7,5,8,6,9)

perm=sample(9)

val<-function(perm){

min(abs(perm[neigh]-perm[nigh]))}

which produces a value of 3 for the maximal value. For a 4×4 grid

neigh=c(1:3,5:7,9:11,13:15,1+4*(0:2),2+4*(0:2),3+4*(0:2),4*(1:3))

nigh=c(2:4,6:8,10:12,14:16,1+4*(1:3),2+4*(1:3),3+4*(1:3),4*(2:4))

perm=sample(16)

val<-function(perm){

min(abs(perm[neigh]-perm[nigh]))}

the code returns 5. For the representation

[,1] [,2] [,3] [,4]

[1,] 8 13 3 11

[2,] 15 4 12 5

[3,] 9 14 6 16

[4,] 2 7 1 10

## ratio-of-uniforms [#2]

Posted in Books, R, Statistics with tags Luc Devroye, mixtures of distributions, Non-Uniform Random Variate Generation, pseudo-random generator, R, ratio of uniform algorithm, slice sampler on October 31, 2016 by xi'an**F**ollowing my earlier post on Kinderman’s and Monahan’s (1977) ratio-of-uniform method, I must confess I remain quite puzzled by the approach. Or rather by its consequences. When looking at the set A of (u,v)’s in **R⁺**×**X** such that 0≤u²≤**ƒ**(v/u), as discussed in the previous post, it can be represented by its parameterised boundary

u(x)=√**ƒ**(x),v(x)=x√**ƒ**(x) x in **X**

Similarly, since the simulation from **ƒ**(v/u) can also be derived [check Luc Devroye’s Non-uniform random variate generation in the exercise section 7.3] from a uniform on the set B of (u,v)’s in **R⁺**×**X** such that 0≤u≤**ƒ**(v+u), on the set C of (u,v)’s in **R⁺**×**X** such that 0≤u³≤**ƒ**(v/√u)², or on the set D of (u,v)’s in **R⁺**×**X** such that 0≤u²≤**ƒ**(v/u), which is actually exactly the same as A *[and presumably many other versions!, for which I would like to guess the generic rule of construction]*, there are many sets on which one can consider running simulations. And one to pick for optimality?! Here are the three sets for a mixture of two normal densities:

For instance, assuming slice sampling is feasible on every one of those three sets, which one is the most efficient? While I have no clear answer to this question, I found on Sunday night that a generic family of transforms is indexed by a differentiable monotone function h over the positive half-line, with the uniform distribution being taken over the set

H={(u,v);0≤u≤h(f(v/g(u))}

when the primitive G of g is the inverse of h, i.e., G(h(x))=x. [Here are the slides I gave at the Warwick reading group on Devroye’s book last week:]

## an attempt at EP-ABC from scratch, nothing more… [except for a few bugs]

Posted in R, Statistics, University life with tags ABC, computer experiment, expectation-propagation, Gaussian approximation, implementation, Monte Carlo Statistical Methods, R on October 19, 2016 by xi'an**F**ollowing a request from one of the reviewers of our chapter Likelihood-free model choice, I tried to run EP-ABC on a toy problem and to compare it with the outcome of a random forest ABC. Literally starting from scratch, namely from the description found in Simon and Nicolas’ JASA paper. To run my test, I chose as my modelled data an exponential Exp(λ) sample of size 20, with a Gaussian N(0,1) prior on the log parameter (here replicated 100 times):

n=20 trueda=matrix(rexp(100*n),ncol=n) #prior values er0=0 Q0=1

Then I applied the formulas found in the paper for approximating the evidence, first defining the log normalising constant for an unnormalised Gaussian density as on the top of p.318

Psi=function(er,Q){ -.5*(log(Q/2/pi)-Q*er*er)}

[which exhibited a typo in the paper, as Simon Barthelmé figured out after emails exchanges that the right formulation was the inverse]

Psi=function(er,Q){ -.5*(log(Q/2/pi)-er*er/Q)}

and iterating the site updates as described in Sections 2.2, 3.1 and Algorithms 1 and 3:

ep=function(dat,M=1e4,eps,Nep=10){ n=length(dat) #global and ith natural parameters for Gaussian approximations #initialisation: Qi=rep(0,n) Q=Q0+sum(Qi) eri=rep(0,n) er=er0+sum(eri) #constants Ci in (2.6) normz=rep(1,n) for (t in 1:Nep){ for (i in sample(1:n)){ #site i update Qm=Q-Qi[i] #m for minus i erm=er-eri[i] #ABC sample thetas=rnorm(M,me=erm/Qm,sd=1/sqrt(Qm)) dati=rexp(M,rate=exp(thetas)) #update by formula (3.3) Mh=sum((abs(dati-dat[i])<eps)) mu=mean(thetas[abs(dati-dat[i])<eps]) sig=var(thetas[abs(dati-dat[i])<eps]) if (Mh>1e3){ #enough proposals accepted Qi[i]=1/sig-Qm eri[i]=mu/sig-erm Q=Qm+Qi[i] er=erm+eri[i] #update of Ci as in formula (2.7) #with correction bottom of p.319 normz[i]=log(Mh/M/2/eps)- Psi(er,Q)+Psi(erm,Qm) }}} #approximation of the log evidence as on p.318 return(sum(normz)+Psi(er,Q)-Psi(er0,Q0)) }

except that I made an R beginner’s mistake (!) when calling the normal simulation to be

thetas=rnorm(M,me=erm/Qm)/sqrt(Qm)

to be compared with the genuine evidence under a conjugate Exp(1) prior *[values of the evidence should differ but should not differ that much]*

ev=function(dat){ n=length(dat) lgamma(n+1)-(n+1)*log(1+sum(dat)) }

After correcting for those bugs and too small sample sizes, thanks to Simon kind-hearted help!, running this code results in minor discrepancies between both quantities:

evid=ovid=rep(0,100) M=1e4 #number of ABC samples propep=.1 #tolerance factor for (t in 1:100){ #tolerance scaled by data epsi=propep*sd(trueda[t,]) evid[t]=ep(trueda[t,],M,epsi) ovid[t]=ev(trueda[t,]) } plot(evid,ovid,cex=.6,pch=19)

as shown by the comparison below between the evidence and the EP-ABC approximations to the evidence, called *epabence* (the dashed line is the diagonal):

Obviously, this short (if lengthy at my scale) experiment is not intended to conclude that EP-ABC work or does not work. (Simon also provided additional comparison with the genuine evidence, that is under the right prior, and with the zero Monte-Carlo version of the EP-ABC evidence, achieving a high concentration over the diagonal.) I just conclude that the method does require a certain amount of calibration to become stable. Calibrations steps that are clearly spelled out in the paper (adaptive choice of M, stabilisation by quasi-random samples, stabilisation by stochastic optimisation steps and importance sampling), but also imply that EP-ABC is also “far from routine use because it make take days to tune on real problems” (pp.315-316). Thinking about the reasons for such discrepancy over the past days, one possible explanation I see for the impact of the calibration parameters is that the validation of EP if any occurs at a numerical rather than Monte Carlo level, hence that the Monte Carlo error must be really small to avoid interfering with the numerical aspects.

## tractable Bayesian variable selection: beyond normality

Posted in R, Statistics, University life with tags Bayesian Essentials with R, calibration, marginal density, maximum likelihood estimation, parametric family, R, two-piece error model, University of Warwick on October 17, 2016 by xi'an**D**avid Rossell and Francisco Rubio (both from Warwick) arXived a month ago a paper on non-normal variable selection. They use two-piece error models that preserve manageable inference and allow for simple computational algorithms, but also characterise the behaviour of the resulting variable selection process under model misspecification. Interestingly, they show that the existence of asymmetries or heavy tails leads to power losses when using the Normal model. The two-piece error distribution is made of two halves of location-scale transforms of the same reference density on the two sides of the common location parameter. In this paper, the density is either Gaussian or Laplace (i.e., exponential?). In both cases the (log-)likelihood has a nice compact expression (although it does not allow for a useful sufficient statistic). One is the L¹ version versus the other which is the L² version. Which is the main reason for using this formalism based on only two families of parametric distributions, I presume. (As mentioned in an earlier post, I do not consider those distributions as mixtures because the component of a given observation can always be identified. And because as shown in the current paper, maximum likelihood estimates can be easily derived.) The prior construction follows the non-local prior principles of Johnson and Rossell (2010, 2012) also discussed in earlier posts. The construction is very detailed and hence highlights how many calibration steps are needed in the process.

“Bayes factor rates are the same as when the correct model is assumed [but] model misspecification often causes a decrease in the power to detect truly active variables.”

When there are too many models to compare at once, the authors propose a random walk on the finite set of models (which does not require advanced measure-theoretic tools like reversible jump MCMC). One interesting aspect is that moving away from the normal to another member of this small family is driven by the density of the data under the marginal densities, which means moving only to interesting alternatives. But also sticking to the normal only for adequate datasets. In a sense this is not extremely surprising given that the marginal likelihoods (model-wise) are available. It is also interesting that on real datasets, one of the four models is heavily favoured against the others, be it Normal (6.3) or Laplace (6.4). And that the four model framework returns almost identical values when compared with a single (most likely) model. Although not immensely surprising when acknowledging that the frequency of the most likely model is 0.998 and 0.998, respectively.

“Our framework represents a middle-ground to add flexibility in a parsimonious manner that remains analytically and computationally tractable, facilitating applications where either p is large or n is too moderate to fit more flexible models accurately.”

Overall, I find the experiment quite conclusive and do not object [much] to this choice of parametric family in that it is always more general and generic than the sempiternal Gaussian model. That we picked in our Bayesian Essentials, following tradition. In a sense, it would be natural to pick the most general possible parametric family that allows for fast computations, if this notion does make any sense…

## grim knight [a riddle]

Posted in Kids, pictures, R with tags chess, R, self-avoiding random walk, The Riddler on October 14, 2016 by xi'anThe Riddler of this week had a riddle that is a variation of the knight tour problem, namely

“…how long is the longest path a knight can travel on a standard 8-by-8 chessboard without letting the path intersect itself?”

the riddle being then one of a self-avoiding random walk [kind]… As I could not get back to sleep last night, I spent a couple hours (!) on this riddle, programming a random walk [or more accurately, a random canter]. This is a brute-force approach in that I pick any acceptable move with the same probability and stop when there is no further move available. [The title refers to the recommendation to avoid the rim of the chessboard with a knight: “a knight on the rim is grim”…]

board=rep(1,64) curr=sample(1:64,1) board[curr]=0 cont=0 stop=TRUE while (stop){ mov=nexx(curr,board) curr=mov$mov;board=mov$boa stop=(curr>0);cont=cont+stop}

with my function *nexx* a rather clumsy 50 lines business of selecting one acceptable move from the current position *curr*. This function returns the proposed move as well as the updated board with zeros in squares already visited by the knight. Which highlights the ambiguity in the question, namely how one defines the path of a knight? For an acceptable knight move from A to B, there are two possible paths: either take two steps in one direction and one in the orthogonal direction or the opposite. I thus pick one of the two (at random) and prohibit further visits to those squares. An alternative meaning of the question could be that the line joining A to B cannot be crossed ever again, which excludes less moves (but is more cumbersome to code). Anyway, with the former interpretation of a *path*, repeating the self-avoiding moves led to a maximum of 19 moves, with one solution exhibited below. (Since (64-1)/3=21, it is conceivable that the true maximum is 20 or even 21. In the path representation below, it seems possible to include yet another move by going to (4,1) instead of (4,5). But this is apparently excluded by the square representation on the right. Why is why the path representation is somewhat confusing!)

Today, namely on October 15, I received a solution of length 21, hence covering the entire board without ever using the same square twice. It was sent to me by Paul-Henry Cournède (a geographical neighbour!) and is “obvious” once you see it. Which may be why the alternative interpretation of “path” was chosen in The Riddler. And why my rhs representation is clearly misleading!