## R finals

Posted in R, Statistics, University life with tags , , , , , , , , on January 31, 2013 by xi'an

On the morning I returned from Varanasi and the ISBA meeting there, I had to give my R final exam (along with three of my colleagues in Paris-Dauphine). This year, the R course was completely in English, exam included, which means I can post it here as it may attract more interest than the French examens of past years…

I just completed grading my 32 copies, all from exam A, which takes a while as I have to check (and sometimes recover) the R code, and often to correct the obvious mistakes to see if the deeper understanding of the concepts is there. This year student cohort is surprisingly homogeneous: I did not spot any of the horrors I may have mentioned in previous posts.

I must alas acknowledge a grievous typo in the version of Exam B that was used the day of the final: cutting-and-pasting from A to B, I forgot to change the parameters in Exercise 2, asking them to simulate a Gamma(0,1). It is only after half an hour that a bright student pointed out the impossibility… We had tested the exams prior to printing them but this somehow escaped the four of us!

Now, as I was entering my grades into the global spreadsheet, I noticed a perfect… lack of correlation between those and the grades at the midterm exam. I wonder what that means: I could be grading at random, the levels in November and in January could be uncorrelated, some students could have cheated in November and others in January, student’s names or file names got mixed up, …? A rather surprising outcome!

## optimising accept-reject

Posted in R, Statistics, University life with tags , , , , , , on November 21, 2012 by xi'an

I spotted on R-bloggers a post discussing optimising the efficiency of programming accept-reject algorithms. While it is about SAS programming, and apparently supported by the SAS company, there are two interesting features with this discussion. The first one is about avoiding the dreaded loop in accept-reject algorithms. For instance, taking the case of the truncated-at-one Poisson distribution, the code

rtpois=function(n,lambda){
sampl=c()
while (length(sampl)<n){
x=rpois(1,lambda)
if (x!=0) sampl=c(sampl,x)}
return(sampl)
}


is favoured by my R course students but highly inefficient:

> system.time(rtpois(10^5,.5))
user  system elapsed
61.600  27.781  98.727


both for the stepwise increase in the size of the vector and for the loop. For instance, defining the vector sampl first requires a tenth of the above time (note the switch from 10⁵ to 10⁶):

> system.time(rtpois(10^6,.5))
user  system elapsed
54.155   0.200  62.635


As discussed by the author of the post, a more efficient programming should aim at avoiding the loop by predicting the number of proposals necessary to accept a given number of values. Since the bound M used in accept-reject algorithms is also the expected number of attempts for one acceptance, one should start with something around Mn proposed values. (Assuming of course all densities are properly normalised.) For instance, in the case of the truncated-at-one Poisson distribution based on proposals from the regular Poisson, the bound is 1/1-e. A first implementation of this principle is to build the sample via a few loops:

rtpois=function(n,lambda){
propal=rpois(ceiling(n/(1-exp(-lambda))),lambda)
propal=propal[propal>0]
n0=length(propal)
if (n0>=n)
return(propal[1:n])
else return(c(propal,rtpois(n-n0,lambda)))
}


with a higher efficiency:

> system.time(rtpois(10^6,.5))
user  system elapsed
0.816   0.092   0.928


Replacing the expectation with an upper bound using the variance of the negative binomial distribution does not make a significant dent in the computing time

rtpois=function(n,lambda){
M=1/(1-exp(-lambda))
propal=rpois(ceiling(M*(n+2*sqrt(n/M)/(M-1))),lambda)
propal=propal[propal>0]
n0=length(propal)
if (n0>=n)
return(propal[1:n])
else return(c(propal,rtpois(n-n0,lambda)))}


since we get

> system.time(rtpois(10^6,.5))
user  system elapsed
0.824   0.048   0.877


The second point about this Poisson example is that simulating a distribution with a restricted support using another distribution with a larger support is quite inefficient. Especially when λ goes to zero By comparison, using a Poisson proposal with parameter μ and translating it by 1 may bring a considerable improvement: without getting into the gory details, it can be shown that the optimal value of μ (in terms of maximal acceptance probability) is λ and that the corresponding probability of acceptance is

$\dfrac{1-\exp\{-\lambda\}}{\lambda}$

which is larger than the probability of the original approach when λ is less than one. As shown by the graph below, this allows for a lower bound in the probability of acceptance that remains tolerable.

## R midterms

Posted in Kids, Linux, R, Statistics, University life with tags , , , , , , , , , , , on November 9, 2012 by xi'an

Here are my R midterm exams, version A and version B in English (as students are sitting next to one another in the computer rooms), on simulation methods for my undergrad exploratory statistics course. Nothing particularly exciting or innovative! Dedicated ‘Og‘s readers may spot a few Le Monde puzzles in the lot…

Two rather entertaining if mundane occurences related to this R exam: one hour prior to the exam, a student came to my office to beg for being allowed to take the solution manual with her (as those midterm exercises are actually picked from an exercise booklet, some students cooperated towards producing a complete solution manual and this within a week!), kind of missing the main point of having an exam. (I have not seen yet this manual but I’d be quite interested in checking the code they produced on that occasion…) During the exam, another student asked me what was the R command to turn any density into a random generator: he had written a density function called mydens and could not fathom why rmydens(n) was not working. The same student later called me as his computer was “stuck”: he was not aware that a “+” prompt on the command line meant R was waiting for him to complete the command… A less comical event that ended well is that a student failed to save her R code (periodically and) at the end of the exam and we had to dig very deep into the machine to salvage her R commands from \tmp as rkward safeguards, as only the .RData file was available at first. I am glad we found this before turning the machine off, otherwise it would have been lost.

## generalised ratio of uniforms

Posted in R, Statistics, University life with tags , , , on May 15, 2012 by xi'an

A recent arXiv posting of the paper “On the Generalized Ratio of Uniforms as a Combination of Transformed Rejection and Extended Inverse of Density Sampling” by Martino, Luengo, and Míguez from Madrid rekindled my interest in this rather peculiar simulation method. The ratio of uniforms samples uniformly on the subgraph

$\mathcal{A}=\{(v,u);\,0\le u\le\sqrt{p(v/u)}\}$

to produce simulations from p as the ratio v/u. The proof is straightforward first year calculus but I do not find the method intuitive as, say, accept/reject…. The paper gives a very detailed background on those methods, as well as on the “inverse of density method”, which is like looking at the uniform simulation over the subgraph, but with both axes inverted (slice sampling is the same on both). (A minor point of contention or at least misunderstanding: when using the inverse of density method, the authors claim that using the unormalised and the normalised versions of the target leads to the same outcome. While it is true for the direct method, I have trouble seeing the equivalent in the inverse case…) The paper also stresses that the optimal case for accept-reject is when the target is bounded because the uniform can then be used as a proposal. I agree this is a simpler solution but fail to see any optimality in the matter. The authors then study ways of transforming unbounded subgraphs into bounded domains (i.e. bounded pdfs and supports). This imposes conditions on the transform f, which must have finite limits for p(x)/f’(x) or p-1(x)/f’(x) at the boundaries. (An optimal choice is when f is the cdf of p, since then the transform is uniform.)

The remainder (and more innovative) part of the paper is less clear in that I do not get a generic feeling on what it is about! The generalisation of the above is to consider uniform sampling from

$\mathcal{A}_g=\big\{(v,u);\,0\le u\le g^{-1}\{c p[v/g'(u)]\}\big\}$

for a generic increasing function g such that g(0)=0. And c a positive constant. (Any positive constant?!) But this is from a 1991 paper by Jon Wakefield, Alan Gelfand, and Adrian Smith. The extension is thus in finding g such that the above region is bounded and can be explored by uniform sampling over a box.. And in noticing that “the generalized Ratio-of-Uniform method is a combination of the transformed rejection method applied to the inverse density with the extended inverse-of-density method” (p.27).

I wonder at the applicability of the approach for costly target functions p. And at the extension to larger dimensions. And wish I had more time (or more graduate students) to look at possible adaptive constructions of the transform g. An interesting and fruitful read, nonetheless!

## ultimate R recursion

Posted in Books, R, Statistics, University life with tags , , , , , , on January 31, 2012 by xi'an

One of my students wrote the following code for his R exam, trying to do accept-reject simulation (of a Rayleigh distribution) and constant approximation at the same time:

fAR1=function(n){
u=runif(n)
x=rexp(n)
f=(C*(x)*exp(-2*x^2/3))
g=dexp(n,1)
test=(u<f/(3*g))
y=x[test]
p=length(y)/n #acceptance probability
M=1/p
C=M/3
hist(y,20,freq=FALSE)
return(x)
}


which I find remarkable if alas doomed to fail! I wonder if there exists a (real as opposed to fantasy) computer language where you could introduce constants C and only define them later… (What’s rather sad is that I keep insisting on the fact that accept-reject does not need the constant C to operate. And that I found the same mistake in several of the students’ code. There is a further mistake in the above code when defining g. I also wonder where the 3 came from…)