## Archive for sampling

## state of the art in sampling & clustering [workshop]

Posted in Books, pictures, Statistics, Travel, University life with tags Autrans, clustering, Garching, Germany, HMC, Max Planck Institute, Max-Planck-Institut für Physik, München, nested sampling, sampling, workshop on September 17, 2020 by xi'an

**N**ext month, I am taking part in a workshop on sampling & clustering at the Max-Planck-Institut für Physik in Garching, Germany (near München), by giving a three-hour introduction to ABC, as I did three years ago in Autrans. Being there and talking with local researchers if the sanitary conditions allow; from my office otherwise. Other speakers include Michael Betancourt on HMC and Johannes Buchner on nested sampling. Remote participation in this MPI workshop is both open and free, **but participants must register before 18 September, namely tomorrow.**

## a perfectly normally distributed sample

Posted in R, Statistics with tags bayestestR, blog, R, R-bloggers, random number generator, rnorm(), rnorm_perfect(), sample, sampling on May 9, 2019 by xi'an

**W**hen I saw this title on R-bloggers, I was wondering how “more perfect” a Normal sample could be when compared with the outcome of rnorm(n). Hence went checking the original blog on `bayestestR` in search of more information. Which was stating nothing more than how to generate a sample that is *perfectly* normal by using the `rnorm_perfect` function. Still unsure of the meaning, I contacted one of the contributors, who replied very quickly:

> …that’s actually a good question. I would say an empirical sample having characteristics as close as possible to a canonical Gaussian distribution.

I then opened the `rnorm_perfect` function in `bayestestR`, which is simply the sequence of the n quantiles

```r
stats::qnorm(seq(1/n, 1 - 1/n, length.out = n), mean, sd)
```
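A quick sanity check (a sketch of mine, using only base R and rebuilding the quantile construction rather than calling `bayestestR`):

```r
# "perfect" Normal sample of size n: the n equispaced theoretical quantiles
n <- 1e4
x <- qnorm(seq(1/n, 1 - 1/n, length.out = n), mean = 0, sd = 1)
mean(x)  # essentially zero, by symmetry of the quantile sequence
sd(x)    # slightly below one, as the quantile levels are clipped at 1/n and 1-1/n
```

The sample is thus “perfect” in the sense that its empirical quantiles match the theoretical ones by construction, at the cost of not being random at all.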

## approximate Bayesian inference under informative sampling

Posted in Books, Statistics, Travel, University life with tags ABC, approximate Bayesian inference, Bayesian semi-parametrics, Bernstein-von Mises theorem, Biometrika, estimating equations, generalised method of moments, RER B, Ron Gallant, sampling on March 30, 2018 by xi'an

**I**n the first issue of this year's Biometrika, I spotted a paper with the above title, written by Wang, Kim, and Yang, and thought it was a particular case of ABC. However, when I read it on a rare metro ride to Dauphine, thanks to my hurting knee!, I got increasingly disappointed as the contents had nothing to do with ABC. The purpose of the paper is to derive a consistent and convergent posterior distribution based on an estimator of the parameter θ that is… consistent and convergent under informative sampling. Using for instance a Normal approximation to the sampling distribution of this estimator. Or to the sampling distribution of the pseudo-score function, S(θ) [whose pseudo-normality reminded me of Ron Gallant's approximations and of my comments on them]. The paper then considers a generalisation to the case of estimating equations, U(θ), which may again enjoy a Normal asymptotic distribution. Involving an object that does not make direct Bayesian sense, namely the posterior of the parameter θ given U(θ)… (The algorithm proposed to generate from this posterior (8) is also a mystery.) Since the approach requires consistent estimators to start with and aims at reproducing frequentist coverage properties, I am thus at a loss as to why this pseudo-Bayesian framework is adopted.

## sampling by exhaustion

Posted in Books, Kids, R, Statistics with tags birthday problem, occupancy, sampling, The Riddler, William Feller on November 25, 2016 by xi'an

**T**he riddle set by The Riddler last week sums up as follows:

Within a population of size N, each individual in the population independently selects another individual. All individuals selected at least once are removed and the process iterates until one or zero individual is left. What is the probability that there is zero individual left?
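A one-round simulation (a sketch of mine, not from the original riddle) already shows the removal rate at play:

```r
# one round: each of N individuals picks uniformly among the other N-1,
# and every individual picked at least once is removed
set.seed(1)
N <- 1e4
pick <- sapply(1:N, function(i) sample((1:N)[-i], 1))
length(unique(pick))/N    # fraction removed, close to 1-exp(-1) ≈ 0.632
1 - (1 - 1/(N-1))^(N-1)   # expected fraction for finite N
```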

While I cannot see a clean analytical solution to this problem, it reminds me of an envelope-versus-letter (matches) problem I saw in graduate school. Indeed, the expected number of removed (or selected) individuals is given by

$$N\left[1-\left(1-\frac{1}{N-1}\right)^{N-1}\right]$$

which is equivalent to (1-e⁻¹)N for N large, meaning that the population decreases by an average factor of e⁻¹ at each round. And that it takes on average approximately log(N) iterations to reach a single individual. A simulation of the first probabilities of ending with one individual led me to the above curve, which wiggles in an almost periodic way around the probability ½, equal to the average of those probabilities. Using the R code

```r
rad=function(N){ # next population size, M = largest size, T = number of replications
  ut=sample(rep(2:N,2),1)  # individual 1 picks among the others
  for (i in 2:N)           # sampling: individual i picks among (1:N)[-i]
    ut=c(ut,sample(rep((1:N)[-i],2),1))
  return(N-length(unique(ut)))}

sal=rep(0,M); sal[1]=1     # sal[N]: probability of ending with one individual
for (N in 3:M){
  prop=0
  for (t in 1:T){          # one single step
    i=rad(N)
    if (i>0) prop=prop+sal[i]}
  sal[N]=prop/T}
```

which exploits the previously computed probabilities. The variability is most interesting, if unexpected, but looking back at Feller's sections and exercises on the classical occupancy problem, I could not find a connection with this problem. If it exists. Still, if N is large enough, the exclusion of one index from the selection becomes negligible and the probability of moving from n to m individuals should be approximately [Feller, eqn (2.4), p.102]

$$\binom{n}{m}\sum_{\nu=0}^{n-m}(-1)^{\nu}\binom{n-m}{\nu}\left(1-\frac{m+\nu}{n}\right)^{n}$$

This formula approximates quite well the exact probability, but as in a previous post about the birthday problem, it proves quite delicate to compute. As already noticed by Feller.
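To see this delicacy at work, here is a direct implementation of Feller's alternating sum (a sketch of mine, with `fellerp` a made-up name, not code from the original post):

```r
# Feller (2.4): probability that exactly m of n cells remain empty
# after r balls are thrown independently and uniformly into the n cells
fellerp <- function(m, n, r = n){
  nu <- 0:(n - m)
  sum((-1)^nu * choose(n, m) * choose(n - m, nu) * (1 - (m + nu)/n)^r)
}
sum(sapply(0:30, fellerp, n = 30))  # sums to one for moderate n
fellerp(350, 1000)                  # NaN: the binomial terms overflow before cancelling
```

Working on the log scale with `lchoose` postpones the overflow, but the alternating signs still force massive cancellation, which is presumably what makes the exact probability delicate to compute.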

## gap frequencies [& e]

Posted in Kids, R with tags misanthrope, R, sampling, The Riddler on April 29, 2016 by xi'an

**A** riddle from The Riddler where brute-force simulation does not pay:

For a given integer N, pick integers at random without replacement between 1 and N, prohibiting the neighbours of integers already selected, until no eligible entry remains. What is the frequency of selected integers as N grows to infinity?

A simple implementation of the random experiment is as follows

```r
generope=function(N){
  frei=rep(1,N)  # availability indicators
  hus=0          # number of selected integers
  while (max(frei)==1){
    i=sample(rep((1:N)[frei==1],2),1)  # rep(.,2) guards against sample()'s scalar quirk
    frei[i]=frei[i+1]=frei[i-1]=0      # remove i and its neighbours
    hus=hus+1}
  return(hus)}
```

It is however quite slow and does not exploit the recursive feature of the sampling, namely that the first draw breaks the set {1,…,N} into two sets:

```r
generipe=function(N){
  if (N<2){ return((N>0)) }else{
    i=sample(1:N,1)
    return(1+generipe(i-2)+generipe(N-i-1))}}
```

But even this faster solution takes a while to run for large values of N:

```r
frqns=function(N){
  space=0
  for (t in 1:1e3) space=space+generipe(N)
  return(space/(N*1e3))}
```

as for instance

```r
> microbenchmark(frqns(100),time=10)
Unit: nanoseconds
       expr       min        lq         mean    median        uq       max
 frqns(100) 178720117 185020903 212212355.77 188710872 205865816 471395620
       time         4         8        26.13        32        37       102
```

(Note that `time=10` is not the `times=` argument of `microbenchmark`, hence ends up benchmarked as an expression of its own, the second row.)

Hence this limits the realisation of the simulation to, say, N=10⁴. Thinking further about the recursive aspect of the process however leads to a recursion on the frequencies q_{N}, as it is straightforward to prove that

$$q_N=\frac{1}{N}+\frac{2}{N^2}\sum_{k=1}^{N-2}k\,q_k\qquad(1)$$

with q_{1}=1 and q_{2}=1/2. This recursion can be further simplified into

$$N^2q_N=(N-1)^2q_{N-1}+2(N-2)q_{N-2}+1$$

which allows for a much faster computation

```r
s=rep(1,1e7)  # s[n]=n*q[n], with s[1]=s[2]=1
for (n in 3:1e7) s[n]=(1+2*s[n-2]+(n-1)*s[n-1])/n
```

and a limiting value of 0.4323324… Since storing s does not matter, a sliding update works even better:

```r
a=b=1  # a=s[n-2], b=s[n-1]
for (n in 3:1e8){
  c=(1+2*a+(n-1)*b)/n; a=b; b=c}
```

still producing a final value of 0.4323324, which may mean we have reached some limit in the precision.

As I could not figure out a way to find the limit of the sequence (1) above, I put it on the maths forum of Stack Exchange and very quickly got the answer (obtained by a power series representation) that the limit is (rather amazingly!)

$$\frac{1-e^{-2}}{2}$$

which is 0.432332358…, hence very close to the numerical value obtained for n=3×10⁸. (Which does not change from n=10⁸, once again for precision reasons.) Now I wonder whether or not an easier derivation of the above is feasible, but we should learn about it in a few hours on The Riddler. [**Update:** The solution published by The Riddler is exactly that one, using a power series expansion to figure out the limit of the series, unfortunately. I was hoping for a de Montmort trick of sorts…]