## occupancy rules

Posted in Kids, R, Statistics with tags , , , , , , , on May 23, 2016 by xi'an

While the last riddle on The Riddler was rather anticlimactic, namely to find the mean of the number Y of empty bins in a uniform multinomial with n bins and m draws, with solution

$\mathbb{E}[Y]=n(1-\frac{1}{n})^m,$

[which still has a link with e in that the fraction of empty bins converges to e⁻¹ when n=m], this led me to some more involved investigation on the distribution of Y. While it can be shown directly that the probability that k bins are non-empty is

${n \choose k}\sum_{i=1}^k (-1)^{k-i}{k \choose i}(i/n)^m$

with an R representation by

miss<-function(n,m){
p=rep(0,n)
for (k in 1:n)
p[k]=choose(n,k)*sum((-1)^((k-1):0)*choose(k,1:k)*(1:k)^m)
return(rev(p)/n^m)}


I wanted to take advantage of the moments of Y, since it writes as a sum of n indicators, counting the number of empty cells. However, the higher moments of Y are not as straightforward as its expectation and I struggled with the representation until I came upon this formula

$\mathbb{E}[Y^k]=\sum_{i=1}^k {k \choose i} i! S(k,i) \left( 1-\frac{i}{n}\right)^m$

where S(k,i) denotes the Stirling number of the second kind… Or i!S(n,i) is the number of surjections from a set of size n to a set of size i. Which leads to the distribution of Y by inverting the moment equations, as in the following R code:

diss<-function(n,m){
A=matrix(0,n,n)
mome=rep(0,n)
A[n,]=rep(1,n)
mome[n]=1
for (k in 1:(n-1)){
A[k,]=(0:(n-1))^k
for (i in 1:k)
mome[k]=mome[k]+factorial(i)*as.integer(Stirling2(n,i))*
(1-(i+1)/n)^m*factorial(k)/factorial(k-i-1)}
return(solve(A,mome))}


that I still checked by raw simulations from the multinomial

zample<-function(n,m,T=1e4){
x=matrix(sample(1:n,m*T,rep=TRUE),nrow=T)
x=sapply(apply(x,1,unique),length)
return(n-x)}


## ABC random forests for Bayesian parameter inference

Posted in Books, Kids, R, Statistics, Travel, University life, Wines with tags , , , , , , , , , , , , , , on May 20, 2016 by xi'an

Before leaving Helsinki, we arXived [from the Air France lounge!] the paper Jean-Michel presented on Monday at ABCruise in Helsinki. This paper summarises the experiments Louis conducted over the past months to assess the great performances of a random forest regression approach to ABC parameter inference. Thus validating in this experimental sense the use of this new approach to conducting ABC for Bayesian inference by random forests. (And not ABC model choice as in the Bioinformatics paper with Pierre Pudlo and others.)

I think the major incentives in exploiting the (still mysterious) tool of random forests [against more traditional ABC approaches like Fearnhead and Prangle (2012) on summary selection] are that (i) forests do not require a preliminary selection of the summary statistics, since an arbitrary number of summaries can be used as input for the random forest, even when including a large number of useless white noise variables; (b) there is no longer a tolerance level involved in the process, since the many trees in the random forest define a natural if rudimentary distance that corresponds to being or not being in the same leaf as the observed vector of summary statistics η(y); (c) the size of the reference table simulated from the prior (predictive) distribution does not need to be as large as for in usual ABC settings and hence this approach leads to significant gains in computing time since the production of the reference table usually is the costly part! To the point that deriving a different forest for each univariate transform of interest is truly a minor drag in the overall computing cost of the approach.

An intriguing point we uncovered through Louis’ experiments is that an unusual version of the variance estimator is preferable to the standard estimator: we indeed exposed better estimation performances when using a weighted version of the out-of-bag residuals (which are computed as the differences between the simulated value of the parameter transforms and their expectation obtained by removing the random trees involving this simulated value). Another intriguing feature [to me] is that the regression weights as proposed by Meinshausen (2006) are obtained as an average of the inverse of the number of terms in the leaf of interest. When estimating the posterior expectation of a transform h(θ) given the observed η(y), this summary statistic η(y) ends up in a given leaf for each tree in the forest and all that matters for computing the weight is the number of points from the reference table ending up in this very leaf. I do find this difficult to explain when confronting the case when many simulated points are in the leaf against the case when a single simulated point makes the leaf. This single point ends up being much more influential that all the points in the other situation… While being an outlier of sorts against the prior simulation. But now that I think more about it (after an expensive Lapin Kulta beer in the Helsinki airport while waiting for a change of tire on our airplane!), it somewhat makes sense that rare simulations that agree with the data should be weighted much more than values that stem from the prior simulations and hence do not translate much of an information brought by the observation. (If this sounds murky, blame the beer.) What I found great about this new approach is that it produces a non-parametric evaluation of the cdf of the quantity of interest h(θ) at no calibration cost or hardly any. (An R package is in the making, to be added to the existing R functions of abcrf we developed for the ABC model choice paper.)

## Using MCMC output to efficiently estimate Bayes factors

Posted in Books, R, Statistics, University life with tags , , , , on May 19, 2016 by xi'an

As I was checking for software to answer a query on X validated about generic Bayes factor derivation, I came across an R software called BayesFactor, which only applies in regression settings and relies on the Savage-Dickey representation of the Bayes factor

$B_{01}=\dfrac{f(y|\theta^0)}{m(y)}=\dfrac{\pi(\theta^0|y)}{\pi(\theta^0)}$

when the null hypothesis writes as θ=θ⁰ (and possibly additional nuisance parameters with [roughly speaking] an independent prior). As we discussed in our paper with Jean-Michel Marin [which got ignored by large!], this representation of the Bayes factor is based on picking a very specific version of the prior, or more exactly of three prior densities. Assuming such versions are selected, I wonder at the performances of this approximation, given that it involves approximating the marginal posterior at θ⁰….

“To ensure that the Bayes factor we compute using the Savage–Dickey ratio is the the ratio of marginal densities that we intend, the condition (…) is easily met by models which specify priors in which the nuisance parameters are independent of the parameters of interest.” Morey et al. (2011)

First, when reading Morey at al. (2011), I realised (a wee bit late!) that Chib’s method is nothing but a version of the Savage-Dickey representation when the marginal posterior can be estimated in a parametric (Rao-Blackwellised) way. However, outside hierarchical models based on conjugate priors such parametric approximations are intractable and non-parametric versions must be invoked instead, which necessarily degrades the quality of the method. A degradation that escalates with the dimension of the parameter θ. In addition, I am somewhat perplexed by the use of a Rao-Blackwell argument in the setting of the Dickey-Savage representation. Indeed this representation assumes that

$\pi_1(\psi|\theta_0)=\pi_0(\psi) \ \ \text{or}\quad \pi_1(\theta_0,\psi)=\pi_1(\theta_0)\pi_0(\psi)$

which means that [the specific version of] the conditional density of θ⁰ given ψ should not depend on the nuisance parameter. But relying on a Rao-Blackwellisation leads to estimate the marginal posterior via full conditionals. Of course, θ given ψ and y may depend on ψ, but still… Morey at al. (2011) advocate the recourse to Chib’s formula as optimal but this obviously requires the full conditional to be available. They acknowledge this point as moot, since it is sufficient from their perspective to specify a conjugate prior. They consider this to be a slight modification of the model (p.377). However, I see the evaluation of an estimated density at a single (I repeat, single!) point as being the direst part of the method as it is clearly more sensitive to approximations that the evaluation of a whole integral, since the later incorporates an averaging effect by definition. Hence, even if this method was truly available for all models, I would be uncertain of its worth when compared with other methods, except the harmonic mean estimator of course!

On the side, Morey at al. (2011) study a simple one-sample t test where they use an improper prior on the nuisance parameter σ, under both models. While the Savage-Dickey representation is correct in this special case, I fail to see why the identity would apply in every case under an improper prior. In particular, independence does not make sense with improper priors. The authors also indicate the possible use of this Bayes factor approximation for encompassing models. At first, I thought this could be most useful in our testing by mixture framework where we define an encompassing model as a mixture. However, I quickly realised that using a Beta Be(a,a) prior on the weight α with a<1 leads to an infinite density value at both zero and one, hence cannot be compatible with a Savage-Dickey representation of the Bayes factor.

## AISTATS 2016 [#1]

Posted in pictures, R, Running, Statistics, Travel, Wines with tags , , , , , , , , , , , , on May 11, 2016 by xi'an

Travelling through Seville, I arrived in Càdiz on Sunday night, along with a massive depression [weather-speaking!]. Walking through the city from the station was nonetheless pleasant as this is an town full of small streets and nice houses. If with less churches than Seville! Richard Samworth gave the first plenary talk of AISTATS 2016  with a presentation on random projections for classification. His classifier is based on an average of a large number of linear random projections of the original data where the projections are chosen as minimising the prediction error over a subset of the components. The performances of this approach seem to be consistently higher than for random forests, which makes it definitely worth investigating further. (A related R package is available.)

The following talks that day covered Bayesian optimisation and probabilistic numerics, with Javier Gonzales introducing glasses for Bayesian optimisation in order to solve its myopia (!)—by which he meant predicting the output of the optimisation over n future steps. And a first mention of the Pima Indians by Daniel Hernandez-Lobato in his talk about EP with stochastic gradient steps towards optimisation. (As well as much larger datasets.) And Mark Girolami bringing quasi-Monte Carlo into control variates. A kernel based ABC by Mijung Park, which uses kernels and maximum mean discrepancy to avoid defining summary statistics, and a version of parallel MCMC by Guillaume Basse. Plus another session on deep learning.

As usual with AISTATS conferences, the central activity of the day was the noon poster session, including speakers discussing their paper, and I had several interesting chats about MCMC related topics, with e.g. one alternative notion of ensemble MCMC [centred on estimating the normalising constant].

We awarded the notable student paper awards before the welcoming cocktail: The winners are Bo DaiNedelina Teneva, and Ye Wang.  And this first day ended up with a companionable evening in a most genuine tapa bar, tasting local blood sausage and local blue cheese. (If you do not mind the corrida theme!)

## gap frequencies [& e]

Posted in Kids, R with tags , , , on April 29, 2016 by xi'an

A riddle from The Riddler where brute-force simulation does not pay:

For a given integer N, pick at random without replacement integers between 1 and N by prohibiting consecutive integers until all possible entries are exhausted. What is the frequency of selected integers as N grows to infinity?

A simple implementation of the random experiment is as follows

generope=function(N){
frei=rep(1,N)
hus=0
while (max(frei)==1){
i=sample(rep((1:N)[frei==1],2),1)
frei[i]=frei[i+1]=frei[i-1]=0
hus=hus+1}
return(hus)}


It is however quite slow and does not exploit the recursive feature of the sampling, namely that the first draw breaks the set {1,…,N} into two sets:

generipe=function(N){
if (N<2){ return((N>0))}else{
i=sample(1:N,1)
return(1+generipe(i-2)+generipe(N-i-1))}}


But even this faster solution takes a while to run for large values of N:

frqns=function(N){
space=0
for (t in 1:1e3) space=space+generipe(N)
return(space/(N*1e3))}


as for instance

>  microbenchmark(frqns(100),time=10)
Unit: nanoseconds
expr       min        lq         mean    median        uq       max
frqns(100) 178720117 185020903 212212355.77 188710872 205865816 471395620
time         4         8        26.13        32        37       102


Hence this limits the realisation of simulation to, say, N=10⁴. Thinking further about the recursive aspect of the process however leads to a recursion on the frequencies qN, as it is straightforward to prove that

$q_N=\frac{1}{N}+\frac{2}{N^2}\,\sum_{i=1}^{N-2} iq_i$

with q1=1 and q2=1/2. This recursion can be further simplified into

$q_N=\frac{1}{N^2}+\frac{2(N-2)}{N^2}\,q_{N-2}+\frac{(N-1)^2}{N^2}q_{N-1}\qquad(1)$

which allows for a much faster computation

s=seq(1,1e7) #s[n]=n*q[n]
for (n in 3:1e7) s[n]=(1+2*s[n-2]+(n-1)*s[n-1])/n


and a limiting value of 0.4323324… Since storing s does not matter, a sliding update works even better:

a=b=1
for (n in 3:1e8){ c=(1+2*a+(n-1)*b)/n;a=b;b=c}


still producing a final value of 0.4323324, which may mean we have reached some limit in the precision.

As I could not figure out a way to find the limit of the sequence (1) above, I put it on the maths forum of Stack Exchange and very quickly got the answer (obtained by a power series representation) that the limit is (rather amazingly!)

$\dfrac{1 - e^{-2}}{2}$

which is 0.432332358.., hence very close to the numerical value obtained for n=3×10⁸. (Which does not change from n=10⁸, once again for precision reasons.) Now I wonder whether or not an easier derivation of the above is feasible, but we should learn about it in a few hours on The Riddler. [Update: The solution published by The Riddler is exactly that one, using a power series expansion to figure out the limit of the series, unfortunately. I was hoping for a de Montmort trick or sort of…]

## Le Monde puzzle [#960]

Posted in Kids, R with tags , , , , , on April 28, 2016 by xi'an

An arithmetic Le Monde mathematical puzzle:

Given an integer k>1, consider the sequence defined by F(1)=1+1 mod k, F²(1)=F(1)+2 mod k, F³(1)=F²(1)+3 mod k, &tc. [With this notation, F is not necessarily a function.] For which value of k is the sequence the entire {0,1,…,k-1} set?

This leads to an easy brute force resolution, for instance writing the R function

crkl<-function(k) return(unique(cumsum(1:(2*k))%%k))


where 2k is a sufficient substitute for ∞. Then the cases where the successive images of 1 visit the entire set {0,1,…,k-1} are given by

> for (i in 2:550) if (length(crkl(i))==i) print(i)
[1] 2
[1] 4
[1] 8
[1] 16
[1] 32
[1] 64
[1] 128
[1] 256
[1] 512


which suspiciously looks like the result that only the powers of 2 k=2,2²,2³,… lead to a complete exploration of the set {0,1,…,k-1}. Checking a few series in the plane back from Warwick, I quickly found that when k is odd, (1) the sequence is of period k and (2) there is symmetry in the sequence, which means it only takes (k-1)/2 values. For k even, there is a more complicated symmetry, with the sequence being of period 2k, symmetric around its two middle values, and taking the values 1,2,..,1+k(2k+1)/4,..,1+k(k+1)/2. Those values cannot cover the set {0,1,…,k-1} if two are equal, which means an i(i+1)/2 congruent to zero modulo k, hence equal to k. This is clearly impossible when k is a power of 2 because i and i+1 cannot both divide a power of 2. I waited for the published solution as of yesterday’s and the complete argument was to show that when N=2p, the corresponding sequence [for N] is made (modulo p) of the sequence for p plus the same sequence translated by p. The one for N is complete only if the one for p is complete, which by recursion eliminates all cases but the powers of 2…

## Le Monde puzzle [#959]

Posted in Kids, R with tags , , , on April 20, 2016 by xi'an

Another of those arithmetic Le Monde mathematical puzzle:

Find an integer A such that A is the sum of the squares of its four smallest dividers (including1) and an integer B such that B is the sum of the third poser of its four smallest factors. Are there such integers for higher powers?

This begs for a brute force resolution checking the integers until a solution appears. The only exciting part is providing the four smallest factors but a search on Stack overflow led to an existing R function:

FUN <- function(x) {
x <- as.integer(x)
div <- seq_len(abs(x))
return(div[x %% div == 0L])
}


(which uses the 0L representation I was unaware of) and hence my R code:

quest1<-function(n=2){
I=4
stop=TRUE
while ((stop)&(I<1e6)){
I=I+1
dive=FUN(I)
if (length(dive)>3)
stop=(I!=sum(sort(dive)[1:4]^n))
}
return(I)
}


But this code only seems to work for n=2 as produces A=130: it does not return any solution for the next value of n… As shown by the picture below, which solely exhibits a solution for n=2,5, A=17864 (in the second case), there is no solution less than 10⁶ for n=3,4,6,..9. So, unless I missed a point in the question, the solutions for n>2 are larger if they at all exist.

A resolution got published yesterday night in Le Monde  and (i) there is indeed no solution for n=3 (!), (ii) there are solutions for n=4 (1,419,874) and n=5 (1,015,690), which are larger than the 10⁶ bound I used in the R code, (iii) there is supposedly no solution for n=5!, when the R code found that 17,864=1⁵+2⁵+4⁵+7⁵… It is far from the first time the solution is wrong or incomplete!