## PMC for combinatoric spaces

Posted in Statistics, University life with tags , , , , , , , on July 28, 2014 by xi'an

I received this interesting [edited] email from Xiannian Fan at CUNY:

I am trying to use PMC to solve Bayesian network structure learning problem (which is in a combinatorial space, not continuous space).

In PMC, the proposal distributions qi,t can be very flexible, even specific to each iteration and each instance. My problem occurs due to the combinatorial space.

For importance sampling, the requirement for proposal distribution, q, is:

support (p) ⊂ support (q)             (*)

For PMC, what is the support of the proposal distribution in iteration t? is it

support (p) ⊂ U support(qi,t)    (**)

or does (*) apply to every qi,t?

For continuous problem, this is not a big issue. We can use random walk of Normal distribution to do local move satisfying (*). But for combination search, local moving only result in finite states choice, just not satisfying (*). For example for a permutation (1,3,2,4), random swap has only choose(4,2)=6 neighbor states.

Fairly interesting question about population Monte Carlo (PMC), a sequential version of importance sampling we worked on with French colleagues in the early 2000’s.  (The name population Monte Carlo comes from Iba, 2000.)  While MCMC samplers do not have to cover the whole support of p at each iteration, it is much harder for importance samplers as their core justification is to provide an unbiased estimator to for all integrals of interest. Thus, when using the PMC estimate,

1/n ∑i,t {p(xi,t)/qi,t(xi,t)}h(qi,t),  xi,t~qi,t(x)

this estimator is only unbiased when the supports of the qi,t “s are all containing the support of p. The only other cases I can think of are

1. associating the qi,t “s with a partition Si,t of the support of p and using instead

i,t {p(xi,t)/qi,t(xi,t)}h(qi,t), xi,t~qi,t(x)

2. resorting to AMIS under the assumption (**) and using instead

1/n ∑i,t {p(xi,t)/∑j,t qj,t(xi,t)}h(qi,t), xi,t~qi,t(x)

but I am open to further suggestions!

## another R new trick [new for me!]

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , on July 16, 2014 by xi'an

While working with Andrew and a student from Dauphine on importance sampling, we wanted to assess the distribution of the resulting sample via the Kolmogorov-Smirnov measure

$\max_x |\hat{F_n}(x)-F(x)|$

where F is the target.  This distance (times √n) has an asymptotic distribution that does not depend on n, called the Kolmogorov distribution. After searching for a little while, we could not figure where this distribution was available in R. It had to, since ks.test was returning a p-value. Hopefully correct! So I looked into the ks.test function, which happens not to be entirely programmed in C, and found the line

PVAL <- 1 - if (alternative == "two.sided")
.Call(C_pKolmogorov2x, STATISTIC, n)


which means that the Kolmogorov distribution is coded as a C function C_pKolmogorov2x in R. However, I could not call the function myself.

> .Call(C_pKolmogorov2x,.3,4)


Hence, as I did not want to recode this distribution cdf, I posted the question on stackoverflow (long time no see!) and got a reply almost immediately as to use the package kolmim. Followed by the extra comment from the same person that calling the C code only required to add the path to its name, as in

> .Call(stats:::C_pKolmogorov2x,STAT=.3,n=4)
[1] 0.2292


## R/Rmetrics in Paris [alas!]

Posted in Mountains, pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , on June 30, 2014 by xi'an

Today I gave a talk on Bayesian model choice in a fabulous 13th Century former monastery in the Latin Quarter of Paris… It is the Collège des Bernardins, close to Jussieu and Collège de France, unbelievably hidden to the point I was not aware of its existence despite having studied and worked in Jussieu since 1982… I mixed my earlier San Antonio survey on importance sampling approximations to Bayes factors with an entry to our most recent work on ABC with random forests. This was the first talk of the 8th R/Rmetrics workshop taking place in Paris this year. (Rmetrics is aiming at aggregating R packages with econometrics and finance applications.) And I had a full hour and a half to deliver my lecture to the workshop audience. Nice place, nice people, new faces and topics (and even andouille de Vire for lunch!): why should I complain with an alas in the title?!What happened is that the R/Rmetrics meetings have been till this year organised in Meielisalp, Switzerland. Which stands on top of Thuner See and… just next to the most famous peaks of the Bernese Alps! And that I had been invited last year but could not make it… Meaning I lost a genuine opportunity to climb one of my five dream routes, the Mittelegi ridge of the Eiger. As the future R/Rmetrics meetings will not take place there.

A lunch discussion at the workshop led me to experiment the compiler library in R, library that I was unaware of. The impact on the running time is obvious: recycling the fowler function from the last Le Monde puzzle,

> bowler=cmpfun(fowler)
> N=20;n=10;system.time(fowler(pred=N))
user  system elapsed
52.647   0.076  56.332
> N=20;n=10;system.time(bowler(pred=N))
user  system elapsed
51.631   0.004  51.768
> N=20;n=15;system.time(bowler(pred=N))
user  system elapsed
51.924   0.024  52.429
> N=20;n=15;system.time(fowler(pred=N))
user  system elapsed
52.919   0.200  61.960


shows a ten- to twenty-fold gain in system time, if not in elapsed time (re-alas!).

## checking for finite variance of importance samplers

Posted in R, Statistics, Travel, University life with tags , , , , , , , , on June 11, 2014 by xi'an

Over a welcomed curry yesterday night in Edinburgh I read this 2008 paper by Koopman, Shephard and Creal, testing the assumptions behind importance sampling, which purpose is to check on-line for (in)finite variance in an importance sampler, based on the empirical distribution of the importance weights. To this goal, the authors use the upper tail  of the weights and a limit theorem that provides the limiting distribution as a type of Pareto distribution

$\dfrac{1}{\beta}\left(1+\xi z/\beta \right)^{-1-1/\xi}$

over (0,∞). And then implement a series of asymptotic tests like the likelihood ratio, Wald and score tests to assess whether or not the power ξ of the Pareto distribution is below ½. While there is nothing wrong with this approach, which produces a statistically validated diagnosis, I still wonder at the added value from a practical perspective, as raw graphs of the estimation sequence itself should exhibit similar jumps and a similar lack of stabilisation as the ones seen in the various figures of the paper. Alternatively, a few repeated calls to the importance sampler should disclose the poor convergence properties of the sampler, as in the above graph. Where the blue line indicates the true value of the integral.

## where did the normalising constants go?! [part 2]

Posted in R, Statistics, Travel with tags , , , , , , , on March 12, 2014 by xi'an

Coming (swiftly and smoothly) back home after this wonderful and intense week in Banff, I hugged my loved ones,  quickly unpacked, ran a washing machine, and  then sat down to check where and how my reasoning was wrong. To start with, I experimented with a toy example in R:

# true target is (x^.7(1-x)^.3) (x^1.3 (1-x)^1.7)
# ie a Beta(3,3) distribution

# samples from partial posteriors
N=10^5
sam1=rbeta(N,1.7,1.3)
sam2=rbeta(N,2.3,2.7)

# first version: product of density estimates
dens1=density(sam1,from=0,to=1)
dens2=density(sam2,from=0,to=1)
prod=dens1$y*dens2$y
# normalising by hand
prod=prod*length(dens1$x)/sum(prod) plot(dens1$x,prod,type="l",col="steelblue",lwd=2)

# second version: F-S & P's yin+yang sampling
# with weights proportional to the other posterior

subsam1=sam1[sample(1:N,N,prob=dbeta(sam1,2.3,2.7),rep=T)]
plot(density(subsam1,from=0,to=1),col="steelblue",lwd=2)

subsam2=sam2[sample(1:N,N,prob=dbeta(sam2,1.7,1.3),rep=T)]
plot(density(subsam2,from=0,to=1),col="steelblue",lwd=2)


and (of course!) it produced the perfect fits reproduced below. Writing the R code acted as a developing bath as it showed why we could do without the constants!

Of course”, because the various derivations in the above R code all are clearly independent from the normalising constant: (i) when considering a product of kernel density estimators, as in the first version, this is an approximation of

$\prod_{i=1}^k p_i(\theta)$

as well as of

$\prod_{ i}^k m_i(\theta)$

since the constant does not matter. (ii) When considering a sample from mi and weighting it by the product of the remaining true or estimated mj‘s, this is a sampling weighting resampling simulation from the density proportional to the product and hence, once again, the constants do not matter. At last, (iii) when mixing the two subsamples, since they both are distributed from the product density, the constants do not matter. As I slowly realised when running this morning (trail-running, not code-runninh!, for the very first time in ten days!), the straight-from-the-box importance sampling version on the mixed samples I considered yesterday (to the point of wondering out loud where did the constants go) is never implemented in the cited papers. Hence, the fact that

$\prod_i^k p_i(\theta)\propto \prod_{i}^k m_i(\theta)$

is enough to justify handling the target directly as the product of the partial marginals. End of the mystery. Anticlimactic end, sorry…