where did the normalising constants go?! [part 2]

Posted in R, Statistics, Travel on March 12, 2014 by xi'an

Coming (swiftly and smoothly) back home after this wonderful and intense week in Banff, I hugged my loved ones, quickly unpacked, ran a washing machine, and then sat down to check where and how my reasoning was wrong. To start with, I experimented with a toy example in R:

# true target is (x^.7(1-x)^.3) (x^1.3 (1-x)^1.7)
# ie a Beta(3,3) distribution

# samples from partial posteriors
N=10^5
sam1=rbeta(N,1.7,1.3)
sam2=rbeta(N,2.3,2.7)

# first version: product of density estimates
dens1=density(sam1,from=0,to=1)
dens2=density(sam2,from=0,to=1)
prod=dens1$y*dens2$y
# normalising by hand
prod=prod*length(dens1$x)/sum(prod)
plot(dens1$x,prod,type="l",col="steelblue",lwd=2)

# second version: F-S & P's yin+yang sampling
# with weights proportional to the other posterior

subsam1=sam1[sample(1:N,N,prob=dbeta(sam1,2.3,2.7),rep=T)]
plot(density(subsam1,from=0,to=1),col="steelblue",lwd=2)

subsam2=sam2[sample(1:N,N,prob=dbeta(sam2,1.7,1.3),rep=T)]
plot(density(subsam2,from=0,to=1),col="steelblue",lwd=2)


and (of course!) it produced the perfect fits reproduced below. Writing the R code acted as a developing bath as it showed why we could do without the constants!

“Of course”, because the various derivations in the above R code are all clearly independent of the normalising constant: (i) when considering a product of kernel density estimators, as in the first version, this is an approximation of

$\prod_{i=1}^k p_i(\theta)$

as well as of

$\prod_{i=1}^k m_i(\theta)$

since the constant does not matter. (ii) When considering a sample from mi and weighting it by the product of the remaining true or estimated mj's, this is a sampling importance resampling simulation from the density proportional to the product, hence, once again, the constants do not matter. Lastly, (iii) when mixing the two subsamples, since both are distributed from the product density, the constants do not matter. As I slowly realised when running this morning (trail-running, not code-running!, for the very first time in ten days!), the straight-from-the-box importance sampling version on the mixed samples I considered yesterday (to the point of wondering out loud where the constants had gone) is never implemented in the cited papers. Hence, the fact that

$\prod_{i=1}^k p_i(\theta)\propto \prod_{i=1}^k m_i(\theta)$

is enough to justify handling the target directly as the product of the partial marginals. End of the mystery. Anticlimactic end, sorry…
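The mixing step (iii) is easy to check on the same toy numbers: both weighted subsamples target the Beta(3,3) product, hence so does their mixture. A minimal sketch (not part of the original session):

```r
# partial posterior samples, as above
N=10^5
sam1=rbeta(N,1.7,1.3)
sam2=rbeta(N,2.3,2.7)
# resample each by the other partial posterior
subsam1=sam1[sample(1:N,N,prob=dbeta(sam1,2.3,2.7),rep=T)]
subsam2=sam2[sample(1:N,N,prob=dbeta(sam2,1.7,1.3),rep=T)]
# mixing the two subsamples: still distributed from the product density
mixsam=c(subsam1,subsam2)
plot(density(mixsam,from=0,to=1),col="steelblue",lwd=2)
curve(dbeta(x,3,3),add=TRUE,lty=2) # true Beta(3,3) target
```

The first two moments of the mixed sample match those of the Beta(3,3) target, constants never entering the computation.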

where did the normalising constants go?! [part 1]

Posted in R, Statistics, Travel on March 11, 2014 by xi'an

When listening this week to several talks in Banff handling large datasets or complex likelihoods by parallelisation, splitting the posterior as

$\prod_{i=1}^k p_i(\theta)$

and handling each term of this product on a separate processor or thread as proportional to a probability density,

$p_i(\theta)\propto m_i(\theta)=\omega_i p_i(\theta),$

then producing simulations from the mi's and attempting to derive simulations from the original product, I started to wonder where all those normalising constants went. What had vaguely bothered me for a while, even prior to the meeting, and then clicked thanks to Sylvia's talk yesterday, was the handling of the normalising constants ωi by those different approaches… Indeed, it seemed to me that the samples from the mi's should be weighted by

$\omega_i\prod_{j\ne i}^k p_j(\theta)$

rather than just

$\prod_{j\ne i}^k p_j(\theta)$

or than the product of the other posteriors

$\prod_{j\ne i}^k m_j(\theta)$

which makes or should make a significant difference. For instance, a sheer importance sampling argument for the aggregated sample exhibited those weights

$\mathbb{E}[h(\theta_i)\prod_{j=1}^k p_j(\theta_i)\big/m_i(\theta_i)]=\omega_i^{-1}\int h(\theta_i)\prod_{j=1}^k p_j(\theta_i)\,\text{d}\theta_i$

Hence processing the samples on an equal footing, or as if the proper weight were the product of the other posteriors mj, should have produced a bias in the resulting sample. This was however the approach taken in both Scott et al.'s and Neiswanger et al.'s perspectives. As well as in Wang and Dunson's, who also started from the product of posteriors. (Normalising constants are considered in, e.g., Theorem 1, but only for the product density and its Weierstrass convolution version.) And in Sylvia's talk. Such a consensus of high-calibre researchers cannot get it wrong! So I must have missed something: what happened is that the constants eventually did not matter, as expanded upon in the next post.
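With hindsight, the cancellation is immediate once the weights are renormalised: multiplying all importance weights by an arbitrary constant leaves the self-normalised estimator unchanged. A quick sanity check on the Beta toy example of the part-2 post above (an illustration added here, not taken from the cited papers):

```r
N=10^5
sam1=rbeta(N,1.7,1.3) # from m_1, proportional to x^.7(1-x)^.3
w=dbeta(sam1,2.3,2.7) # weights given by the other partial posterior
w2=17.3*w             # same weights, rescaled by an arbitrary constant
est=sum(w*sam1)/sum(w)
est2=sum(w2*sam1)/sum(w2)
# both estimate the Beta(3,3) mean, 1/2: the constant cancels
```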

R finals

Posted in R, Statistics, University life on January 31, 2013 by xi'an

On the morning I returned from Varanasi and the ISBA meeting there, I had to give my R final exam (along with three of my colleagues in Paris-Dauphine). This year, the R course was completely in English, exam included, which means I can post it here as it may attract more interest than the French exams of past years…

I just completed grading my 32 copies, all from exam A, which takes a while as I have to check (and sometimes recover) the R code, and often to correct the obvious mistakes to see if the deeper understanding of the concepts is there. This year's student cohort is surprisingly homogeneous: I did not spot any of the horrors I may have mentioned in previous posts.

I must alas acknowledge a grievous typo in the version of Exam B that was used the day of the final: cutting-and-pasting from A to B, I forgot to change the parameters in Exercise 2, asking them to simulate a Gamma(0,1). It is only after half an hour that a bright student pointed out the impossibility… We had tested the exams prior to printing them but this somehow escaped the four of us!

Now, as I was entering my grades into the global spreadsheet, I noticed a perfect… lack of correlation between those and the grades at the midterm exam. I wonder what that means: I could be grading at random, the levels in November and in January could be uncorrelated, some students could have cheated in November and others in January, students' names or file names could have gotten mixed up…? A rather surprising outcome!

estimating the measure and hence the constant

Posted in pictures, Running, Statistics, University life on December 6, 2012 by xi'an

As mentioned on my post about the final day of the ICERM workshop, Xiao-Li Meng addresses this issue of “estimating the constant” in his talk. It is even his central theme. Here are his (2011) slides as he sent them to me (with permission to post them!):

He therefore points out in slide #5 that the likelihood cannot be expressed in terms of the normalising constant, because the constant is not a free parameter. Right! His approach to approximating the unknown constant is then to replace the known but intractable dominating measure (intractable in the sense that one cannot compute the integral) with a discrete (or non-parametric) measure supported by the sample. Because the measure is defined up to a constant, this leads to sample weights proportional to the inverse density. Of course, this representation of the problem is open to criticism: why focus only on measures supported by the sample? The fact that this is the MLE is used as an argument in Xiao-Li's talk, but it can alternatively be seen as a drawback: I remember reviewing Dankmar Böhning's Computer-Assisted Analysis of Mixtures and being horrified when discovering this feature! I am currently more agnostic, since this appears as an alternative version of empirical likelihood. There remain questions about the measure-estimation principle: for instance, when handling several samples from several distributions, why should they all contribute to a single estimate of μ rather than to a product of measures? (Maybe because their models are all dominated by the same measure μ.) Now, getting back to my earlier remark, and as a possible answer to Larry's question, there could well be a Bayesian version of the above, avoiding the rough empirical likelihood via Gaussian or Dirichlet process prior modelling.
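The inverse-density weighting can be seen at work in the simplest possible setting: since the density m equals q/Z for an unnormalised q, the constant is recovered by averaging q(xi)/m(xi) over the sample, with m replaced by a nonparametric estimate. (A crude one-sample illustration of the idea, not Xiao-Li's actual estimator.)

```r
set.seed(1)
x=rnorm(10^4)             # sample from the target N(0,1)
q=function(x) exp(-x^2/2) # unnormalised version, true constant sqrt(2*pi)
dens=density(x)           # nonparametric substitute for the unknown density
mhat=approx(dens$x,dens$y,xout=x)$y
Zhat=mean(q(x)/mhat)      # inverse-density weights recover the constant
c(Zhat,sqrt(2*pi))
```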

ICERM, Brown, Providence, RI (#3)

Posted in pictures, Running, Statistics, Travel, University life on December 3, 2012 by xi'an

Just yet another perfect day in Providence! Especially when I thought it was going to be a half-day: after a longer and slightly warmer run in the early morning around the peninsula, I attended the lecture by Eric Moulines on his recent results on adaptive MCMC and the equi-energy sampler. At this point, we were told that, since Peter Glynn was sick, the afternoon talks were moved forward. This meant that I could attend Mylène Bédard's talk in the morning and most of Xiao-Li Meng's talk, before catching my bus to the airport, making it a full day in the end!

The research presented by Mylène (and coauthored with Randal Douc and Eric Moulines) was on multiple-try MCMC and delayed-rejection MCMC, with optimal scaling results and a comparison of the efficiency of the schemes involved. I had not seen the work before and was quite impressed by the precision of the results and the potential for huge efficiency gains. One of the most interesting tricks was to use an antithetic move for the second step, considerably improving the acceptance rate in the process. An exciting aside was realising that the hit-and-run solution was also open to wide time-savings thanks to some factorisation.

While Xiao-Li's talk had connections with his earlier illuminating talk in New York last year, I am quite sorry to have missed [the most novel] half of it (and still caught my bus by a two-minute margin!), esp. because it connected beautifully with the constant-estimation controversy! Indeed, Xiao-Li started his presentation with the pseudo-paradox that the likelihood cannot be written as a function of the normalising constant, simply because this is not a free parameter. He then switched to his usual theme that the dominating measure is to be replaced with a substitute and estimated. The normalising constant being a function of the dominating measure, it is a by-product of this estimation step. And it can even be embedded within a Bayesian framework. Obviously, one can always argue against the claim that the dominating measure is truly unknown, but this gives a very elegant safe-conduct to escape the debate about the constant that did not want to be estimated… So, to answer Xiao-Li's question as I was leaving the conference room, I have now come to a more complete agreement with his approach. And think further advances could be contemplated along this path…