## independent component analysis and p-values

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , on September 8, 2014 by xi'an

Last morning at the neuroscience workshop Jean-François Cardoso presented independent component analysis though a highly pedagogical and enjoyable tutorial that stressed the geometric meaning of the approach, summarised by the notion that the (ICA) decomposition

$X=AS$

of the data X seeks both independence between the columns of S and non-Gaussianity. That is, getting as away from Gaussianity as possible. The geometric bits came from looking at the Kullback-Leibler decomposition of the log likelihood

$-\mathbb{E}[\log L(\theta|X)] = KL(P,Q_\theta) + \mathfrak{E}(P)$

where the expectation is computed under the true distribution P of the data X. And Qθ is the hypothesised distribution. A fine property of this decomposition is a statistical version of Pythagoreas’ theorem, namely that when the family of Qθ‘s is an exponential family, the Kullback-Leibler distance decomposes into

$KL(P,Q_\theta) = KL(P,Q_{\theta^0}) + KL(Q_{\theta^0},Q_\theta)$

where θ⁰ is the expected maximum likelihood estimator of θ. (We also noticed this possibility of a decomposition in our Kullback-projection variable-selection paper with Jérôme Dupuis.) The talk by Aapo Hyvärinen this morning was related to Jean-François’ in that it used ICA all the way to a three-level representation if oriented towards natural vision modelling in connection with his book and the paper on unormalised models recently discussed on the ‘Og.

On the afternoon, Eric-Jan Wagenmaker [who persistently and rationally fight the (ab)use of p-values and who frequently figures on Andrew's blog] gave a warning tutorial talk about the dangers of trusting p-values and going fishing for significance in existing studies, much in the spirit of Andrew’s blog (except for the defence of Bayes factors). Arguing in favour of preregistration. The talk was full of illustrations from psychology. And included the line that ESP testing is the jester of academia, meaning that testing for whatever form of ESP should be encouraged as a way to check testing procedures. If a procedure finds a significant departure from the null in this setting, there is something wrong with it! I was then reminded that Eric-Jan was one of the authors having analysed Bem’s controversial (!) paper on the “anomalous processes of information or energy transfer that are currently unexplained in terms of known physical or biological mechanisms”… (And of the shocking talk by Jessica Utts on the same topic I attended in Australia two years ago.)

## this issue of Series B

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , , on September 5, 2014 by xi'an

The September issue of [JRSS] Series B I received a few days ago is of particular interest to me. (And not as an ex-co-editor since I was never involved in any of those papers!) To wit: a paper by Hani Doss and Aixin Tan on evaluating normalising constants based on MCMC output, a preliminary version I had seen at a previous JSM meeting, a paper by Nick Polson, James Scott and Jesse Windle on the Bayesian bridge, connected with Nick’s talk in Boston earlier this month, yet another paper by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Michael Jordan on the bag of little bootstraps, which presentation I heard Michael deliver a few times when he was in Paris. (Obviously, this does not imply any negative judgement on the other papers of this issue!)

For instance, Doss and Tan consider the multiple mixture estimator [my wording, the authors do not give the method a name, referring to Vardi (1985) but missing the connection with Owen and Zhou (2000)] of k ratios of normalising constants, namely

$\sum_{l=1}^k \frac{1}{n_l} \sum_{t=1}^{n_l} \dfrac{n_l g_j(x_t^l)}{\sum_{s=1}^k n_s g_s(x_t^l) z_1/z_s } \longrightarrow \dfrac{z_j}{z_1}$

where the z’s are the normalising constants and with possible different numbers of iterations of each Markov chain. An interesting starting point (that Hans Künsch had mentioned to me a while ago but that I had since then forgotten) is that the problem was reformulated by Charlie Geyer (1994) as a quasi-likelihood estimation where the ratios of all z’s relative to one reference density are the unknowns. This is doubling interesting, actually, because it restates the constant estimation problem into a statistical light and thus somewhat relates to the infamous “paradox” raised by Larry Wasserman a while ago. The novelty in the paper is (a) to derive an optimal estimator of the ratios of normalising constants in the Markov case, essentially accounting for possibly different lengths of the Markov chains, and (b) to estimate the variance matrix of the ratio estimate by regeneration arguments. A favourite tool of mine, at least theoretically as practically useful minorising conditions are hard to come by, if at all available.

## where did the normalising constants go?! [part 2]

Posted in R, Statistics, Travel with tags , , , , , , , on March 12, 2014 by xi'an

Coming (swiftly and smoothly) back home after this wonderful and intense week in Banff, I hugged my loved ones,  quickly unpacked, ran a washing machine, and  then sat down to check where and how my reasoning was wrong. To start with, I experimented with a toy example in R:

# true target is (x^.7(1-x)^.3) (x^1.3 (1-x)^1.7)
# ie a Beta(3,3) distribution

# samples from partial posteriors
N=10^5
sam1=rbeta(N,1.7,1.3)
sam2=rbeta(N,2.3,2.7)

# first version: product of density estimates
dens1=density(sam1,from=0,to=1)
dens2=density(sam2,from=0,to=1)
prod=dens1$y*dens2$y
# normalising by hand
prod=prod*length(dens1$x)/sum(prod) plot(dens1$x,prod,type="l",col="steelblue",lwd=2)

# second version: F-S & P's yin+yang sampling
# with weights proportional to the other posterior

subsam1=sam1[sample(1:N,N,prob=dbeta(sam1,2.3,2.7),rep=T)]
plot(density(subsam1,from=0,to=1),col="steelblue",lwd=2)

subsam2=sam2[sample(1:N,N,prob=dbeta(sam2,1.7,1.3),rep=T)]
plot(density(subsam2,from=0,to=1),col="steelblue",lwd=2)


and (of course!) it produced the perfect fits reproduced below. Writing the R code acted as a developing bath as it showed why we could do without the constants!

Of course”, because the various derivations in the above R code all are clearly independent from the normalising constant: (i) when considering a product of kernel density estimators, as in the first version, this is an approximation of

$\prod_{i=1}^k p_i(\theta)$

as well as of

$\prod_{ i}^k m_i(\theta)$

since the constant does not matter. (ii) When considering a sample from mi and weighting it by the product of the remaining true or estimated mj‘s, this is a sampling weighting resampling simulation from the density proportional to the product and hence, once again, the constants do not matter. At last, (iii) when mixing the two subsamples, since they both are distributed from the product density, the constants do not matter. As I slowly realised when running this morning (trail-running, not code-runninh!, for the very first time in ten days!), the straight-from-the-box importance sampling version on the mixed samples I considered yesterday (to the point of wondering out loud where did the constants go) is never implemented in the cited papers. Hence, the fact that

$\prod_i^k p_i(\theta)\propto \prod_{i}^k m_i(\theta)$

is enough to justify handling the target directly as the product of the partial marginals. End of the mystery. Anticlimactic end, sorry…

## where did the normalising constants go?! [part 1]

Posted in R, Statistics, Travel with tags , , , , on March 11, 2014 by xi'an

When listening this week to several talks in Banff handling large datasets or complex likelihoods by parallelisation, splitting the posterior as

$\prod_{i=1}^k p_i(\theta)$

and handling each term of this product on a separate processor or thread as proportional to a probability density,

$p_i(\theta)\propto m_i(\theta)=\omega_i p_i(\theta),$

then producing simulations from the mi‘s and attempting at deriving simulations from the original product, I started to wonder where all those normalising constants went. What vaguely bothered me for a while, even prior to the meeting, and then unclicked thanks to Sylvia’s talk yesterday was the handling of the normalising constants ωi by those different approaches… Indeed, it seemed to me that the samples from the mi‘s should be weighted by

$\omega_i\prod_{j\ne i}^k p_j(\theta)$

rather than just

$\prod_{j\ne i}^k p_j(\theta)$

or than the product of the other posteriors

$\prod_{j\ne i}^k m_j(\theta)$

which makes or should make a significant difference. For instance, a sheer importance sampling argument for the aggregated sample exhibited those weights

$\mathbb{E}[h(\theta_i)\prod_{i=1}^k p_i(\theta_i)\big/m_i(\theta_i)]=\omega_i^{-1}\int h(\theta_i)\prod_{j}^k p_j(\theta_i)\text{d}\theta_i$

Hence processing the samples on an equal footing or as if the proper weight was the product of the other posteriors mj should have produced a bias in the resulting sample. This was however the approach in both Scott et al.‘s and Neiswanger et al.‘s perspectives. As well as Wang and Dunson‘s, who also started from the product of posteriors. (Normalizing constants are considered in, e.g., Theorem 1, but only for the product density and its Weierstrass convolution version.) And in Sylvia’s talk. Such a consensus of high calibre researchers cannot get it wrong! So I must have missed something: what happened is that the constants eventually did not matter, as expanded in the next post

## R finals

Posted in R, Statistics, University life with tags , , , , , , , , on January 31, 2013 by xi'an

On the morning I returned from Varanasi and the ISBA meeting there, I had to give my R final exam (along with three of my colleagues in Paris-Dauphine). This year, the R course was completely in English, exam included, which means I can post it here as it may attract more interest than the French examens of past years…

I just completed grading my 32 copies, all from exam A, which takes a while as I have to check (and sometimes recover) the R code, and often to correct the obvious mistakes to see if the deeper understanding of the concepts is there. This year student cohort is surprisingly homogeneous: I did not spot any of the horrors I may have mentioned in previous posts.

I must alas acknowledge a grievous typo in the version of Exam B that was used the day of the final: cutting-and-pasting from A to B, I forgot to change the parameters in Exercise 2, asking them to simulate a Gamma(0,1). It is only after half an hour that a bright student pointed out the impossibility… We had tested the exams prior to printing them but this somehow escaped the four of us!

Now, as I was entering my grades into the global spreadsheet, I noticed a perfect… lack of correlation between those and the grades at the midterm exam. I wonder what that means: I could be grading at random, the levels in November and in January could be uncorrelated, some students could have cheated in November and others in January, student’s names or file names got mixed up, …? A rather surprising outcome!

## estimating the measure and hence the constant

Posted in pictures, Running, Statistics, University life with tags , , , , , , , on December 6, 2012 by xi'an

As mentioned on my post about the final day of the ICERM workshop, Xiao-Li Meng addresses this issue of “estimating the constant” in his talk. It is even his central theme. Here are his (2011) slides as he sent them to me (with permission to post them!):

He therefore points out in slide #5 why the likelihood cannot be expressed in terms of the normalising constant because this is not a free parameter. Right! His explanation for the approximation of the unknown constant is then to replace the known but intractable dominating measure—in the sense that it cannot compute the integral—with a discrete (or non-parametric) measure supported by the sample. Because the measure is defined up to a constant, this leads to sample weights being proportional to the inverse density. Of course, this representation of the problem is open to criticism: why focus only on measures supported by the sample? The fact that it is the MLE is used as an argument in Xiao-Li’s talk, but this can alternatively be seen as a drawback: I remember reviewing Dankmar Böhning’s Computer-Assisted Analysis of Mixtures and being horrified when discovering this feature! I am currently more agnostic since this appears as an alternative version of empirical likelihood. There are still questions about the measure estimation principle: for instance, when handling several samples from several distributions, why should they all contribute to a single estimate of μ rather than to a product of measures? (Maybe because their models are all dominated by the same measure μ.) Now, getting back to my earlier remark, and as a possible answer to Larry’s quesiton, there could well be a Bayesian version of the above, avoiding the rough empirical likelihood via Gaussian or Drichlet process prior modelling.

## ICERM, Brown, Providence, RI (#3)

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , on December 3, 2012 by xi'an

Just yet another perfect day in Providence! Especially when I thought it was going to be a half-day: After a longer and slightly warmer run in the early morning around the peninsula, I attended the lecture by Eric Moulines on his recent results on adaptive MCMC and the equi-energy sampler. At this point, we were told that, since Peter Glynn was sick, the afternoon talks were drifted forward. This meant that I could attend Mylène Bédard’s talk in the morning and most of Xiao-Li Meng’s talk, before catching my bus to the airport, making it a full day in the end!

The research presented by Mylène (and coauthored with Randal Douc and Eric Moulines) was on multiple-try MCMC and delayed-rejection MCMC, with optimal scaling results and a comparison of the efficiency of those involved schemes. I had not seen the work before and got quite impressed by the precision of the results and the potential for huge efficiency gains. One of the most interesting tricks was to use an antithetic move for the second step, considerably improving the acceptance rate in the process. An aside exciting point was to realise that the hit-and-run solution was also open to wide time-savings thanks to some factorisation.

While Xiao-Li’s talk had connections with his earlier illuminating talk in New York last year, I am quite desolate to have missed [the most novel] half of it (and still caught my bus by a two minute margin!), esp. because it connected beautifully with the constant estimation controverse! Indeed, Xiao-Li started his presentation with the pseudo-paradox that the likelihood cannot be written as a function of the normalising constant, simply because this is not a free parameter. He then switched to his usual theme that the dominating measure was to be replaced with a substitute and estimated.The normalising constant being a function of the dominating measure, it is a by-product of this estimation step. And can even be endowed within a Bayesian framework. Obviously, one can always argue against the fact that the dominating measure is truly unknown, however this gives a very elegant safe-conduct to escape the debate about the constant that did not want to be estimated…So to answer Xiao-Li’s question as I was leaving the conference room, I have now come to a more complete agreement with his approach. And think further advances could be contemplated along this path…