## where did the normalising constants go?! [part 2]

Posted in R, Statistics, Travel with tags , , , , , , , on March 12, 2014 by xi'an

Coming (swiftly and smoothly) back home after this wonderful and intense week in Banff, I hugged my loved ones,  quickly unpacked, ran a washing machine, and  then sat down to check where and how my reasoning was wrong. To start with, I experimented with a toy example in R:

# true target is (x^.7(1-x)^.3) (x^1.3 (1-x)^1.7)
# ie a Beta(3,3) distribution

# samples from partial posteriors
N=10^5
sam1=rbeta(N,1.7,1.3)
sam2=rbeta(N,2.3,2.7)

# first version: product of density estimates
dens1=density(sam1,from=0,to=1)
dens2=density(sam2,from=0,to=1)
prod=dens1$y*dens2$y
# normalising by hand
prod=prod*length(dens1$x)/sum(prod) plot(dens1$x,prod,type="l",col="steelblue",lwd=2)

# second version: F-S & P's yin+yang sampling
# with weights proportional to the other posterior

subsam1=sam1[sample(1:N,N,prob=dbeta(sam1,2.3,2.7),rep=T)]
plot(density(subsam1,from=0,to=1),col="steelblue",lwd=2)

subsam2=sam2[sample(1:N,N,prob=dbeta(sam2,1.7,1.3),rep=T)]
plot(density(subsam2,from=0,to=1),col="steelblue",lwd=2)


and (of course!) it produced the perfect fits reproduced below. Writing the R code acted as a developing bath as it showed why we could do without the constants!

Of course”, because the various derivations in the above R code all are clearly independent from the normalising constant: (i) when considering a product of kernel density estimators, as in the first version, this is an approximation of

$\prod_{i=1}^k p_i(\theta)$

as well as of

$\prod_{ i}^k m_i(\theta)$

since the constant does not matter. (ii) When considering a sample from mi and weighting it by the product of the remaining true or estimated mj‘s, this is a sampling weighting resampling simulation from the density proportional to the product and hence, once again, the constants do not matter. At last, (iii) when mixing the two subsamples, since they both are distributed from the product density, the constants do not matter. As I slowly realised when running this morning (trail-running, not code-runninh!, for the very first time in ten days!), the straight-from-the-box importance sampling version on the mixed samples I considered yesterday (to the point of wondering out loud where did the constants go) is never implemented in the cited papers. Hence, the fact that

$\prod_i^k p_i(\theta)\propto \prod_{i}^k m_i(\theta)$

is enough to justify handling the target directly as the product of the partial marginals. End of the mystery. Anticlimactic end, sorry…

## particle efficient importance sampling

Posted in Statistics with tags , , , , , , on October 15, 2013 by xi'an

Marcel Scharth and Robert Kohn just arXived a new article entitled “particle efficient importance sampling“. What is—the efficiency—about?! The spectacular diminution in variance—(the authors mention a factor of 6,000 when compared with regular particle filters!—in a stochastic volatility simulation study.

If I got the details right, the improvement stems from a paper by Richard and Zhang (Journal of  Econometrics, 2007). In a state-space/hidden Markov model setting, (non-sequential) importance sampling tries to approximate the smoothing distribution one term at a time, ie p(xt|xt-1,y1:n), but Richard and Zhang (2007) modify the target by looking at

p(yt|xt)p(xt|xt-1(xt-1,y1:n),

where the last term χ(xt-1,y1:n) is the normalising constant of the proposal kernel for the previous (in t-1) target, k(xt-1|xt-2,y1:n). This kernel is actually parameterised as k(xt-1|xt-2,at(y1:n)) and the EIS algorithm optimises those parameters, one term at a time. The current paper expands Richard and Zhang (2007) by using particles to approximate the likelihood contribution and reduce the variance once the “optimal” EIS solution is obtained. (They also reproduce Richard’s and Zhang’s tricks of relying on the same common random numbers.

This approach sounds like a “miracle” to me, in the sense(s) that (a) the “normalising constant” is far from being uniquely defined (and just as far from being constant in the parameter at) and (b) it is unrelated with the target distribution (except for the optimisation step). In the extreme case when the normalising constant is also constant… in at, this step clearly is useless. (This also opens the potential for an optimisation in the choice of χ(xt-1,y1:n)…)

The simulation study starts from a univariate stochastic volatility model relying on two hidden correlated AR(1) models. (There may be a typo in the definition in Section 4.1, i.e. a Φi missing.) In those simulations, EIS brings a significant variance reduction when compared with standard particle filters and particle EIS further improves upon EIS by a factor of 2 to 20 (in the variance). I could not spot in the paper which choice had been made for χ()… which is annoying as I gathered from my reading that it must have a strong impact on the efficiency attached to the name of the method!

## IS² for Bayesian inference

Posted in Statistics, University life with tags , , , , on September 26, 2013 by xi'an

“…the method of Approximate Bayesian Computation (ABC) may be used to estimate unbiasedly an approximation to the likelihood.”

Minh-Ngoc Tran, Marcel Scharth, Michael Pitt and Robert Kohn arXived a paper on using an unbiased estimate of the likelihood in lieu of the genuine thing and still getting convergence to the right thing. While the spirit of the paper is in the same spirit as the fundamental paper of  Andrieu and Roberts (2009, AoS, somewhat surprisingly missing from the reference), comparing the asymptotic efficiency of using an estimate versus using the genuine likelihood, my attention was distracted by the above quote. This is the only sentence (besides the abstract) where ABC is mentioned and I was a bit confused: ABC is used to estimate an approximation to the likelihood, for sure, converging to

$\int_{d(x,x^\text{obs})\le\varepsilon} f(x|\theta)\,\text{d}\theta$

as the number of pseudo-datasets grows to infinity and it is unbiased on this sense, but this is not the reason for using ABC, as the ABC pseudo-likelihood above is the (by)product of the methodology rather than the genuine quantity of interest. Reading the sentence too fast gave me the feeling that ABC did produce an unbiased approximation to the genuine likelihood! Distracted I was, since this is not at all the point of the paper! However, I would be curious to see how it applies to ABC.

The core result is the convergence of an importance sampling estimator using a likelihood estimated by importance sampling (hence the IS², also inspired by SMC²),. The trick in the proof is to turn the computation of  the likelihood estimand into the production of an (unobserved or “implicitly generated”) auxiliary variable and then to rewrite the original estimator as a genuine importance estimator. (This seems to imply the derivation of an independent importance sampling estimator of the likelihood at each iteration, right?) Standard convergence results then follow, except that the asymptotic variance has an extra term. And except that the estimator of the likelihood does not have to converge, i.e. can keep a fixed number of terms and a positive variance. The second part of the paper establishes that using an estimate degrades the asymptotic variance.

## another vanilla Rao-Blackwellisation

Posted in Statistics, University life with tags , , , , , on September 16, 2013 by xi'an

In the latest issue of Statistics and Computing (2013, Issue 23, pages 577-587), Iliopoulos and Malefaki published a paper that relates to our vanilla Rao-Blackwellisation paper with Randal Douc. The idea is to derive another approximation to the ideal importance sampling weight using the “accepted” Markov chain. (With Randal, we had a Bernoulli factory representation.) The density g(x) of the accepted chain being unknown; it is represented as the expectation under π of the function

$\min\left\{q(z|x)/\pi(z),q(x|z)/\pi(x)\right\}$

and hence estimated by a self-normalised average based on the whole Markov chain. This means the resulting importance estimate uses twice the output of the algorithm and that it is biased and of order O(n²), thus the same order as our original Rao-Blackwellised estimator (Robert & Casella, 1996)… This also means convergence and CLT are very hard to establish: the main convergence theorem of the paper holds only for finite state spaces, with a surprising smaller asymptotic variance for this self-normalised average than for the ideal importance sampling estimator in the independent Metropolis-Hastings case. (Both are biased by being self-normalised and the paper does not consider the magnitude of those biases.)

Interestingly, the authors also ran a comparison with our parallelised Rao-Blackwellised version (with Pierre Jacob and Murray Smith), but conclude (P.58) at a larger CPU (should be GPU!!) required by the parallelisation, which does not really make sense: when compared with the plain Metropolis-Hastings implementation, run on a single processor, the parallel version only requires an extra random permutation per thread or per processor. I thus suspect a faulty implementation that induces this CPU being linear in the size of the blocks, like maybe only saving one output per block… Also interestingly, the paper re-analyses the Pima Indian probit model Jean-Michel Marin and I (and many others) used as benchmark in several of our papers. As in the most standard examples, the outcome shows a mild reduction in variance when using this estimated importance sampling version. Maybe a comparison with the ideal importance sampler (i.e. the one that does not divide by the sum of the weights since using normalised versions of the target and importance densities) would have helped in the comparison.

## arXiv recent postings

Posted in Statistics, University life with tags , , , , , , , on June 14, 2013 by xi'an

As I glanced thru the recent arXiv postings in statistics, I found an overload of potential goodies, many more than I could handle in a reasonable time (unless I give up the NYT altogether!)…

For instance, Paulo Marques—a regular ‘Og commenter—wrote about an extension of Chib’s formula when the posterior is approximated in a non-parametric manner. (This was an idea I had toyed with at a time without pursuing it anywhere…)   I wonder at the stability of the approximation for two reasons: (i) Chib’s or Bayes’ formula does not require an expectation as the ratio is constant in θ; averaging over realisations of θ could have negative effects on the stability of the approximation (or not); (ii) using a non-parametric estimate will see the performances of Chib’s approximation plummet as the dimension of θ increases, most likely.

Another post is by Nogales, Pérez, and Monfort and deals with a Monte Carlo formula for conditional expectations that leaves me quite puzzled… More than excited about a potential application in ABC. Indeed, the idea is to replace the conditioning on the (hard) equality with a soft ball around the equality with a tolerance level ε. Since the authors do not seem particularly interested in ABC, I do no see the point in this technique. The first example they present does not account for the increasing computational effort as ε decreases. The second part of the paper deals with the best invariant estimator of the position parameter of a general half-normal distribution, using two conditional expectations in Proposition 2. I wonder how the method compares with bridge sampling and MCMC methods. As the paper seems to have gone beyond the Monte Carlo issue at this stage. And focus on this best invariant estimation…

Then Feroz, Hobson,  Cameron and Pettitt posted a work on importance nested sampling, on which I need to spend more time given the connection to our nested sampling paper with Nicolas. The paper builds on the Multinest software developped by Feroz, Hobson and co-authors. (Whose above picture is borrowed from.)

A side post on the moments of the partial non-central chi-square distribution function by Gill, Segura and Temme, since I always liked the specifics of this distribution… No comments so far!