## importance sampling by kernel smoothing [experiment]

Posted in Books, R, Statistics with tags , , , , , , on October 13, 2016 by xi'an

Following my earlier post on Delyon and Portier’s proposal to replacing the true importance distribution ƒ with a leave-one-out (!) kernel estimate in the importance sampling estimator, I ran a simple one-dimensional experiment to compare the performances of the traditional method with this alternative. The true distribution is a N(0,½) with an importance proposal a N(0,1) distribution, the target is the function h(x)=x⁶ [1-0.9 sin(3x)], n=2643 is the number of simulations, and the density is estimated via the call to the default density() R function. The first three boxes are for the regular importance sampler, and the kernel and the corrected kernel versions of Delyon and Portier, while the second set of three considers the self-normalised alternatives. In all kernel versions, the variability is indeed much lower than with importance sampling, but the bias is persistent, with no clear correction brought by the first order proposal in the paper, while those induce a significant increase in computing time:

> benchmark(
+ for (t in 1:100){
+   x=sort(rnorm(N));fx=dnorm(x)
+  imp1=dnorm(x,sd=.5)/fx})

replicas elapsed relative user.child sys.child
1        100     7.948    7.94       0.012
> benchmark(
+ for (t in 1:100){
+   x=sort(rnorm(N));hatf=density(x)
+   hatfx=approx(hatf$x,hatf$y, x)$y + imp2=dnorm(x,sd=.5)/hatfx}) replicas elapsed relative user.child sys.child 1 100 19.272 18.473 0.94 > benchmark( + for (t in 1:100){ + x=sort(rnorm(N));hatf=density(x) + hatfx=approx(hatf$x,hatf$y, x)$y
+   bw=hatf\$bw
+   for (i in 1:N) Kx[i]=1-sum((dnorm(x[i],
+     mean=x[-i],sd=bw)-hatfx[i])^2)/NmoNmt/hatfx[i]^2
+   imp3=dnorm(x,sd=.5)*Kx/hatfx})

replicas elapsed relative user.child sys.child
1        100     11378.38  7610.037  17.239


which follows from the O(n) cost in deriving the kernel estimate for all observations (and I did not even use the leave-one-out option…) The R computation of the variance is certainly not optimal, far from it, but those enormous values give an indication of the added cost of the step, which does not even seem productive in terms of variance reduction… [Warning: the comparison is only done over one model and one target integrand, thus does not pretend at generality!]

## importance sampling by kernel smoothing

Posted in Books, Statistics with tags , , , , , on September 27, 2016 by xi'an

As noted in an earlier post, Bernard Delyon and François Portier have recently published a paper in Bernoulli about improving the speed of convergence of an importance sampling estimator of

∫ φ(x) dx

when replacing the true importance distribution ƒ with a leave-one-out (!) kernel estimate in the importance sampling estimator… They also consider a debiased version that converges even faster at the rate

$n h_n^{d/2}$

where n is the sample size, h the bandwidth and d the dimension. There is however a caveat, namely a collection of restrictive assumptions on the components of this new estimator:

1. the integrand φ has a compact support, is bounded, and satisfies some Hölder-type regularity condition;
2. the importance distribution ƒ is upper and lower bounded, its r-th order derivatives are upper bounded;
3. the kernel K is order r, with exponential tails, and symmetric;
4. the leave-one-out correction for bias has a cost O(n²) compared with O(n) cost of the regular Monte-Carlo estimator;
5. the bandwidth h in the kernel estimator has a rate in n linked with the dimension d and the regularity indices of ƒ and φ

and this bandwidth needs to be evaluated as well. In the paper the authors rely on a control variate for which the integral is known, but which “looks like φ”, a strong requirement in appearance only since this new function is the convolution of φ with a kernel estimate of ƒ which expectation is the original importance estimate of the integral. This sounds convoluted but this is a generic control variate nonetheless! But this is also a costly step. Because of the kernel estimation aspect, the method deteriorates with the dimension of the variate x. However, since φ(x) is a real number, I wonder if running the non-parametric density estimate directly on the sample of φ(x)’s would lead to an improved estimator…

## exam question

Posted in Kids, Statistics, University life with tags , , , , , , , , , on June 24, 2016 by xi'an

A question for my third year statistics exam that I borrowed from Cross Validated: no student even attempted to solve this question…!

And another one borrowed from the highly popular post on the random variable [almost] always smaller than its mean!

## posterior predictive checks for admixture models

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , on July 8, 2014 by xi'an

In a posting coincidence, just a few days after we arXived our paper on ABC model choice with random forests, where we use posterior predictive errors for assessing the variability of the random forest procedure, David Mimno, David Blei, and Barbara Engelhardt arXived a paper on posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure, which deals with similar data and models, while also using the posterior predictive as a central tool. (Marginalia: the paper is a wee bit difficult to read [esp. with France-Germany playing in the airport bar!] as the modelling is only clearly described at the very end. I suspect this arXived version was put together out of a submission to a journal like Nature or PNAS, with mentions of a Methods section that does not appear here and of Supplementary Material that turned into subsections of the Discussion section.)

The dataset are genomic datasets made of SNPs (single nucleotide polymorphisms). For instance, the first (HapMap) dataset corresponds to 1,043 individuals and 468,167 SNPs. The model is simpler than Kingman’s coalescent, hence its likelihood does not require ABC steps to run inference. The admixture model in the paper is essentially a mixture model over ancestry indices with individual dependent weights with Bernoulli observations, hence resulting into a completed likelihood of the form

$\prod_{i=1}^n\prod_{\ell=1}^L\prod_j \phi_{\ell,z_{i,\ell,j}}^{x_{i,\ell,j}}(1-\phi_{\ell,z_{i,\ell,j}})^{1-x_{i,\ell,j}}\theta_{i,z_{i,\ell,j}}$

(which looks more formidable than it truly is!). Regular Bayesian inference is thus possible in this setting, implementing e.g. Gibbs sampling. The authors chose instead to rely on EM and thus derived the maximum likelihood estimators of the (many) parameters of the admixture. And of the latent variables z. Their posterior predictive check is based on the simulation of pseudo-observations (as in ABC!) from the above likelihood, with parameters and latent variables replaced with their EM estimates (unlike ABC). There is obviously some computational reason in doing this instead of simulating from the posterior, albeit implicit in the paper. I am however slightly puzzled by the conditioning on the latent variable estimate , as its simulation is straightforward and as a latent variable is more a missing observation than a parameter. Given those 30 to 100 replications of the data, an empirical distribution of a discrepancy function is used to assess whether or not the equivalent discrepancy for the observation is an outlier. If so, the model is not appropriate for the data. (Interestingly, the discrepancy is measured via the Bayes factor of z-scores.)

The connection with our own work is that the construction of discrepancy measures proposed in this paper could be added to our already large collection of summary statistics to check to potential impact in model comparison, i.e. for a significant contribution to the random forest nodes.  Conversely, the most significant summary statistics could then be tested as discrepancy measures. Or, more in tune with our Series B paper on the proper selection of summary variables, the distribution of those discrepancy measures could be compared across potential models. Assuming this does not take too much computing power…

## interacting particle systems as… facebook

Posted in Books, Statistics, University life with tags , , , , , on October 8, 2013 by xi'an

Among the many interesting arXived papers this Friday, I first read David Aldous’ “Interacting particle systems as stochastic social dynamics“. Being unfamiliar with those systems (despite having experts in offices down the hall in Paris-Dauphine!), I read this typology of potential models (published in Bernoulli) with a keen interest! The paper stemmed from a short course given in 2012 in Warwick and Cornell. I think the links exhibited there with (social) networks should be relevant for statisticians working on networks (!) and dynamic graphical models. Statistics is not mentioned in the paper, except for the (misleading) connection with statistical physics, but there is obviously a huge potential for statistical inference, from parameter estimation to model comparison. (As pointed out by David Aldous, there is usually “no data or evidence linking the model to the asserted real-world phenomena”.) The paper then introduces some basic models like the token, the pandemic and the averaging process, plus the voter model that relates to Kingman’s coalescent. A very nice read opening new vistas for sure (and a source of projects for graduate students most certainly!)