Archive for non-parametric kernel estimation

hitting a wall

Posted in Books, Kids, R, Statistics, University life on July 5, 2018 by xi'an

Once in a while, or a wee bit more frequently (!), it proves impossible to communicate with a contributor of a question on X validated. A recent instance was about simulating from a multivariate kernel density estimate where the kernel terms at x¹,x²,… are Gaussian kernels applied to the inverses of the norms |x-x¹|, |x-x²|,… rather than to the norms themselves, as in the usual formulation. The reason for using this type of kernel is unclear, as it certainly does not converge to an estimate of the density of the sample x¹,x²,… as the sample size grows, since it excludes a neighbourhood of each point in the sample. And since the kernel term tends to a non-zero constant at infinity, the support of the density estimate is restricted to the hypercube [0,1]x…x[0,1], again with unclear motivation. No mention is made of the bandwidth adopted for this kernel. If one takes this exotic density as a given, the question is rather straightforward, as the support is compact and the density bounded, so a vanilla accept-reject can be implemented. As illustrated by the massive number of comments on that entry, the exchange did not work out: the contributor adopted a fairly bellicose attitude towards suggestions from the moderators of the site and could not see the point of our requests for clarification, despite plotting a version of the kernel that had its maximum [and not its minimum] at x¹… After a few attempts, including writing a complete answer, from which the above graph is taken (based on an initial understanding of the support being for (x-x¹), …), I gave up and deleted all my entries on that question.
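
For the record, here is a minimal accept-reject sketch for that exotic construct, under assumptions entirely of my own (dimension two, bandwidth h=0.1, ten uniformly generated sample points, an unnormalised density built from Gaussian kernels applied to the inverse norms and restricted to the unit square, and an upper bound obtained by a crude grid search):

# accept-reject for the unnormalised density
#   g(x) = sum_i exp( -1 / (2 h^2 |x - x_i|^2) )   on [0,1]x[0,1]
# (Gaussian kernel applied to the inverse norm, as described above)
set.seed(1)
d <- 2; h <- 0.1
xs <- matrix(runif(10 * d), ncol = d)       # the sample points x^1, ..., x^10

g <- function(x) {                          # unnormalised density at x
  nrm2 <- rowSums((xs - matrix(x, nrow(xs), d, byrow = TRUE))^2)
  sum(exp(-1 / (2 * h^2 * nrm2)))
}

# crude upper bound on g over the unit square via a grid search
grid <- as.matrix(expand.grid(seq(0, 1, length.out = 50), seq(0, 1, length.out = 50)))
M <- 1.05 * max(apply(grid, 1, g))

# vanilla accept-reject with uniform proposals on the unit square
sims <- matrix(NA, 1e3, d); i <- 0
while (i < nrow(sims)) {
  prop <- runif(d)
  if (runif(1) * M < g(prop)) { i <- i + 1; sims[i, ] <- prop }
}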

ABC²DE

Posted in Books, Statistics on June 25, 2018 by xi'an

A recent arXival on a new version of ABC based on kernel estimators (but one could argue that all ABC versions are based on kernel estimators, one way or another). In this ABC-CDE version, Izbicki, Lee and Pospisil [from CMU, hence the picture!] argue that past attempts failed to exploit the full advantages of kernel methods, including the 2016 ABCDE method (from Edinburgh) briefly covered on this blog. (As an aside, CDE stands for conditional density estimation.) They also criticise these attempts for selecting summary statistics and hence failing to achieve sufficiency, which seems a non-issue to me, as already discussed numerous times on the ‘Og. One point of particular interest in the long list of drawbacks found in the paper is the inability to compare several estimates of the posterior density, since this is not directly ingrained in the Bayesian construct. Unless one moves to higher ground by calling for Bayesian non-parametrics within the ABC algorithm, a perspective which I am not aware has been pursued so far…

The selling points of ABC-CDE are that (a) the true focus is on estimating a conditional density at the observable x⁰ rather than everywhere: simulations from the reference table are rejected if the pseudo-observations are too far from x⁰ (which implies using a relevant distance and/or choosing adequate summary statistics), and a conditional density estimator is then built from this subsample (which makes me wonder about a double use of the data), as in the toy sketch below.
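
In its most rudimentary form, and with toy ingredients of my own (a Normal-Normal model, the raw observation as summary, a 1% quantile tolerance, and a plain density() call standing in for FlexCode), the rejection-then-estimation step reads as:

# keep the reference-table draws whose pseudo-observations fall near x0 and
# run a kernel density estimate on the corresponding parameters
set.seed(101)
N <- 1e5
theta <- rnorm(N, 0, 2)                  # draws from the prior N(0, 2^2)
x     <- rnorm(N, mean = theta, sd = 1)  # pseudo-observations from the model
x0    <- 1.5                             # the "observed" value
eps   <- quantile(abs(x - x0), 0.01)     # tolerance keeping 1% of the table
keep  <- abs(x - x0) <= eps
post  <- density(theta[keep])            # conditional density estimate at x0
# exact posterior in this toy model: N(4*x0/5, sd = sqrt(4/5)), for checking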

The specific density estimation approach adopted for this is called FlexCode and relates to an earlier if recent paper from Izbicki and Lee I did not read. As in many other density estimation approaches, they use an orthonormal basis (including wavelets) in low dimension to estimate the marginal of the posterior for one or a few components of the parameter θ, noticing that the posterior marginal is a weighted average of the terms in the basis, where the weights are the posterior expectations of the basis functions themselves (as illustrated below). All fine! The next step is to compare [posterior] estimators through an integrated squared error loss that does not integrate the prior or posterior and does not tell much about the quality of the approximation for Bayesian inference in my opinion. It is furthermore approximated by a doubly integrated [over parameter and pseudo-observation] squared error loss, using the ABC(ε) sample from the prior predictive. And the approximation error only depends on the regularity of the error, that is, the difference between posterior and approximated posterior. Which strikes me as odd, since the Monte Carlo error should take over but does not appear at all. I am thus unclear as to whether or not the convergence results are that relevant. (A difficulty with this paper is its strong dependence on the earlier one, as it keeps referencing one version or another of FlexCode. Without reading the original one, I spotted a mention made of the use of random forests for selecting summary statistics of interest, without detailing the difference with our own ABC random forest papers (for both model selection and estimation). For instance, the remark that “nuisance statistics do not affect the performance of FlexCode-RF much” reproduces what we observed with ABC-RF.)
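
As an illustration of the basis-expansion principle only, and not of the actual FlexCode regression machinery: with an orthonormal cosine basis on [0,1], the coefficients of a posterior marginal are the posterior expectations of the basis functions and can thus be crudely estimated by averages over an ABC sample (here a made-up one):

set.seed(7)
# orthonormal cosine basis on [0,1]: phi_0 = 1, phi_j(t) = sqrt(2) cos(pi j t)
phi_j <- function(j, t) if (j == 0) rep(1, length(t)) else sqrt(2) * cos(pi * j * t)

# coefficients = posterior expectations of the basis functions, estimated here
# by sample means over the (pretend) ABC output tabc, rescaled to [0,1]
fit_basis <- function(tabc, J = 15) {
  beta <- sapply(0:J, function(j) mean(phi_j(j, tabc)))
  function(t) sapply(t, function(ti) sum(beta * sapply(0:J, function(j) phi_j(j, ti))))
}

tabc <- rbeta(1e4, 3, 5)                    # fake "ABC sample" from a Beta(3,5) posterior
fhat <- fit_basis(tabc)
curve(fhat(x), 0, 1)                        # basis reconstruction of the marginal
curve(dbeta(x, 3, 5), add = TRUE, lty = 2)  # true marginal for comparison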

The long experiment section only compares against the most standard rejection ABC algorithm, without accounting for the many alternatives produced in the literature (like Li and Fearnhead, 2018, which uses Beaumont et al.’s 2002 scheme, along with importance sampling improvements, or ours). In the case of the real cosmological data, used twice, I am uncertain about the comparison, as I presume the truth is unknown. Furthermore, from having worked on similar data a dozen years ago, it is unclear why ABC is necessary in such a context (although I remember us running a test about ABC at the Paris astrophysics institute once).

postprocessing for ABC

Posted in Books, Statistics on June 1, 2017 by xi'an

Two weeks ago, G.S. Rodrigues, Dennis Prangle and Scott Sisson arXived a paper on recalibrating ABC output to make it correctly calibrated (in the frequentist sense). As in earlier papers, it takes advantage of the fact that the tail posterior probability should be uniformly distributed at the true value of the [simulated] parameter behind the [simulated] data. And as in Prangle et al. (2014), it relies on a copula representation. The main notion is that the marginal posteriors can be reasonably approximated by non-parametric kernel estimators, which means that an F⁰∘F⁻¹ transform can be applied to an ABC reference table in a fully non-parametric extension of Beaumont et al. (2002). Besides the issue that F is an approximation, I wonder about the computing cost of this approach, given that computing the post-processing transforms comes at a cost of O(pT²), where p is the dimension of the parameter and T the size of the ABC learning set… One question that came to me while discussing the paper with Jean-Michel Marin is why one would use F(θ¹|s) instead of directly a uniform U(0,1), since in theory this should be a uniform U(0,1).
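
A back-of-the-envelope univariate version of such a cdf and inverse-cdf transform, with choices of my own (Gaussian kernel weights in the summary space and a weighted empirical cdf standing in for the estimated marginal posterior cdf), which also makes the quadratic cost visible:

# rough univariate sketch (not the authors' code): theta and s hold the
# parameter and summary of each reference-table draw, s0 is the observed
# summary, and the conditional cdf F(.|s) is replaced by a kernel-weighted
# empirical cdf
recalibrate <- function(theta, s, s0, h = sd(s) / 10) {
  Tabc <- length(theta)
  u <- numeric(Tabc)
  for (i in 1:Tabc) {                       # estimate F(theta_i | s_i), O(T) each
    wi   <- dnorm(s[i], mean = s, sd = h)   # kernel weights around s_i
    u[i] <- sum(wi * (theta <= theta[i])) / sum(wi)
  }
  # invert the estimated cdf at the observed summary s0 (a weighted quantile)
  o     <- order(theta)
  w0    <- dnorm(s0, mean = s, sd = h)
  Fhat0 <- cumsum(w0[o]) / sum(w0)
  Fhat0[Tabc] <- 1                          # guard against rounding at the upper end
  sapply(u, function(ui) theta[o][which.max(Fhat0 >= ui)])
}

Applied marginal by marginal over the p components of the parameter, this indeed amounts to the O(pT²) figure above.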

importance sampling by kernel smoothing [experiment]

Posted in Books, R, Statistics on October 13, 2016 by xi'an

Following my earlier post on Delyon and Portier’s proposal to replace the true importance distribution ƒ with a leave-one-out (!) kernel estimate in the importance sampling estimator, I ran a simple one-dimensional experiment to compare the performances of the traditional method with this alternative. The true distribution is a N(0,½), the importance proposal is a N(0,1) distribution, the integrand is the function h(x)=x⁶ [1-0.9 sin(3x)], n=2643 is the number of simulations, and the density is estimated via a call to the default density() R function. The first three boxes are for the regular importance sampler, and the kernel and the corrected kernel versions of Delyon and Portier, while the second set of three considers the self-normalised alternatives. In all kernel versions, the variability is indeed much lower than with importance sampling, but the bias is persistent, with no clear correction brought by the first order proposal in the paper, while the kernel versions induce a significant increase in computing time:

> library(rbenchmark)   # benchmark() comes from the rbenchmark package
> N=2643                # simulation size n used in the experiment above
> benchmark(
+ for (t in 1:100){
+   x=sort(rnorm(N));fx=dnorm(x)     # proposal sample and true proposal density
+   imp1=dnorm(x,sd=.5)/fx})         # regular importance weights

replications elapsed relative user.child sys.child
1        100     7.948    7.94       0.012
> benchmark(
+ for (t in 1:100){
+   x=sort(rnorm(N));hatf=density(x)   # kernel estimate of the proposal density
+   hatfx=approx(hatf$x,hatf$y, x)$y   # interpolated at the sample points
+   imp2=dnorm(x,sd=.5)/hatfx})        # kernel-based importance weights
    
replications elapsed relative user.child sys.child
1        100      19.272  18.473     0.94

> benchmark(
+ for (t in 1:100){
+   x=sort(rnorm(N));hatf=density(x)
+   hatfx=approx(hatf$x,hatf$y, x)$y
+   bw=hatf$bw
+   Kx=rep(0,N)                        # correction terms of Delyon and Portier
+   for (i in 1:N) Kx[i]=1-sum((dnorm(x[i],
+     mean=x[-i],sd=bw)-hatfx[i])^2)/NmoNmt/hatfx[i]^2
+   imp3=dnorm(x,sd=.5)*Kx/hatfx})     # corrected kernel importance weights

replications elapsed relative user.child sys.child
1        100     11378.38  7610.037  17.239

which follows from the O(n) cost of evaluating the kernel terms at each of the n observations, hence an O(n²) total (and I did not even use the leave-one-out option…). The R computation of the variance term in the correction is certainly not optimal, far from it, but those enormous values give an indication of the added cost of this step, which does not even seem productive in terms of variance reduction… [Warning: the comparison is only done over one model and one target integrand, thus does not pretend to generality!]

importance sampling by kernel smoothing

Posted in Books, Statistics on September 27, 2016 by xi'an

As noted in an earlier post, Bernard Delyon and François Portier have recently published a paper in Bernoulli about improving the speed of convergence of an importance sampling estimator of

∫ φ(x) dx

when replacing the true importance distribution ƒ with a leave-one-out (!) kernel estimate in the importance sampling estimator… They also consider a debiased version that converges even faster at the rate

n h_n^{d/2}

where n is the sample size, h the bandwidth and d the dimension. There is however a caveat, namely a collection of restrictive assumptions on the components of this new estimator:

  1. the integrand φ has a compact support, is bounded, and satisfies some Hölder-type regularity condition;
  2. the importance distribution ƒ is upper and lower bounded, and its r-th order derivatives are upper bounded;
  3. the kernel K is of order r, with exponential tails, and symmetric;
  4. the leave-one-out correction for bias has a cost O(n²), compared with the O(n) cost of the regular Monte-Carlo estimator;
  5. the bandwidth h in the kernel estimator has a rate in n linked with the dimension d and the regularity indices of ƒ and φ;

and this bandwidth needs to be evaluated as well. In the paper the authors rely on a control variate for which the integral is known, but which “looks like φ”, a strong requirement in appearance only, since this new function is the convolution of φ with a kernel estimate of ƒ whose expectation is the original importance estimate of the integral. This sounds convoluted but it is a generic control variate nonetheless! But this is also a costly step. Because of the kernel estimation aspect, the method deteriorates with the dimension of the variate x. However, since φ(x) is a real number, I wonder if running the non-parametric density estimate directly on the sample of φ(x)’s would lead to an improved estimator…
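
For concreteness, a bare-bones rendering of the basic idea, with a one-dimensional toy of my own (Gaussian sampling density, Gaussian kernel with rule-of-thumb bandwidth) and none of the paper's debiasing or control-variate machinery:

# estimator of ∫ φ(x) dx with the true density f replaced by a
# leave-one-out kernel estimate (one-dimensional toy example)
set.seed(42)
phi <- function(x) dnorm(x, mean = 1, sd = 0.3)  # toy integrand, true integral is 1
n <- 1e3
x <- rnorm(n)                                    # sample from f = N(0,1)
h <- bw.nrd0(x)                                  # rule-of-thumb bandwidth

est_true <- mean(phi(x) / dnorm(x))              # standard importance sampling

# leave-one-out kernel estimate of f at each sample point
floo <- sapply(1:n, function(i) mean(dnorm(x[i], mean = x[-i], sd = h)))
est_loo <- mean(phi(x) / floo)

c(standard = est_true, leave_one_out = est_loo)  # both should be close to 1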