## absurdly unbiased estimators

Posted in Books, Kids, Statistics with tags , , , , , , , on November 8, 2018 by xi'an

“…there are important classes of problems for which the mathematics forces the existence of such estimators.”

Recently I came through a short paper written by Erich Lehmann for The American Statistician, Estimation with Inadequate Information. He analyses the apparent absurdity of using unbiased estimators or even best unbiased estimators in settings like the Poisson P(λ) observation X producing the (unique) unbiased estimator of exp(-bλ) equal to $(1-b)^x$

which is indeed absurd when b>1. My first reaction to this example is that the question of what is “best” for a single observation is not very meaningful and that adding n independent Poisson observations replaces b with b/n, which gets eventually less than one. But Lehmann argues that the paradox stems from a case of missing information, as for instance in the Poisson example where the above quantity is the probability P(T=0) that T=0, when T=X+Y, Y being another unobserved Poisson with parameter (b-1)λ. In a lot of such cases, there is no unbiased estimator at all. When there is any, it must take values outside the (0,1) range, thanks to a lemma shown by Lehmann that the conditional expectation of this estimator given T is either zero or one.

I find the short paper quite interesting in exposing some reasons why the estimators cannot find enough information within the data (often a single point) to achieve an efficient estimation of the targeted function of the parameter, even though the setting may appear rather artificial.

## unbiased estimation of log-normalising constants

Posted in Statistics with tags , , , , , , , on October 16, 2018 by xi'an Maxime Rischard, Pierre Jacob, and Natesh Pillai [warning: both of whom are co-authors and friends of mine!] have just arXived a paper on the use of path sampling (a.k.a., thermodynamic integration) for log-constant unbiased approximation and the resulting consequences on Bayesian model comparison by X validation. If the goal is the estimation of the log of a ratio of two constants, creating an artificial path between the corresponding distributions and looking at the derivative at any point of this path of the log-density produces an unbiased estimator. Meaning that random sampling along the path, corrected by the distribution of the sampling still produces an unbiased estimator. From there the authors derive an unbiased estimator for any X validation objective function, CV(V,T)=-log p(V|T), taking m observations T in and leaving n-m observations T out… The marginal conditional log density in the criterion is indeed estimated by an unbiased path sampler, using a powered conditional likelihood. And unbiased MCMC schemes à la Jacob et al. for simulating unbiased MCMC realisations of the intermediary targets on the path. Tuning it towards an approximately constant cost for all powers.

So in all objectivity and fairness (!!!), I am quite excited by this new proposal within my favourite area! Or rather two areas since it brings together the estimation of constants and an alternative to Bayes factors for Bayesian testing. (Although the paper does not broach upon the calibration of the X validation values.)

## Bayesian synthetic likelihood

Posted in Statistics with tags , , , , , , , on December 13, 2017 by xi'an Leah Price, Chris Drovandi, Anthony Lee and David Nott published earlier this year a paper in JCGS on Bayesian synthetic likelihood, using Simon Wood’s synthetic likelihood as a substitute to the exact likelihood within a Bayesian approach. While not investigating the theoretical properties of this approximate approach, the paper compares it with ABC on some examples. In particular with respect to the number n of Monte Carlo replications used to approximate the mean and variance of the Gaussian synthetic likelihood.

Since this approach is most naturally associated with an MCMC implementation, it requires new simulations of the summary statistics at each iteration, without a clear possibility to involve parallel runs, in contrast to ABC. However in the final example of the paper, the authors reach values of n of several thousands, making use of multiple cores relevant, if requiring synchronicity and checks at every MCMC iteration.

The authors mention that “ABC can be viewed as a pseudo-marginal method”, but this has a limited appeal since the pseudo-marginal is a Monte Carlo substitute for the ABC target, not the original target. Similarly, there exists an unbiased estimator of the Gaussian density due to Ghurye and Olkin (1969) that allows to perceive the estimated synthetic likelihood version as a pseudo-marginal, once again wrt a target that differs from the original one. And the bias reappears under mis-specification, that is when the summary statistics are not normally distributed. It seems difficult to assess this normality or absence thereof in realistic situations.

“However, when the distribution of the summary statistic is highly irregular, the output of BSL cannot be trusted, while ABC represents a robust alternative in such cases.”

To make synthetic likelihood and ABC algorithms compatible, the authors chose a Normal kernel for ABC. Still, the equivalence is imperfect in that the covariance matrix need be chosen in the ABC case and is estimated in the synthetic one. I am also lost to the argument that the synthetic version is more efficient than ABC, in general (page 8). As for the examples, the first one uses a toy Poisson posterior with a single sufficient summary statistic, which is not very representative of complex situations where summary statistics are extremes or discrete. As acknowledged by the authors this is a case when the Normality assumption applies. For an integer support hidden process like the Ricker model, normality vanishes and the outcomes of ABC and synthetic likelihood differ, which makes it difficult to compare the inferential properties of both versions (rather than the acceptance rates), while using a 13-dimension statistic for estimating a 3-dimension parameter is not recommended for ABC, as discussed by Li and Fearnhead (2017). The same issue appears in the realistic cell motility example, with 145 summaries versus two parameters. (In the philogenies studied by DIYABC, the number of summary statistics is about the same but we now advocate a projection to the parameter dimension by the medium of random forests.)

Given the similarity between both approaches, I wonder at a confluence between them, where synthetic likelihood could maybe be used to devise PCA on the summary statistics and facilitate their projection on a space with much smaller dimensions. Or estimating the mean and variance functions in the synthetic likelihood towards producing directly simulations of the summary statistics.

## a closed-form but intractable birthday

Posted in Books, Kids, pictures, Statistics with tags , , , , , on November 7, 2016 by xi'an An interesting [at least for me!] question found on X Validated yesterday, namely how to simulate efficiently from the generalised birthday problem (or paradox) distribution, which provides the probability of finding exactly k different birthday dates as $\mathbb{P}(V = k) = \binom{n}{k}\displaystyle\sum_{i=0}^k (-1)^i \binom{k}{i} \left(\frac{k-i}{n}\right)^m$

where m is the number of individuals with a random birthday and n the number of days (e.g., n=365). The paradox with this closed-form formula (found by the inclusion-exclusion rule) is that it is too unstable to use per se. While it is always possible to run m draws from a uniform over {1,…,n} and count the number of different values, e.g.,

x=length(unique(sample(1:n,m,rep=TRUE)))


this takes much more time than using the exact distribution, if available:

sample(1:m,1e6,rep=TRUE,prob=eff[-1])


I played a little bit with the notion of using an unbiased estimator of the said probability, but the alternating series means that the unbiased estimator may end up being negative, which is an issue met in recent related papers like the famous Russian Roulette.

## importance sampling by kernel smoothing [experiment]

Posted in Books, R, Statistics with tags , , , , , , on October 13, 2016 by xi'an Following my earlier post on Delyon and Portier’s proposal to replacing the true importance distribution ƒ with a leave-one-out (!) kernel estimate in the importance sampling estimator, I ran a simple one-dimensional experiment to compare the performances of the traditional method with this alternative. The true distribution is a N(0,½) with an importance proposal a N(0,1) distribution, the target is the function h(x)=x⁶ [1-0.9 sin(3x)], n=2643 is the number of simulations, and the density is estimated via the call to the default density() R function. The first three boxes are for the regular importance sampler, and the kernel and the corrected kernel versions of Delyon and Portier, while the second set of three considers the self-normalised alternatives. In all kernel versions, the variability is indeed much lower than with importance sampling, but the bias is persistent, with no clear correction brought by the first order proposal in the paper, while those induce a significant increase in computing time:

> benchmark(
+ for (t in 1:100){
+   x=sort(rnorm(N));fx=dnorm(x)
+  imp1=dnorm(x,sd=.5)/fx})

replicas elapsed relative user.child sys.child
1        100     7.948    7.94       0.012
> benchmark(
+ for (t in 1:100){
+   x=sort(rnorm(N));hatf=density(x)
+   hatfx=approx(hatf$x,hatf$y, x)$y + imp2=dnorm(x,sd=.5)/hatfx}) replicas elapsed relative user.child sys.child 1 100 19.272 18.473 0.94 > benchmark( + for (t in 1:100){ + x=sort(rnorm(N));hatf=density(x) + hatfx=approx(hatf$x,hatf$y, x)$y
+   bw=hatf\$bw
+   for (i in 1:N) Kx[i]=1-sum((dnorm(x[i],
+     mean=x[-i],sd=bw)-hatfx[i])^2)/NmoNmt/hatfx[i]^2
+   imp3=dnorm(x,sd=.5)*Kx/hatfx})

replicas elapsed relative user.child sys.child
1        100     11378.38  7610.037  17.239


which follows from the O(n) cost in deriving the kernel estimate for all observations (and I did not even use the leave-one-out option…) The R computation of the variance is certainly not optimal, far from it, but those enormous values give an indication of the added cost of the step, which does not even seem productive in terms of variance reduction… [Warning: the comparison is only done over one model and one target integrand, thus does not pretend at generality!]

## importance sampling by kernel smoothing

Posted in Books, Statistics with tags , , , , , on September 27, 2016 by xi'an As noted in an earlier post, Bernard Delyon and François Portier have recently published a paper in Bernoulli about improving the speed of convergence of an importance sampling estimator of

∫ φ(x) dx

when replacing the true importance distribution ƒ with a leave-one-out (!) kernel estimate in the importance sampling estimator… They also consider a debiased version that converges even faster at the rate $n h_n^{d/2}$

where n is the sample size, h the bandwidth and d the dimension. There is however a caveat, namely a collection of restrictive assumptions on the components of this new estimator:

1. the integrand φ has a compact support, is bounded, and satisfies some Hölder-type regularity condition;
2. the importance distribution ƒ is upper and lower bounded, its r-th order derivatives are upper bounded;
3. the kernel K is order r, with exponential tails, and symmetric;
4. the leave-one-out correction for bias has a cost O(n²) compared with O(n) cost of the regular Monte-Carlo estimator;
5. the bandwidth h in the kernel estimator has a rate in n linked with the dimension d and the regularity indices of ƒ and φ

and this bandwidth needs to be evaluated as well. In the paper the authors rely on a control variate for which the integral is known, but which “looks like φ”, a strong requirement in appearance only since this new function is the convolution of φ with a kernel estimate of ƒ which expectation is the original importance estimate of the integral. This sounds convoluted but this is a generic control variate nonetheless! But this is also a costly step. Because of the kernel estimation aspect, the method deteriorates with the dimension of the variate x. However, since φ(x) is a real number, I wonder if running the non-parametric density estimate directly on the sample of φ(x)’s would lead to an improved estimator…

## MCqMC [#3]

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , on August 20, 2016 by xi'an On Thursday, Christoph Aistleiter [from TU Gräz] gave a plenary talk at MCqMC 2016 around Hermann Weyl’s 1916 paper, Über die Gleichverteilung von Zahlen mod. Eins, which demonstrates that the sequence a, 22a, 32a, … mod 1 is uniformly distributed on the unit interval when a is irrational. Obviously, the notion was not introduced for simulation purposes, but the construction applies in this setting! At least in a theoretical sense. Since for instance the result that the sequence (a,a²,a³,…) mod 1 being uniformly distributed for almost all a’s has not yet found one realisation a. But a nice hour of history of mathematics and number theory: it is not that common we hear the Riemann zeta function mentioned in a simulation conference!

The following session was a nightmare in that I wanted to attend all four at once! I eventually chose the transport session, in particular because Xiao-Li advertised it at the end of my talk. The connection is that his warp bridge sampling technique provides a folding map between modes of a target. Using a mixture representation of the target and folding all components to a single distribution. Interestingly, this transformation does not require a partition and preserves the normalising constants [which has a side appeal for bridge sampling of course]. In a problem with an unknown number of modes, the technique could be completed by [our] folding in order to bring the unobserved modes into the support of the folded target. Looking forward the incoming paper! The last talk of this session was by Matti Vihola, connecting multi-level Monte Carlo and unbiased estimation à la Rhee and Glynn, paper that I missed when it got arXived last December.

The last session of the day was about probabilistic numerics. I have already discussed extensively about this approach to numerical integration, to the point of being invited to the NIPS workshop as a skeptic! But this was an interesting session, both with introductory aspects and with new ones from my viewpoint, especially Chris Oates’ description of a PN method for handling both integrand and integrating measure as being uncertain. Another arXival that went under my decidedly deficient radar.