## The answer is e, what was the question?!

Posted in Books, R, Statistics with tags , , , , , on February 12, 2016 by xi'an

A rather exotic question on X validated: since π can be approximated by random sampling over a unit square, is there an equivalent for approximating e? This is an interesting question, as, indeed, why not focus on e rather than π after all?! But very quickly the very artificiality of the problem comes back to hit one in one’s face… With no restriction, it is straightforward to think of a Monte Carlo average that converges to e as the number of simulations grows to infinity. However, such methods like Poisson and normal simulations require some complex functions like sine, cosine, or exponential… But then someone came up with a connection to the great Russian probabilist Gnedenko, who gave as an exercise that the average number of uniforms one needs to add to exceed 1 is exactly e, because it writes as

$\sum_{n=0}^\infty\frac{1}{n!}=e$

(The result was later detailed in the American Statistician as an introductory simulation exercise akin to Buffon’s needle.) This is a brilliant solution as it does not involve anything but a standard uniform generator. I do not think it relates in any close way to the generation from a Poisson process with parameter λ=1 where the probability to exceed one in one step is e⁻¹, hence deriving  a Geometric variable from this process leads to an unbiased estimator of e as well. As an aside, W. Huber proposed the following elegantly concise line of R code to implement an approximation of e:

1/mean(n*diff(sort(runif(n+1))) > 1)

Hard to beat, isn’t it?! (Although it is more exactly a Monte Carlo approximation of

$\left(1-\frac{1}{n}\right)^n$

which adds a further level of approximation to the solution….)

## difference between Metropolis, Gibbs, importance, and rejection sampling

Posted in Books, Kids, Statistics with tags , , , , , on December 14, 2015 by xi'an

Last week, while I was preparing my talk for the NIPS workshop, I spotted this fairly generic question on X validated. And decided to procrastinate by answering through generic comments on the pros and cons of each method. This is a challenging if probably empty question as it lacks a measure of evaluation for those different approaches.  And this is another reason why I replied, in that it relates to my pondering the a-statistical nature of simulation-based approximation methods. Also called probabilistic numerics, not statistical numerics, eh! It is indeed close to impossible to compare such approaches and others on a general basis. For instance, the comparative analysis greatly differs when dealing with a once-in-a-lifetime problem and with an everyday issue, e.g. when building a package for a sufficiently standard model. In the former case, a quick-and-dirty off-the-shelf solution is recommended, while in the latter, designing an efficient and fine-tuned approach makes sense. (The pros and cons I discussed in my X validated answer thus do not apply in most settings!) If anything, using several approaches, whenever possible, is the best advice to give. If not on the targeted problem, at least on a toy or simulated version, to check for performances of those different tools. But this brings back the issue of cost and time… An endless garden of forking paths, one would say [in another setting].

## mixtures as exponential families

Posted in Kids, Statistics with tags , , , , , , , on December 8, 2015 by xi'an

Something I had not realised earlier and that came to me when answering a question on X validated about the scale parameter of a Gamma distribution. Following an earlier characterisation by Dennis Lindley, Ferguson has written a famous paper characterising location, scale and location-scale families within exponential families. For instance, a one-parameter location exponential family is necessarily the logarithm of a power of a Gamma distribution. What I found surprising is the equivalent for one-parameter scale exponential families: they are necessarily mixtures of positive and negative powers of Gamma distributions. This is surprising because a mixture does not seem to fit within the exponential family representation… I first thought Ferguson was using a different type of mixtures. Or of exponential family. But, after checking the details, it appears that the mixture involves a component on ℜ⁺ and another component on ℜ⁻ with a potential third component as a Dirac mass at zero. Hence, it only nominally writes as a mixture and does not offer the same challenges as a regular mixture. Not label switching. No latent variable. Having mutually exclusive supports solves all those problems and even allows for an indicator function to permeate through the exponential function… (Recall that the special mixture processed in Rubio and Steel also enjoys this feature.)

## Sunday morning puzzle

Posted in Books, Kids, R with tags , , , on November 22, 2015 by xi'an

A question from X validated that took me quite a while to fathom and then the solution suddenly became quite obvious:

If a sample taken from an arbitrary distribution on {0,1}⁶ is censored from its (0,0,0,0,0,0) elements, and if the marginal probabilities are know for all six components of the random vector, what is an estimate of the proportion of (missing) (0,0,0,0,0,0) elements?

Since the censoring modifies all probabilities by the same renormalisation, i.e. divides them by the probability to be different from (0,0,0,0,0,0), ρ, this probability can be estimated by looking at the marginal probabilities to be equal to 1, which equal the original and known marginal probabilities divided by ρ. Here is a short R code illustrating the approach that I wrote in the taxi home yesterday night:

#generate vectors
N=1e5
zprobs=c(.1,.9) #iid example
smpl=matrix(sample(0:1,6*N,rep=TRUE,prob=zprobs),ncol=6)
pty=apply(smpl,1,sum)
smpl=smpl[pty>0,]
ps=apply(smpl,2,mean)
cor=mean(ps/rep(zprobs[2],6))
#estimated original size
length(smpl[,1])*cor


A broader question is how many values (and which values) of the sample can be removed before this recovery gets impossible (with the same amount of information).

## data augmentation with divergence

Posted in Books, Kids, Statistics, University life with tags , , , , , on November 18, 2015 by xi'an

Another (!) Cross Validated question that shed some light on the difficulties of explaining the convergence of MCMC algorithms. Or in understanding conditioning and hierarchical models. The author wanted to know why a data augmentation of his did not converge: In a simplified setting, given an observation y that he wrote as y=h(x,θ), he had built a Gibbs sampler by reconstructing x=g(y,θ) and simulating θ given x: at each iteration t,

1. compute xt=g(y,θt-1)
2. simulate θt~π(θ|xt)

and he attributed the lack of convergence to a possible difficulty with the Jacobian. My own interpretation of the issue was rather that condition on the unobserved x was not the same as conditioning on the observed y and hence that y was missing from step 2. And that the simulation of x is useless. Unless one uses it in an augmented scheme à la Xiao-Li… Nonetheless, I like the problem, if only because my very first reaction was to draw a hierarchical dependence graph and to conclude this should be correct, before checking on a toy example that it was not!

## rediscovering the harmonic mean estimator

Posted in Kids, Statistics, University life with tags , , , , , , , on November 10, 2015 by xi'an

When looking at unanswered questions on X validated, I came across a question where the author wanted to approximate a normalising constant

$N=\int g(x)\,\text{d}x\,,$

while simulating from the associated density, g. While seemingly unaware of the (huge) literature in the area, he re-derived [a version of] the harmonic mean estimate by considering the [inverted importance sampling] identity

$\int_\mathcal{X} \dfrac{\alpha(x)}{g(x)}p(x) \,\text{d}x=\int_\mathcal{X} \dfrac{\alpha(x)}{N} \,\text{d}x=\dfrac{1}{N}$

when α is a probability density and by using for α the uniform over the whole range of the simulations from g. This choice of α obviously leads to an estimator with infinite variance when the support of g is unbounded, but the idea can be easily salvaged by using instead another uniform distribution, for instance on an highest density region, as we studied in our papers with Darren Wraith and Jean-Michel Marin. (Unfortunately, the originator of the question does not seem any longer interested in the problem.)

## Gauss to Laplace transmutation interpreted

Posted in Books, Kids, Statistics, University life with tags , , , , , , on November 9, 2015 by xi'an

Following my earlier post [induced by browsing X validated], on the strange property that the product of a Normal variate by an Exponential variate is a Laplace variate, I got contacted by Peng Ding from UC Berkeley, who showed me how to derive the result by a mere algebraic transform, related with the decomposition

(X+Y)(X-Y)=X²-Y² ~ 2XY

when X,Y are iid Normal N(0,1). Peng Ding and Joseph Blitzstein have now arXived a note detailing this derivation, along with another derivation using the moment generating function. As a coincidence, I also came across another interesting representation on X validated, namely that, when X and Y are Normal N(0,1) variates with correlation ρ,

XY ~ R(cos(πU)+ρ)

with R Exponential and U Uniform (0,1). As shown by the OP of that question, it is a direct consequence of the decomposition of (X+Y)(X-Y) and of the polar or Box-Muller representation. This does not lead to a standard distribution of course, but remains a nice representation of the product of two Normals.