## Sunday morning puzzle

Posted in Books, Kids, R on November 22, 2015 by xi'an

A question from X validated that took me quite a while to fathom and then the solution suddenly became quite obvious:

If a sample taken from an arbitrary distribution on {0,1}⁶ is censored from its (0,0,0,0,0,0) elements, and if the marginal probabilities are known for all six components of the random vector, what is an estimate of the proportion of (missing) (0,0,0,0,0,0) elements?

Since the censoring modifies all probabilities by the same renormalisation, i.e. divides them by the probability ρ of being different from (0,0,0,0,0,0), this probability can be estimated from the marginal probabilities of being equal to 1, which are the original and known marginal probabilities divided by ρ. Here is a short R code illustrating the approach, written in the taxi home last night:

```r
#generate vectors
N=1e5
zprobs=c(.1,.9) #iid example
smpl=matrix(sample(0:1,6*N,rep=TRUE,prob=zprobs),ncol=6)
#censor the all-zero rows
pty=apply(smpl,1,sum)
smpl=smpl[pty>0,]
#marginal frequencies in the censored sample estimate zprobs[2]/rho
ps=apply(smpl,2,mean)
rhoinv=mean(ps/rep(zprobs[2],6)) #estimates 1/rho
#estimated original size
nrow(smpl)*rhoinv
```
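
The same experiment can be replicated in Python (a sketch of mine, not part of the original post, with a success probability of .3 so that the censoring actually bites):

```python
import random

random.seed(11)
N = 100_000
p = 0.3                        # known marginal probability of a 1
rows = [[1 if random.random() < p else 0 for _ in range(6)]
        for _ in range(N)]
# censor the (0,0,0,0,0,0) rows
kept = [r for r in rows if sum(r) > 0]

# marginal frequencies in the censored sample estimate p/rho
ps = [sum(r[j] for r in kept) / len(kept) for j in range(6)]
inv_rho = sum(f / p for f in ps) / 6    # estimates 1/rho
N_hat = len(kept) * inv_rho             # estimated original size
print(round(N_hat))   # should be close to N = 100000
```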


A broader question is how many values (and which values) of the sample can be removed before this recovery becomes impossible (with the same amount of information).

## data augmentation with divergence

Posted in Books, Kids, Statistics, University life on November 18, 2015 by xi'an

Another (!) Cross Validated question that shed some light on the difficulties of explaining the convergence of MCMC algorithms, or of understanding conditioning and hierarchical models. The author wanted to know why a data augmentation scheme of his did not converge: in a simplified setting, given an observation y written as y=h(x,θ), he had built a Gibbs sampler by reconstructing x=g(y,θ) and simulating θ given x: at each iteration t,

1. compute xₜ = g(y, θₜ₋₁)
2. simulate θₜ ~ π(θ|xₜ)

and he attributed the lack of convergence to a possible difficulty with the Jacobian. My own interpretation of the issue was rather that conditioning on the unobserved x is not the same as conditioning on the observed y, hence that y was missing from step 2, and that the simulation of x is useless. Unless one uses it in an augmented scheme à la Xiao-Li… Nonetheless, I like the problem, if only because my very first reaction was to draw a hierarchical dependence graph and to conclude the scheme should be correct, before checking on a toy example that it was not!
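
Here is a toy example of my own (not the OP's model) showing the failure: take θ ~ N(0,1), x|θ ~ N(θ,1), and the observation y = h(x,θ) = x+θ. The flawed sampler reconstructs x = y−θ and draws θ from π(θ|x) = N(x/2, 1/2), while the true posterior given y is θ|y ~ N(2y/5, 1/5); the chain instead stabilises around y/3:

```python
import random

random.seed(42)

y = 3.0          # the observation
T = 100_000      # chain length
theta = 0.0
total = 0.0
for t in range(T):
    x = y - theta                              # step 1: x_t = g(y, theta_{t-1})
    theta = random.gauss(x / 2, 0.5 ** 0.5)    # step 2: theta_t ~ pi(theta|x_t) = N(x/2, 1/2)
    total += theta

chain_mean = total / T
print(chain_mean)   # ~ y/3 = 1.0, not the true posterior mean 2y/5 = 1.2
```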

## rediscovering the harmonic mean estimator

Posted in Kids, Statistics, University life on November 10, 2015 by xi'an

When looking at unanswered questions on X validated, I came across a question where the author wanted to approximate a normalising constant

$N=\int g(x)\,\text{d}x\,,$

while simulating from the associated density, g. While seemingly unaware of the (huge) literature in the area, he re-derived [a version of] the harmonic mean estimate by considering the [inverted importance sampling] identity

$\int_\mathcal{X} \dfrac{\alpha(x)}{g(x)}p(x) \,\text{d}x=\int_\mathcal{X} \dfrac{\alpha(x)}{N} \,\text{d}x=\dfrac{1}{N}$

when α is a probability density, and by using for α the uniform density over the whole range of the simulations from g. This choice of α obviously leads to an estimator with infinite variance when the support of g is unbounded, but the idea can easily be salvaged by using instead another uniform distribution, for instance on a highest density region, as we studied in our papers with Darren Wraith and Jean-Michel Marin. (Unfortunately, the originator of the question does not seem interested in the problem any longer.)
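
A quick numerical sketch of the salvaged estimator (a toy example of mine, not from the question): take g(x) = exp(−x²/2), so that the associated density is the standard normal and N = √(2π); choosing α uniform on [−1,1] keeps the ratio α(x)/g(x) bounded:

```python
import math
import random

random.seed(1)

n = 200_000
# simulate from the density associated with g, here N(0,1)
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

def g(x):
    # unnormalised density; its integral is N = sqrt(2*pi)
    return math.exp(-x * x / 2)

# alpha: the uniform density on [-1, 1], equal to 1/2 on that interval
# the average of alpha(x)/g(x) over draws from g/N estimates 1/N
inv_N = sum(0.5 / g(x) for x in xs if abs(x) <= 1.0) / n
N_hat = 1.0 / inv_N
print(N_hat)   # close to sqrt(2*pi) ≈ 2.5066
```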

## Gauss to Laplace transmutation interpreted

Posted in Books, Kids, Statistics, University life on November 9, 2015 by xi'an

Following my earlier post [induced by browsing X validated] on the strange property that the product of a Normal variate by an Exponential variate is a Laplace variate, I was contacted by Peng Ding from UC Berkeley, who showed me how to derive the result by a mere algebraic transform, related to the decomposition

(X+Y)(X-Y)=X²-Y² ~ 2XY

when X,Y are iid Normal N(0,1). Peng Ding and Joseph Blitzstein have now arXived a note detailing this derivation, along with another derivation using the moment generating function. As a coincidence, I also came across another interesting representation on X validated, namely that, when X and Y are Normal N(0,1) variates with correlation ρ,

XY ~ R(cos(πU)+ρ)

with R Exponential and U Uniform (0,1). As shown by the OP of that question, it is a direct consequence of the decomposition of (X+Y)(X-Y) and of the polar or Box-Muller representation. This does not lead to a standard distribution of course, but remains a nice representation of the product of two Normals.
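
A simulation sketch of mine (assuming R ~ Exp(1) and U ~ U(0,1) in the representation above) comparing the first two moments of the two sides, with ρ = 0.5 (both means should be close to ρ and both variances close to 1+ρ²):

```python
import math
import random

random.seed(7)
rho = 0.5
n = 200_000

prod = []    # draws of XY, with (X,Y) standard normal with correlation rho
rep = []     # draws of R*(cos(pi*U) + rho)
for _ in range(n):
    z1 = random.gauss(0.0, 1.0)
    z2 = random.gauss(0.0, 1.0)
    x = z1
    y = rho * z1 + math.sqrt(1 - rho * rho) * z2
    prod.append(x * y)
    r = random.expovariate(1.0)
    u = random.random()
    rep.append(r * (math.cos(math.pi * u) + rho))

mean_prod = sum(prod) / n
mean_rep = sum(rep) / n
var_prod = sum(v * v for v in prod) / n - mean_prod ** 2
var_rep = sum(v * v for v in rep) / n - mean_rep ** 2
print(mean_prod, mean_rep, var_prod, var_rep)
```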

## miXed distributions

Posted in Books, Kids, Statistics, University life on November 3, 2015 by xi'an

A couple of questions on X validated showed the difficulty students have with mixed measures and their density. Actually, my students always react with incredulity to the likelihood of a censored normal sample or to the derivation of a Bayes factor associated with the null (and atomic) hypothesis μ=0…

I attribute this difficulty to a poor understanding of the notion of density and hence to a deficiency in the training in measure theory, since the density f of the distribution F is always relative to a reference measure dμ, i.e.

f(x) = dF/dμ(x)

(Hence Lebesgue’s moustache on the attached poster!) Handling atoms in the distribution requires introducing a dominating measure dμ with atomic components, i.e., usually a sum of the Lebesgue measure and of a counting measure on the appropriate set. Which is not absolutely obvious: while the first question had {0,1} as atoms, the second question introduced atoms on {-θ,θ} and required a change of variable to consider a counting measure on {-1,1}. I found this second question of genuine interest and a great toy example for class and exams.
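
As an illustration (my own, not taken from either question), the density of a N(μ,1) observation right-censored at c, written with respect to the Lebesgue measure plus a point mass at c:

```python
import math

def norm_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def censored_density(y, mu, c):
    """Density of min(X, c), X ~ N(mu, 1), w.r.t. the Lebesgue measure
    on (-inf, c) plus the counting measure on {c}."""
    if y < c:
        return norm_pdf(y - mu)        # continuous part
    if y == c:
        return 1 - norm_cdf(c - mu)    # atom: P(X >= c)
    return 0.0

def loglik(sample, mu, c):
    # log-likelihood of a right-censored sample
    return sum(math.log(censored_density(y, mu, c)) for y in sample)
```

So an entirely censored sample simply contributes n·log(1−Φ(c−μ)) to the log-likelihood, via the atomic part of the density.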

## Think Bayes: Bayesian Statistics Made Simple

Posted in Books, Kids, R, Statistics, University life on October 27, 2015 by xi'an

By some piece of luck, I came upon the book Think Bayes: Bayesian Statistics Made Simple, written by Allen B. Downey and published by Green Tea Press [which I could relate to No Starch Press, focussing on coffee!, the publisher of Statistics Done Wrong that I reviewed a while ago], which usually publishes programming books with fun covers. The book is available on-line for free in pdf and html formats, and I went through it during a particularly exciting administrative meeting…

“Most books on Bayesian statistics use mathematical notation and present ideas in terms of mathematical concepts like calculus. This book uses Python code instead of math, and discrete approximations instead of continuous mathematics. As a result, what would be an integral in a math book becomes a summation, and most operations on probability distributions are simple loops.”
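
For instance, a minimal grid approximation of the posterior on a coin's bias (a sketch of mine in plain Python, not code from the book):

```python
# posterior for a coin's bias p after observing 140 heads in 250 tosses,
# computed by summation over a discrete grid instead of integration
grid = [i / 100 for i in range(101)]    # hypotheses for p
prior = [1.0 for _ in grid]             # uniform prior

heads, tosses = 140, 250
post = [pr * (p ** heads) * ((1 - p) ** (tosses - heads))
        for pr, p in zip(prior, grid)]
total = sum(post)                       # the "integral" is a sum
post = [w / total for w in post]

post_mean = sum(p * w for p, w in zip(grid, post))
print(round(post_mean, 3))   # close to the exact Beta mean (140+1)/(250+2)
```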

The book is most appropriately published in this collection, as most of it concentrates on Python programming, with hardly any maths formula; in some sense it is similar to Jim Albert's R book. Obviously, coming from maths, and having never programmed in Python, I find the approach puzzling. But just as obviously, I am aware, both from the comments on my books and from my experience on X validated, that a large group (majority?) of newcomers to the Bayesian realm find the mathematical approach to the topic a major hindrance. Hence I am quite open to this editorial choice as it is bound to bring more people to think Bayes, or to think they can think Bayes.

“…in fewer than 200 pages we have made it from the basics of probability to the research frontier. I’m very happy about that.”

The choice of operating almost exclusively through motivating examples is rather traditional in US textbooks, see e.g. Albert's book. While it goes against my French inclination to start from theory and concepts and to end with illustrations, I can see how it operates in a programming book. But as always I fear it makes generalisation uncertain and understanding more shaky… The examples are perforce simple and far from realistic statistics issues, hence they illustrate the use of Bayesian thinking for decision making more than for data analysis. To wit, those examples are about the Monty Hall problem and other TV games, some urn, dice, and coin models, blood testing, sport predictions, subway waiting times, height variability between men and women, SAT scores, cancer causality, a Geiger counter hierarchical model inspired by Jaynes, …, the exception being the Belly Button Biodiversity dataset in the final chapter, dealing with the (exciting) unseen species problem in an equally exciting way. This may explain why the book does not cover MCMC algorithms, and why ABC is covered through a rather artificial normal example, which also hides some of the maths computations under the carpet.

“The underlying idea of ABC is that two datasets are alike if they yield the same summary statistics. But in some cases, like the example in this chapter, it is not obvious which summary statistics to choose.”

In conclusion, this is a very original introduction to Bayesian analysis, which I welcome for the reasons above. Of course, it is only an introduction and should be followed by a deeper entry into the topic, with [more] maths, in order to handle more realistic models and datasets.

## Gauss to Laplace transmutation!

Posted in Kids, Statistics, University life, Books on October 14, 2015 by xi'an

When browsing X validated the other day [translate as: procrastinating!], I came upon the strange property that the marginal distribution of a zero mean normal variate with exponential variance is a Laplace distribution. I first thought there was a mistake, since we usually take an inverse Gamma prior on the variance parameter, not a Gamma, but then the marginal is a t distribution. The result is curious and can be expressed in a variety of ways:

– the product of a χ²₁ and of a χ₂ is a χ²₂;
– the determinant of a 2×2 normal matrix is a Laplace variate;
– a difference of exponentials is Laplace…
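
The transmutation is easy to check by simulation (a sketch of mine): if V ~ Exp(1) and Z ~ N(0,1) are independent, then X = √V·Z has characteristic function E[exp(−t²V/2)] = 1/(1+t²/2), i.e., X is Laplace with scale 1/√2, so that Var(X) = 1 and E|X| = 1/√2:

```python
import math
import random

random.seed(3)
n = 200_000

# X = sqrt(V) * Z with V ~ Exp(1), Z ~ N(0,1): marginally Laplace(0, 1/sqrt(2))
xs = [math.sqrt(random.expovariate(1.0)) * random.gauss(0.0, 1.0)
      for _ in range(n)]

mean_abs = sum(abs(x) for x in xs) / n    # should be ~ 1/sqrt(2) ≈ 0.707
var = sum(x * x for x in xs) / n          # should be ~ 1 (the mean is 0)
print(mean_abs, var)
```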

The OP was asking for a direct proof of the result and I eventually sorted it out by a series of changes of variables, although there exists a much more elegant and general proof by Mike West, then at the University of Warwick, based on characteristic functions (or Fourier transforms). It reminded me that continuous, unimodal [at zero] and symmetric densities are necessarily scale mixtures [a wee misnomer] of Gaussians. Mike proves in this paper that exponential power densities [including both the Normal and the Laplace cases] correspond to the variance having an inverse positive stable distribution with half the power. And this is a straightforward consequence of the exponential power density being proportional to the Fourier transform of a stable distribution, and of a Fubini inversion. (Incidentally, the processing times of Biometrika were not that impressive at the time, with this 2-page paper submitted in Dec. 1984 and published in Sept. 1987!)

This is a very nice and general derivation, but I still miss the intuition as to why it happens that way. But then, I know nothing, and even less about products of random variates!