an elegant sampler

Posted in Books, Kids, R, University life with tags , , , , , , , on January 15, 2020 by xi'an

Following an X validated question on how to simulate a multinomial with fixed average, W. Huber produced a highly elegant and efficient resolution with the compact R code

tabulate(sample.int((k-1)*n, s-n) %% n + 1, n) + 1


where k is the number of classes, n the number of draws, and s equal to n times the fixed average. The R function sample.int is an alternative to sample that seems faster. Breaking the outcome of

sample.int((k-1)*n, s-n)


as nonzero positions in an n x (k-1) matrix and adding a adding a row of n 1’s leads to a simulation of integers between 1 and k by counting the 1’s in each of the n columns, which is the meaning of the above picture. Where the colour code is added after counting the number of 1’s. Since there are s 1’s in this matrix, the sum is automatically equal to s. Since the s-n positions are chosen uniformly over the n x (k-1) locations, the outcome is uniform. The rest of the R code is a brutally efficient way to translate the idea into a function. (By comparison, I brute-forced the question by suggesting a basic Metropolis algorithm.)

sampling-importance-resampling is not equivalent to exact sampling [triste SIR]

Posted in Books, Kids, Statistics, University life with tags , , , , , , on December 16, 2019 by xi'an

Following an X validated question on the topic, I reassessed a previous impression I had that sampling-importance-resampling (SIR) is equivalent to direct sampling for a given sample size. (As suggested in the above fit between a N(2,½) target and a N(0,1) proposal.)  Indeed, when one produces a sample

$x_1,\ldots,x_n \stackrel{\text{i.i.d.}}{\sim} g(x)$

and resamples with replacement from this sample using the importance weights

$f(x_1)g(x_1)^{-1},\ldots,f(x_n)g(x_n)^{-1}$

the resulting sample

$y_1,\ldots,y_n$

is neither “i.” nor “i.d.” since the resampling step involves a self-normalisation of the weights and hence a global bias in the evaluation of expectations. In particular, if the importance function g is a poor choice for the target f, meaning that the exploration of the whole support is imperfect, if possible (when both supports are equal), a given sample may well fail to reproduce the properties of an iid example ,as shown in the graph below where a Normal density is used for g while f is a Student t⁵ density:

label switching by optimal transport: Wasserstein to the rescue

Posted in Books, Statistics, Travel with tags , , , , , , , , , , , , , , on November 28, 2019 by xi'an

A new arXival by Pierre Monteiller et al. on resolving label switching by optimal transport. To appear in NeurIPS 2019, next month (where I will be, but extra muros, as I have not registered for the conference). Among other things, the paper was inspired from an answer of mine on X validated, presumably a première (and a dernière?!). Rather than picketing [in the likely unpleasant weather ]on the pavement outside the conference centre, here are my raw reactions to the proposal made in the paper. (Usual disclaimer: I was not involved in the review of this paper.)

“Previous methods such as the invariant losses of Celeux et al. (2000) and pivot alignments of Marin et al. (2005) do not identify modes in a principled manner.”

Unprincipled, me?! We did not aim at identifying all modes but only one of them, since the posterior distribution is invariant under reparameterisation. Without any bad feeling (!), I still maintain my position that using a permutation invariant loss function is a most principled and Bayesian approach towards a proper resolution of the issue. Even though figuring out the resulting Bayes estimate may prove tricky.

The paper thus adopts a different approach, towards giving a manageable meaning to the average of the mixture distributions over all permutations, not in a linear Euclidean sense but thanks to a Wasserstein barycentre. Which indeed allows for an averaged mixture density, although a point-by-point estimate that does not require switching to occur at all was already proposed in earlier papers of ours. Including the Bayesian Core. As shown above. What was first unclear to me is how necessary the Wasserstein formalism proves to be in this context. In fact, the major difference with the above picture is that the estimated barycentre is a mixture with the same number of components. Computing time? Bayesian estimate?

Green’s approach to the problem via a point process representation [briefly mentioned on page 6] of the mixture itself, as for instance presented in our mixture analysis handbook, should have been considered. As well as issues about Bayes factors examined in Gelman et al. (2003) and our more recent work with Kate Jeong Eun Lee. Where the practical impossibility of considering all possible permutations is processed by importance sampling.

An idle thought that came to me while reading this paper (in Seoul) was that a more challenging problem would be to face a model invariant under the action of a group with only a subset of known elements of that group. Or simply too many elements in the group. In which case averaging over the orbit would become an issue.

Posted in Books, Statistics with tags , , , on November 27, 2019 by xi'an

Following a question on X validated as to why the mean of the log of a uniform distribution is not log(0.5), I replied with the obvious link to Jensen’s inequality and the more general if equally obvious remark that expectation was rarely invariant under transforms and ended up with an high number of up-votes on that answer. Which bemuses me given the basic question and equally basic answer..!

conditioning on zero probability events

Posted in Books, Kids, pictures, Statistics, University life with tags , , , , on November 15, 2019 by xi'an

An interesting question on X validated as to how come a statistic T(X) can be sufficient when its support depends on the parameter θ behind the distribution of X. The reasoning there being that the distribution of X given T(X)=t does depend on θ since it is not defined for some values of θ … Which is not correct in that the conditional distribution of X depends on the realisation of T, meaning that if this realisation is impossible, then the conditional is arbitrary and of no relevance. Which also led me to tangentially notice and bemoan that most (Stack) exchanges on conditioning on zero probability events are pretty unsatisfactory in that they insist on interpreting P(X=x) [equal to zero] in a literal sense when it is merely a notation in the continuous case. And undefined when X has a discrete support. (Conditional probability is always a sore point for my students!)

Why do we draw parameters to draw from a marginal distribution that does not contain the parameters?

Posted in Statistics with tags , , , , , , , on November 3, 2019 by xi'an

A revealing question on X validated of a simulation concept students (and others) have trouble gripping with. Namely using auxiliary variates to simulate from a marginal distribution, since these auxiliary variables are later dismissed and hence appear to them (students) of no use at all. Even after being exposed to the accept-reject algorithm. Or to multiple importance sampling. In the sense that a realisation of a random variable can be associated with a whole series of densities in an importance weight, all of them being valid (but some more equal than others!).

What does the symbol for pi with a lower perpendicular mean?

Posted in Books, pictures, University life with tags , , , , on November 1, 2019 by xi'an