a problem that did not need ABC in the end

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , , , , , on August 8, 2019 by xi'an

While in Denver, at JSM, I came across [across validated!] this primarily challenging problem of finding the posterior of the 10³ long probability vector of a Multinomial M(10⁶,p) when only observing the range of a realisation of M(10⁶,p). This sounded challenging because the distribution of the pair (min,max) is not available in closed form. (Although this allowed me to find a paper on the topic by the late Shanti Gupta, who was chair at Purdue University when I visited 32 years ago…) This seemed to call for ABC (especially since I was about to give an introductory lecture on the topic!, law of the hammer…), but the simulation of datasets compatible with the extreme values of both minimum and maximum, m=80 and M=12000, proved difficult when using a uniform Dirichlet prior on the probability vector, since these extremes called for both small and large values of the probabilities. However, I later realised that the problem could be brought down to a Multinomial with only three categories and the observation (m,M,n-m-M), leading to an obvious Dirichlet posterior and a predictive for the remaining 10³-2 realisations.

alternatives to EM

Posted in Books, Statistics with tags , , , , , , , on January 30, 2019 by xi'an

In an arXived preprint submitted to Computational Statistics & Data Analysis, Chan, Han, and Lim study alternatives to EM for latent class models. That is, mixtures of products of Multinomials. (First occurrence of an indicator function being called the “Iverson bracket function”!) The introduction is fairly extensive given this most studied model. The criticisms of EM laid by the authors are that (a) it does not produce an evaluation of the estimation error, which does not sound correct; (b) the convergence is slow, which is also rather misleading as my [low dimensional] experience with mixtures is that it gets very quickly and apparently linearly  to the vicinity of one of the modes. The argument in favour of alternative non-linear optimisation approaches is that they can achieve quadratic convergence. One solution is a projected Quasi-Newton method, based on a quadratic approximation to the target. With some additional intricacies that make the claim of being “way easier than EM algorithm” somewhat specious. The second approach proposed in the paper is sequential quadratic programming, which incorporates the Lagrange multiplier in the target. While the different simulations in the paper show that EM may indeed call for a much larger number of iterations, the obtained likelihoods all are comparable.

multinomial resampling by Metropolis

Posted in Books, Statistics with tags , , , , , on December 28, 2017 by xi'an

A few years ago Lawrence Murray wrote a note on accelerating the resampling stage in particle filters by using a Metropolis step. And GPUs. The notion that Metropolis can be applied in this setting is at first puzzling since exact multinomial sampling is available. And Metropolis requires convergence guarantees. Which Lawrence covers by a Raftery and Lewis assessment, which has severe limitations in general but may well be adequate for this very case, although possibly too conservative in the number of recommended Metropolis iterations. The gain brought by Metropolis is that it does not require summing up all the particle weights, and as a result the gain is real in that Metropolis beats all other approaches (time-wise) when the number of particles is not too large and the heterogeneity of the weighs not too  high. (I did not know of this note until Richard Everitt brought it to my attention.)

occupancy rules

Posted in Kids, R, Statistics with tags , , , , , , , on May 23, 2016 by xi'an

While the last riddle on The Riddler was rather anticlimactic, namely to find the mean of the number Y of empty bins in a uniform multinomial with n bins and m draws, with solution

$\mathbb{E}[Y]=n(1-\frac{1}{n})^m,$

[which still has a link with e in that the fraction of empty bins converges to e⁻¹ when n=m], this led me to some more involved investigation on the distribution of Y. While it can be shown directly that the probability that k bins are non-empty is

${n \choose k}\sum_{i=1}^k (-1)^{k-i}{k \choose i}(i/n)^m$

with an R representation by

miss<-function(n,m){
p=rep(0,n)
for (k in 1:n)
p[k]=choose(n,k)*sum((-1)^((k-1):0)*choose(k,1:k)*(1:k)^m)
return(rev(p)/n^m)}


I wanted to take advantage of the moments of Y, since it writes as a sum of n indicators, counting the number of empty cells. However, the higher moments of Y are not as straightforward as its expectation and I struggled with the representation until I came upon this formula

$\mathbb{E}[Y^k]=\sum_{i=1}^k {k \choose i} i! S(k,i) \left( 1-\frac{i}{n}\right)^m$

where S(k,i) denotes the Stirling number of the second kind… Or i!S(n,i) is the number of surjections from a set of size n to a set of size i. Which leads to the distribution of Y by inverting the moment equations, as in the following R code:

diss<-function(n,m){
A=matrix(0,n,n)
mome=rep(0,n)
A[n,]=rep(1,n)
mome[n]=1
for (k in 1:(n-1)){
A[k,]=(0:(n-1))^k
for (i in 1:k)
mome[k]=mome[k]+factorial(i)*as.integer(Stirling2(n,i))*
(1-(i+1)/n)^m*factorial(k)/factorial(k-i-1)}
return(solve(A,mome))}


that I still checked by raw simulations from the multinomial

zample<-function(n,m,T=1e4){
x=matrix(sample(1:n,m*T,rep=TRUE),nrow=T)
x=sapply(apply(x,1,unique),length)
return(n-x)}


objectivity in prior distributions for the multinomial model

Posted in Statistics, University life with tags , , , , , on March 17, 2016 by xi'an

Today, Danilo Alvares visiting from the Universitat de Valencià gave a talk at CREST about choosing a prior for the Multinomial distribution. Comparing different Dirichlet priors. In a sense this is an hopeless task, first because there is no reason to pick a particular prior unless one picks a very specific and a-Bayesian criterion to discriminate between priors, second because the multinomial is a weird distribution, hardly a distribution at all in that it results from grouping observations into classes, often based on the observations themselves. A construction that should be included within the choice of the prior maybe? But there lurks a danger of ending up with a data-dependent prior. My other remark about this problem is that, among the token priors, Perk’s prior using 1/k as its hyper-parameter [where k is the number of categories] is rather difficult to justify compared with 1/k² or 1/k³, except for aggregation consistency to some extent. And Laplace’s prior gets highly concentrated as the number of categories grows.