**W**hile in Denver, at JSM, I came across [across validated!] this primarily challenging problem of finding the posterior of the 10³ long probability vector of a Multinomial M(10⁶,p) when only observing the range of a realisation of M(10⁶,p). This sounded challenging because the distribution of the pair (min,max) is not available in closed form. (Although this allowed me to find a paper on the topic by the late Shanti Gupta, who was chair at Purdue University when I visited 32 years ago…) This seemed to call for ABC (especially since I was about to give an introductory lecture on the topic!, law of the hammer…), but the simulation of datasets compatible with the extreme values of both minimum and maximum, m=80 and M=12000, proved difficult when using a uniform Dirichlet prior on the probability vector, since these extremes called for both small and large values of the probabilities. However, I later realised that the problem could be brought down to a Multinomial with only three categories and the observation (m,M,n-m-M), leading to an obvious Dirichlet posterior and a predictive for the remaining 10³-2 realisations.

## Archive for multinomial distribution

## a problem that did not need ABC in the end

Posted in Books, pictures, Statistics, Travel with tags ABC, Approximate Bayesian computation, Colorado, cross validated, dawn, Denver, high rise, introductory opening lecture, jatp, JSM 2019, law of the hammer, multinomial distribution, predictive on August 8, 2019 by xi'an## alternatives to EM

Posted in Books, Statistics with tags Computational Statistics and Data Analysis, EM algorithm, finite mixtures, Iverson bracket function, Lagrangian, latent class model, multinomial distribution, quasi-Newton resolution on January 30, 2019 by xi'an**I**n an arXived preprint submitted to *Computational Statistics & Data Analysis*, Chan, Han, and Lim study alternatives to EM for latent class models. That is, mixtures of products of Multinomials. (First occurrence of an indicator function being called the “Iverson bracket function”!) The introduction is fairly extensive given this most studied model. The criticisms of EM laid by the authors are that (a) it does not produce an evaluation of the estimation error, which does not sound correct; (b) the convergence is slow, which is also rather misleading as my [low dimensional] experience with mixtures is that it gets very quickly and apparently linearly to the vicinity of one of the modes. The argument in favour of alternative non-linear optimisation approaches is that they can achieve quadratic convergence. One solution is a projected Quasi-Newton method, based on a quadratic approximation to the target. With some additional intricacies that make the claim of being “way easier than EM algorithm” somewhat specious. The second approach proposed in the paper is sequential quadratic programming, which incorporates the Lagrange multiplier in the target. While the different simulations in the paper show that EM may indeed call for a much larger number of iterations, the obtained likelihoods all are comparable.

## multinomial resampling by Metropolis

Posted in Books, Statistics with tags Metropolis-Hastings algorithm, multinomial distribution, particle degeneracy, Raftery and Lewis' number of iterations, stratified resampling, systematic resampling on December 28, 2017 by xi'an**A** few years ago Lawrence Murray wrote a note on accelerating the resampling stage in particle filters by using a Metropolis step. And GPUs. The notion that Metropolis can be applied in this setting is at first puzzling since exact multinomial sampling is available. And Metropolis requires convergence guarantees. Which Lawrence covers by a Raftery and Lewis assessment, which has severe limitations in general but may well be adequate for this very case, although possibly too conservative in the number of recommended Metropolis iterations. The gain brought by Metropolis is that it does not require summing up all the particle weights, and as a result the gain is real in that Metropolis beats all other approaches (time-wise) when the number of particles is not too large and the heterogeneity of the weighs not too high. (I did not know of this note until Richard Everitt brought it to my attention.)

## occupancy rules

Posted in Kids, R, Statistics with tags moment derivation, moments, multinomial distribution, occupancy, R, Stack Exchange, Stirling number, surjection on May 23, 2016 by xi'an**W**hile the last riddle on The Riddler was rather anticlimactic, namely to find the mean of the number Y of empty bins in a uniform multinomial with n bins and m draws, with solution

[which still has a link with e in that the fraction of empty bins converges to e⁻¹ when n=m], this led me to some more involved investigation on the distribution of Y. While it can be shown directly that the probability that k bins are non-empty is

with an R representation by

miss<-function(n,m){ p=rep(0,n) for (k in 1:n) p[k]=choose(n,k)*sum((-1)^((k-1):0)*choose(k,1:k)*(1:k)^m) return(rev(p)/n^m)}

I wanted to take advantage of the moments of Y, since it writes as a sum of n indicators, counting the number of empty cells. However, the higher moments of Y are not as straightforward as its expectation and I struggled with the representation until I came upon this formula

where S(k,i) denotes the Stirling number of the second kind… Or i!S(n,i) is the number of surjections from a set of size n to a set of size i. Which leads to the distribution of Y by inverting the moment equations, as in the following R code:

diss<-function(n,m){ A=matrix(0,n,n) mome=rep(0,n) A[n,]=rep(1,n) mome[n]=1 for (k in 1:(n-1)){ A[k,]=(0:(n-1))^k for (i in 1:k) mome[k]=mome[k]+factorial(i)*as.integer(Stirling2(n,i))* (1-(i+1)/n)^m*factorial(k)/factorial(k-i-1)} return(solve(A,mome))}

that I still checked by raw simulations from the multinomial

zample<-function(n,m,T=1e4){ x=matrix(sample(1:n,m*T,rep=TRUE),nrow=T) x=sapply(apply(x,1,unique),length) return(n-x)}

## objectivity in prior distributions for the multinomial model

Posted in Statistics, University life with tags Haldane's prior, Laplace's prior, multinomial distribution, non-informative priors, objective Bayes, prior selection on March 17, 2016 by xi'an**T**oday, Danilo Alvares visiting from the Universitat de Valencià gave a talk at CREST about choosing a prior for the Multinomial distribution. Comparing different Dirichlet priors. In a sense this is an hopeless task, first because there is no reason to pick a particular prior unless one picks a very specific and a-Bayesian criterion to discriminate between priors, second because the multinomial is a weird distribution, hardly a distribution at all in that it results from grouping observations into classes, often based on the observations themselves. A construction that should be included within the choice of the prior maybe? But there lurks a danger of ending up with a data-dependent prior. My other remark about this problem is that, among the token priors, Perk’s prior using 1/k as its hyper-parameter [where k is the number of categories] is rather difficult to justify compared with 1/k² or 1/k³, except for aggregation consistency to some extent. And Laplace’s prior gets highly concentrated as the number of categories grows.