## relabelling mixtures (#2)

Posted in Statistics, Travel, University life with tags , , , , , , on February 5, 2015 by xi'an

Following the previous post, I went and had  a (long) look at Puolamäki and Kaski’s paper. I must acknowledge that, despite having several runs through the paper, I still have trouble with the approach… From what I understand, the authors use a Bernoulli mixture pseudo-model to reallocate the observations to components.  That is, given an MCMC output with simulated allocations variables (a.k.a., hidden or latent variables), they create a (TxK)xn matrix of component binary indicators e.g., for a three component mixture,

0 1 0 0 1 0…
1 0 0 0 0 0…
0 0 1 1 0 1…
0 1 0 0 1 1…

and estimate a probability to be in component j for each of the n observations, according to the (pseudo-)likelihood

$\prod_{r=1}^R \sum_{j=1}^K \prod_{i=1}^n \beta_{i,j}^{z_{i,r}}(1-\beta_{i,j})^{1-z_{i,r}}$

It took me a few days, between morning runs and those wee hours when I cannot get back to sleep (!), to make some sense of this Bernoulli modelling. The allocation vectors are used together to estimate the probabilities of being “in” component j together. However the data—which is the outcome of an MCMC simulation and de facto does not originate from that Bernoulli mixture—does not seem appropriate, both because it is produced by an MCMC simulation and is made of blocks of highly correlated rows [which sum up to one]. The Bernoulli likelihood above also defines a new model, with many more parameters than in the original mixture model. And I fail to see why perfect, partial or inexistent label switching [in the MCMC sequence] is not going to impact the estimation of the Bernoulli mixture. And why an argument based on a fixed parameter value (Theorem 3) extends to an MCMC outcome where parameters themselves are subjected to some degree of label switching. Bemused, I remain…

## importance sampling schemes for evidence approximation [revised]

Posted in Statistics, University life with tags , , , , , , , on November 18, 2014 by xi'an

After a rather intense period of new simulations and versions, Juong Een (Kate) Lee and I have now resubmitted our paper on (some) importance sampling schemes for evidence approximation in mixture models to Bayesian Analysis. There is no fundamental change in the new version but rather a more detailed description of what those importance schemes mean in practice. The original idea in the paper is to improve upon the Rao-Blackwellisation solution proposed by Berkoff et al. (2002) and later by Marin et al. (2005) to avoid the impact of label switching on Chib’s formula. The Rao-Blackwellisation consists in averaging over all permutations of the labels while the improvement relies on the elimination of useless permutations, namely those that produce a negligible conditional density in Chib’s (candidate’s) formula. While the improvement implies truncated the overall sum and hence induces a potential bias (which was the concern of one referee), the determination of the irrelevant permutations after relabelling next to a single mode does not appear to cause any bias, while reducing the computational overload. Referees also made us aware of many recent proposals that conduct to different evidence approximations, albeit not directly related with our purpose. (One was Rodrigues and Walker, 2014, discussed and commented in a recent post.)

## Le Monde puzzle [#868]

Posted in Books, Kids, Statistics with tags , , , on June 1, 2014 by xi'an

Another permutation-based Le Monde mathematical puzzle:

Given the integers 1,…n, a “perfect” combination is a pair (i,j) of integers such that no other pair enjoys the same sum. For n=33, what is the maximum of perfect combinations one can build? And for n=214?

A rather straightforward problem, or so it seemed: take the pairs (2m,2m+1), their sums all differ, and we get the maximal possible number of sums, ⌊n/2⌋… However, I did not read the question properly (!) and the constraint is on the sum (i+j), namely

How many mutually exclusive pairs (i,j) can be found with different sums all bounded by n=33? n=2014?

In which case, the previous and obvious proposal works no longer… The dumb brute-force search

T=10^6
sol=0
for (t in 1:T){
perm=sample(1:sample(seq(20,32,2),1))
sal=sum(unique(apply(matrix(perm,ncol=2),1,sum))<33)
if (sal>sol){
sol=sal;laperm=perm}
}


> sol
[1] 12
> laperm
[1]  6  9  1 24 13 20  4  7 21 14 17  3 16 11 19 25 23 18 12 26 15  2  5 10 22
[26]  8
> unique(apply(matrix(laperm,ncol=2),1,sum))
[1] 17 28 26 47 31 32 30 22 23 19 27 25 24


which is close of the solution sol=13 proposed in Le Monde… It is obviously hopeless for a sum bounded by 2014. A light attempt at simulated annealing did not help either.

## Importance sampling schemes for evidence approximation in mixture models

Posted in R, Statistics, University life with tags , , , , , , , , , on November 27, 2013 by xi'an

Jeong Eun (Kate) Lee and I completed this paper, “Importance sampling schemes for evidence approximation in mixture models“, now posted on arXiv. (With the customary one-day lag for posting, making me bemoan the days of yore when arXiv would give a definitive arXiv number at the time of submission.) Kate came twice to Paris in the past years to work with me on this evaluation of Chib’s original marginal likelihood estimate (also called the candidate formula by Julian Besag). And on the improvement proposed by Berkhof, van Mechelen, and Gelman (2003), based on averaging over all permutations, idea that we rediscovered in an earlier paper with Jean-Michel Marin. (And that Andrew seemed to have completely forgotten. Despite being the very first one to publish [in English] a paper on a Gibbs sampler for mixtures.) Given that this averaging can get quite costly, we propose a preliminary step to reduce the number of relevant permutations to be considered in the averaging, removing far-away modes that do not contribute to the Rao-Blackwell estimate and called dual importance sampling. We also considered modelling the posterior as a product of k-component mixtures on the components, following a vague idea I had in the back of my mind for many years, but it did not help. In the above boxplot comparison of estimators, the marginal likelihood estimators are

1. Chib’s method using T = 5000 samples with a permutation correction by multiplying by k!.
2. Chib’s method (1), using T = 5000 samples which are randomly permuted.
3. Importance sampling estimate (7), using the maximum likelihood estimate (MLE) of the latents as centre.
4. Dual importance sampling using q in (8).
5. Dual importance sampling using an approximate in (14).
6. Bridge sampling (3). Here, label switching is imposed in hyperparameters.

## Le Monde puzzle [#817]

Posted in Books, Kids, R with tags , , , , on April 19, 2013 by xi'an

The weekly Le Monde puzzle is (again) a permutation problem that can be rephrased as follows:

Find

$\min_{\sigma\in\mathfrak{S}_{10}} \max_{0\le i\le 10}\ \sigma_i+\sigma_{i+1}$

where $\mathfrak{S}_{10}$ denotes the set of permutations on {0,…,10} and $\sigma_i$ is defined modulo 11 [to turn {0,…,10} into a torus]. Same question for

$\min_{\sigma\in\mathfrak{S}_{10}} \max_{0\le i\le 10}\ \sigma_i+\sigma_{i+1}+\sigma_{i+2}$

and for

$\min_{\sigma\in\mathfrak{S}_{10}} \max_{0\le i\le 10}\ \sigma_i+\cdots+\sigma_{i+5}$

This is rather straightforward to code if one adopts a brute-force approach::

perminmax=function(T=10^3){
mins=sums=rep(500,3)
permin=matrix(0:10,ncol=11,nrow=3,byrow=TRUE)

for (t in 1:T){
perms=sample(0:10)
for (j in 1:3)
if (sums[j]<mins[j]){
mins[j]=sums[j];permin[j,]=perms}
}

print(mins)
print(permin)
}


(where I imposed the first term to be zero because of the invariance by permutation), getting the solutions

> perminmax(10^5)
[1] 11 17 28
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    0   10    1    6    5    4    7    3    8     2     9
[2,]    0   10    4    3    5    1    9    6    2     8     7
[3,]    0    2    9    6    7    3    1    4   10     8     5


for 2, 3, and 5 terms.  (Since 10! is a mere 3.6 million, I checked with an exhaustive search, using the function permutation from the gtools package.)

## AMOR at 5000ft in a water tank…

Posted in Mountains, pictures, Statistics, University life with tags , , , , , , , , , , , , , , on November 22, 2012 by xi'an

On Monday, I attended the thesis defence of Rémi Bardenet in Orsay as a member (referee) of his thesis committee. While this was a thesis in computer science, which took place in the Linear Accelerator Lab in Orsay, it was clearly rooted in computational statistics, hence justifying my presence in the committee. The justification (!) for the splashy headline of this post is that Rémi’s work was motivated by the Pierre-Auger experiment on ultra-high-energy cosmic rays, where particles are detected through a network of 1600 water tanks spread over the Argentinian Pampa Amarilla on an area the size of Rhode Island (where I am incidentally going next week).

The part of Rémi’s thesis presented during the defence concentrated on his AMOR algorithm, arXived in a paper written with Olivier Cappé and Gersende Fort. AMOR stands for adaptive Metropolis online relabelling and combines adaptive MCMC techniques with relabelling strategies to fight label-switching (e.g., in mixtures). I have been interested in mixtures for eons (starting in 1987 in Ottawa with applying Titterington, Smith, and Makov to chest radiographs) and in label switching for ages (starting at the COMPSTAT conférence in Bristol in 1998). Rémi’s approach to the label switching problem follows the relabelling path, namely a projection of the original parameter space into a smaller subspace (that is also a quotient space) to avoid permutation invariance and lack of identifiability. (In the survey I wrote with Kate Lee, Jean-Michel Marin and Kerrie Mengersen, we suggest using the mode as a pivot to determine which permutation to use on the components of the mixture.) The paper suggests using an Euclidean distance to a mean determined adaptively, μt, with a quadratic form Σt also determined on-the-go, minimising (Pθ-μt)TΣt(Pθ-μt) over all permutations P at each step of the algorithm. The intuition behind the method is that the posterior over the restricted space should look like a roughly elliptically symmetric distribution, or at least like a unimodal distribution, rather than borrowing bits and pieces from different modes. While I appreciate the technical tour de force represented by the proof of convergence of the AMOR algorithm, I remain somehow sceptical about the approach and voiced the following objections during the defence: first, the assumption that the posterior becomes unimodal under an appropriate restriction is not necessarily realistic. Secondary modes often pop in with real data (as in the counter-example we used in our paper with Alessandra Iacobucci and Jean-Michel Marin). Next, the whole apparatus of fighting multiple modes and non-identifiability, i.e. fighting label switching, is to fall back on posterior means as Bayes estimators. As stressed in our JASA paper with Gilles Celeux and Merrilee Hurn, there is no reason for doing so and there are several reasons for not doing so:

• it breaks down under model specification, i.e., when the number of components is not correct
• it does not improve the speed of convergence but, on the opposite, restricts the space visited by the Markov chain
• it may fall victim to the fatal attraction of secondary modes by fitting too small an ellipse around one of those modes
• it ultimately depends on the parameterisation of the model
• there is no reason for using posterior means in mixture problems, posterior modes or cluster centres can be used instead

I am therefore very much more in favour of producing a posterior distribution that is as label switching as possible (since the true posterior is completely symmetric in this respect). Post-processing the resulting sample can be done by using off-the-shelf clustering in the component space, derived from the point process representation used by Matthew Stephens in his thesis and subsequent papers. It also allows for a direct estimation of the number of components.

In any case, this was a defence worth-attending that led me to think afresh about the label switching problem, with directions worth exploring next month while Kate Lee is visiting from Auckland. Rémi Bardenet is now headed for a postdoc in Oxford, a perfect location to discuss further label switching and to engage into new computational statistics research!

## [not] Le Monde puzzle (6)

Posted in University life with tags , , , , , , , on April 12, 2012 by xi'an

After posting the question on dinner table permutations on StackExchange (mathematics), I got almost instantaneously a reply that the right number was six, linking to the Covering chapter by Gordan and Stinson, taken from the second edition of Handbook of Combinatorial Designs. If I understand correctly this terse coverage that reminds me of Gradshteyn and Ryzhik’s monumental Table of Integrals, Series and Products, my question is connected to the covering number C(20,5,2), which is equal to 21 according to Table 1.33 of the above chapter. This is the minimum number of blocks such that every pair of points occurs in at least one block (element) of the collection of subsets of size 5 (the tables across courses). Now, in my problem, the interesting coverage is a resolvable coverage, i.e., a collection of tables that “can be partitioned into parallel classes, each of which consists of v/k[=4] disjoint blocks” (Definition 1.43). Then the number of parallel classes is r(4,5)=6, as provided by hardmath. Now the remaining question is how to build the resolvable 2-(20,5,1) covering, i.e. the table plans across the six courses… This may be related to a direct proof as to why a five course dinner does not work.