**A**t BNP13, Brian Trippe presented the AISTAT 2022 paper he recently wrote with Tin D. Nguyen and Tamara Broderick. Which made me read their 2021 paper on the topic. There, they note that coupling may prove challenging, which they blame on label switching. Considering a naïve Gibbs sampler on the space of partitions, meaning allocating each data-point to one of the existing partitions or to a singleton, they construct an optimal transport coupling under Hamming distance. Which appears to be achievable in O(NK³log{K}), if K is the maximal number of partitions among both chains. The paper does not goes deeply into the implementation, which involves [to quote] (a) *computing* the* distances between each pair of partitions in the Cartesian product of supports of the Gibbs conditionals* and (b) *solving the optimal transport problem*. Except in the appendix where the book-keeping necessary to achieve O(K²) for pairwise distances and the remaining complexity follows from the standard Orlin’s algorithm. What remains unclear from the paper is that, while the chains couple faster (fastest?), the resulting estimators do not necessarily improve upon budget-equivalent alternatives. (The reason for the failure of the single chain in Figure 2 is hard to fathom.)

## Archive for label switching

## coupling for the Gibbs sampler

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags AISTATS 2022, BNP13, Hamming distance, label switching, Lago Llanquihue, maximal coupling, optimal transport, Orlin's algorithm, Orsono volcano, partition, single chain on November 27, 2022 by xi'an## BNP13

Posted in Mountains, pictures, Running, Statistics, Travel with tags Bayesian non-parametrics, Bernstein-von Mises theorem, Biometrika, BNP13, Bruno de Finetti, Charles de Gaulle, Chile, conference, ISBA, jetlag, label switching, Lago Llanquihue, optimal coupling, optimal transport, parallel MCMC, Patagonia, Puerto Varas on October 28, 2022 by xi'anBNP13 is set in this incredible location on a massive lake (almost as large as Lac Saint Jean!) facing several tantalizing snow-capped volcanoes… My trip from Paris to Puerto Varas was quite smooth if relatively longish (but I slept close to 8 hours on the first leg and busied myself with Biometrika submissions the rest of the way). Leaving from Paris at midnight proved a double advantage as this was one of the last flights leaving, with hardly anyone in the airport. On Sunday, I arrived early enough to take a quick dip in Lake Llanquihue which was fairly cold and choppy!

Overall the conference is quite exhilarating as all talks are of interest and often covering on-going research. This may be one of the most engaging meetings I have attended in the past years! Plus a refreshing variety of topics and seniority in the speakers.

To start with a bang!, Sonia Petrone (Bocconi) gave a very nice plenary lecture in the most auspicious manner, covering her recent works on Bayesian prediction as an alternative way to run Bayesian inference (in connection with the incoming Read Paper by Fong et al.). She covered so much ground that I got lost before long (jetlag did not help!). However, an interesting feature underlying her talk is that, under exchangeability, the sequence of predictives converges to a random probability measure, a de Finetti way to construct the prior that is based on predictives. Avoiding in a sense the model and the prior on the parameters of that process. (The parameter is derived from the infinite exchangeable [or conditionally iid] sequence, but the sequence of predictives need be defined.) The drawback is that this approach involves infinite sequences, with practical truncation to a finite horizon being an approximation whose precision / error may prove elusive to characterise. The predictive approach also allows to recover a limiting Normal distribution (not a Bernstein-von Mises type!) and hence credible intervals on parameters and distributions.

While this is indeed a BNP conference (!), I was surprised to see lot of talks paying attention to clustering and even to mixtures, with again a recurrent imprecision on the meaning of a cluster. (Maybe this was already the case for BNP11 in Paris but I may have been too busy helping with catering to notice!) For instance, Brian Trippe (MIT) gave a quick intro on his (AISTATS 2022) work on parallel MCMC with coupling. As unbiased MCMC strongly improving upon naïve parallel MCMC relative to the computing cost. With an interesting example where coupling is agnostic to the labeling of random partitions in clustering problems, involving optimal transport, manageable in O(K³log(K)) time when K is the number of clusters.

## Another harmonic mean

Posted in Books, Statistics, University life with tags Chib's approximation, harmonic mean, harmonic mean estimator, HPD region, label switching, marginal likelihood, MaxEnt 2009, Monte Carlo Methods and Applications, normalising constant on May 21, 2022 by xi'an**Y**et another paper that addresses the approximation of the marginal likelihood by a truncated harmonic mean, a popular theme of mine. A 2020 paper by Johannes Reich, entitled *Estimating marginal likelihoods from the posterior draws through a geometric identity* and published in Monte Carlo Methods and Applications.

The geometric identity it aims at exploiting is that

for any (positive volume) compact set $A$. **This is exactly the same identity** as in an earlier and uncited 2017 paper by Ana Pajor, with the also quite similar (!) title *Estimating the Marginal Likelihood Using the Arithmetic Mean Identity* and which I discussed on the ‘Og, linked with another 2012 paper by Lenk. Also discussed here. This geometric or arithmetic identity is again related to the harmonic mean correction based on a HPD region A that Darren Wraith and myself proposed at MaxEnt 2009. And that Jean-Michel and I presented at Frontiers of statistical decision making and Bayesian analysis in 2010.

In this avatar, the set A is chosen close to an HPD region, once more, with a structure that allows for an exact computation of its volume. Namely an ellipsoid that contains roughly 50% of the simulations from the posterior (rather than our non-intersecting union of balls centered at the 50% HPD points), which assumes a Euclidean structure of the parameter space (or, in other words, depends on the parameterisation)In the mixture illustration, the author surprisingly omits Chib’s solution, despite symmetrised versions avoiding the label (un)switching issues. . What I do not get is how this solution gets around the label switching challenge in that set A remains an ellipsoid for multimodal posteriors, which means it either corresponds to a single mode [but then how can a simulation be restricted to a “*single **permutation of the indicator labels*“?] or it covers all modes but also the unlikely valleys in-between.

## identifying mixtures

Posted in Books, pictures, Statistics with tags clustering, finite mixtures, identifiability, JASA, label switching, MCMC, STAN on February 27, 2022 by xi'an**I** had not read this 2017 discussion of Bayesian mixture estimation by Michael Betancourt before I found it mentioned in a recent paper. Where he re-explores the issue of identifiability and label switching in finite mixture models. Calling somewhat abusively *degenerate* mixtures where all components share the same family, e.g., mixtures of Gaussians. Illustrated by Stan code and output. This is rather traditional material, in that the non-identifiability of mixture components has been discussed in many papers and at least as many solutions proposed to overcome the difficulties of exploring the posterior distribution. Including our 2000 JASA paper with Gilles Celeux and Merrilee Hurn. With my favourite approach being the label-free representations as a point process in the parameter space (following an idea of Peter Green) or as a collection of clusters in the latent variable space. I am much less convinced by ordering constraints: while they formally differentiate and therefore identify the individual components of a mixture, they partition the parameter space with no regard towards the geometry of the posterior distribution. With in turn potential consequences on MCMC explorations of this fragmented surface that creates barriers for simulated Markov chains. Plus further difficulties with inferior but attracting modes in identifiable situations.

## ordered allocation sampler

Posted in Books, Statistics with tags Data augmentation, Galaxy, Gibbs sampling, hidden Markov models, JASA, label switching, latent variable models, MCMC, partition function, random partition trees, SMC, statistical methodology on November 29, 2021 by xi'an**R**ecently, Pierpaolo De Blasi and María Gil-Leyva arXived a proposal for a novel Gibbs sampler for mixture models. In both finite and infinite mixture models. In connection with Pitman (1996) theory of species sampling and with interesting features in terms of removing the vexing label switching features.

“The key idea is to work with the mixture components in the random order of appearance in an exchangeable sequence from the mixing distribution (…) In accordance with the order of appearance, we derive a new Gibbs sampling algorithm that we name the ordered allocation sampler. “

This central idea is thus a reinterpretation of the mixture model as the marginal of the component model when its parameter is distributed as a species sampling variate. An ensuing marginal algorithm is to integrate out the weights and the allocation variables to only consider the non-empty component parameters and the partition function, which are label invariant. Which reminded me of the proposal we made in our 2000 JASA paper with Gilles Celeux and Merrilee Hurn (one of my favourite papers!). And of the [first paper in Statistical Methodology] 2004 partitioned importance sampling version with George Casella and Marty Wells. As in the later, the solution seems to require the prior on the component parameters to be conjugate (as I do not see a way to produce an unbiased estimator of the partition allocation probabilities).

The ordered allocation sample considers the posterior distribution of the different object made of the parameters and of the sequence of allocations to the components for the sample written in a given order, ie y¹,y², &tc. Hence y¹ always gets associated with component 1, y² with either component 1 or component 2, and so on. For this distribution, the full conditionals are available, incl. the full posterior on the number *m* of components, only depending on the data through the partition sizes and the number *m⁺* of non-empty components. (Which relates to the debate as to whether or not m is estimable…) This sequential allocation reminded me as well of an earlier 2007 JRSS paper by Nicolas Chopin. Albeit using particles rather than Gibbs and applied to a hidden Markov model. Funny enough, their synthetic dataset *univ4* almost resembles the Galaxy dataset (as in the above picture of mine)!