## ordered allocation sampler

Posted in Books, Statistics with tags , , , , , , , , , , , on November 29, 2021 by xi'an

Recently, Pierpaolo De Blasi and María Gil-Leyva arXived a proposal for a novel Gibbs sampler for mixture models. In both finite and infinite mixture models. In connection with Pitman (1996) theory of species sampling and with interesting features in terms of removing the vexing label switching features.

The key idea is to work with the mixture components in the random order of appearance in an exchangeable sequence from the mixing distribution (…) In accordance with the order of appearance, we derive a new Gibbs sampling algorithm that we name the ordered allocation sampler. “

This central idea is thus a reinterpretation of the mixture model as the marginal of the component model when its parameter is distributed as a species sampling variate. An ensuing marginal algorithm is to integrate out the weights and the allocation variables to only consider the non-empty component parameters and the partition function, which are label invariant. Which reminded me of the proposal we made in our 2000 JASA paper with Gilles Celeux and Merrilee Hurn (one of my favourite papers!). And of the [first paper in Statistical Methodology] 2004 partitioned importance sampling version with George Casella and Marty Wells. As in the later, the solution seems to require the prior on the component parameters to be conjugate (as I do not see a way to produce an unbiased estimator of the partition allocation probabilities).

The ordered allocation sample considers the posterior distribution of the different object made of the parameters and of the sequence of allocations to the components for the sample written in a given order, ie y¹,y², &tc. Hence y¹ always gets associated with component 1, y² with either component 1 or component 2, and so on. For this distribution, the full conditionals are available, incl. the full posterior on the number m of components, only depending on the data through the partition sizes and the number m⁺ of non-empty components. (Which relates to the debate as to whether or not m is estimable…) This sequential allocation reminded me as well of an earlier 2007 JRSS paper by Nicolas Chopin. Albeit using particles rather than Gibbs and applied to a hidden Markov model. Funny enough, their synthetic dataset univ4 almost resembles the Galaxy dataset (as in the above picture of mine)!

## ISBA 2021 grand finale

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life, Wines with tags , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , on July 3, 2021 by xi'an

Last day of ISBA (and ISB@CIRM), or maybe half-day, since there are only five groups of sessions we can attend in Mediterranean time.

My first session was one on priors for mixtures, with 162⁺ attendees at 5:15am! (well, at 11:15 Wien or Marseille time), Gertrud Malsiner-Walli distinguishing between priors on number of components [in the model] vs number of clusters [in the data], with a minor question of mine whether or not a “prior” is appropriate for a data-dependent quantity. And Deborah Dunkel presenting [very early in the US!] anchor models for fighting label switching, which reminded me of the talk she gave at the mixture session of JSM 2018 in Vancouver. (With extensions to consistency and mixtures of regression.) And Clara Grazian debating on objective priors for the number of components in a mixture [in the Sydney evening], using loss functions to build these. Overall it seems there were many talks on mixtures and clustering this year.

After the lunch break, when several ISB@CIRM were about to leave, we ran the Objective Bayes contributed session, which actually included several Stein-like minimaxity talks. Plus one by Théo Moins from the patio of CIRM, with ciccadas in the background. Incredibly chaired by my friend Gonzalo, who had a question at the ready for each and every speaker! And then the Savage Awards II session. Which ceremony is postponed till Montréal next year. And which nominees are uniformly impressive!!! The winner will only be announced in September, via the ISBA Bulletin. Missing the ISBA general assembly for a dinner in Cassis. And being back for the Bayesian optimisation session.

I would have expected more talks at the boundary of BS & ML (as well as COVID and epidemic decision making), the dearth of which should be a cause for concern if researchers at this boundary do not prioritise ISBA meetings over more generic meetings like NeurIPS… (An exception was George Papamakarios’ talk on variational autoencoders in the Savage Awards II session.)

Many many thanks to the group of students at UConn involved in setting most of the Whova site and running the support throughout the conference. It indeed went on very smoothly and provided a worthwhile substitute for the 100% on-site version. Actually, I both hope for the COVID pandemic (or at least the restrictions attached to it) to abate and for the hybrid structure of meetings to stay, along with the multiplication of mirror workshops. Being together is essential to the DNA of conferences, but travelling to a single location is not so desirable, for many reasons. Looking for ISBA 2022, a year from now, either in Montréal, Québec, or in one of the mirror sites!

## label switching by optimal transport: Wasserstein to the rescue

Posted in Books, Statistics, Travel with tags , , , , , , , , , , , , , , on November 28, 2019 by xi'an

A new arXival by Pierre Monteiller et al. on resolving label switching by optimal transport. To appear in NeurIPS 2019, next month (where I will be, but extra muros, as I have not registered for the conference). Among other things, the paper was inspired from an answer of mine on X validated, presumably a première (and a dernière?!). Rather than picketing [in the likely unpleasant weather ]on the pavement outside the conference centre, here are my raw reactions to the proposal made in the paper. (Usual disclaimer: I was not involved in the review of this paper.)

“Previous methods such as the invariant losses of Celeux et al. (2000) and pivot alignments of Marin et al. (2005) do not identify modes in a principled manner.”

Unprincipled, me?! We did not aim at identifying all modes but only one of them, since the posterior distribution is invariant under reparameterisation. Without any bad feeling (!), I still maintain my position that using a permutation invariant loss function is a most principled and Bayesian approach towards a proper resolution of the issue. Even though figuring out the resulting Bayes estimate may prove tricky.

The paper thus adopts a different approach, towards giving a manageable meaning to the average of the mixture distributions over all permutations, not in a linear Euclidean sense but thanks to a Wasserstein barycentre. Which indeed allows for an averaged mixture density, although a point-by-point estimate that does not require switching to occur at all was already proposed in earlier papers of ours. Including the Bayesian Core. As shown above. What was first unclear to me is how necessary the Wasserstein formalism proves to be in this context. In fact, the major difference with the above picture is that the estimated barycentre is a mixture with the same number of components. Computing time? Bayesian estimate?

Green’s approach to the problem via a point process representation [briefly mentioned on page 6] of the mixture itself, as for instance presented in our mixture analysis handbook, should have been considered. As well as issues about Bayes factors examined in Gelman et al. (2003) and our more recent work with Kate Jeong Eun Lee. Where the practical impossibility of considering all possible permutations is processed by importance sampling.

An idle thought that came to me while reading this paper (in Seoul) was that a more challenging problem would be to face a model invariant under the action of a group with only a subset of known elements of that group. Or simply too many elements in the group. In which case averaging over the orbit would become an issue.

## from here to infinity

Posted in Books, Statistics, Travel with tags , , , , , , , , , , , , , on September 30, 2019 by xi'an

“Introducing a sparsity prior avoids overfitting the number of clusters not only for finite mixtures, but also (somewhat unexpectedly) for Dirichlet process mixtures which are known to overfit the number of clusters.”

On my way back from Clermont-Ferrand, in an old train that reminded me of my previous ride on that line that took place in… 1975!, I read a fairly interesting paper published in Advances in Data Analysis and Classification by [my Viennese friends] Sylvia Früwirth-Schnatter and Gertrud Malsiner-Walli, where they describe how sparse finite mixtures and Dirichlet process mixtures can achieve similar results when clustering a given dataset. Provided the hyperparameters in both approaches are calibrated accordingly. In both cases these hyperparameters (scale of the Dirichlet process mixture versus scale of the Dirichlet prior on the weights) are endowed with Gamma priors, both depending on the number of components in the finite mixture. Another interesting feature of the paper is to witness how close the related MCMC algorithms are when exploiting the stick-breaking representation of the Dirichlet process mixture. With a resolution of the label switching difficulties via a point process representation and k-mean clustering in the parameter space. [The title of the paper is inspired from Ian Stewart’s book.]

## the [not so infamous] arithmetic mean estimator

Posted in Books, Statistics with tags , , , , , , , , , on June 15, 2018 by xi'an

“Unfortunately, no perfect solution exists.” Anna Pajor

Another paper about harmonic and not-so-harmonic mean estimators that I (also) missed came out last year in Bayesian Analysis. The author is Anna Pajor, whose earlier note with Osiewalski I also spotted on the same day. The idea behind the approach [which belongs to the branch of Monte Carlo methods requiring additional simulations after an MCMC run] is to start as the corrected harmonic mean estimator on a restricted set A as to avoid tails of the distributions and the connected infinite variance issues that plague the harmonic mean estimator (an old ‘Og tune!). The marginal density p(y) then satisfies an identity involving the prior expectation of the likelihood function restricted to A divided by the posterior coverage of A. Which makes the resulting estimator unbiased only when this posterior coverage of A is known, which does not seem realist or efficient, except if A is an HPD region, as suggested in our earlier “safe” harmonic mean paper. And efficient only when A is well-chosen in terms of the likelihood function. In practice, the author notes that P(A|y) is to be estimated from the MCMC sequence and that the set A should be chosen to return large values of the likelihood, p(y|θ), through importance sampling, hence missing somehow the double opportunity of using an HPD region. Hence using the same default choice as in Lenk (2009), an HPD region which lower bound is derived as the minimum likelihood in the MCMC sample, “range of the posterior sampler output”. Meaning P(A|y)=1. (As an aside, the paper does not produce optimality properties or even heuristics towards efficiently choosing the various parameters to be calibrated in the algorithm, like the set A itself. As another aside, the paper concludes with a simulation study on an AR(p) model where the marginal may be obtained in closed form if stationarity is not imposed, which I first balked at, before realising that even in this setting both the posterior and the marginal do exist for a finite sample size, and hence the later can be estimated consistently by Monte Carlo methods.) A last remark is that computing costs are not discussed in the comparison of methods.

The final experiment in the paper is aiming at the marginal of a mixture model posterior, operating on the galaxy benchmark used by Roeder (1990) and about every other paper on mixtures since then (incl. ours). The prior is pseudo-conjugate, as in Chib (1995). And label-switching is handled by a random permutation of indices at each iteration. Which may not be enough to fight the attraction of the current mode on a Gibbs sampler and hence does not automatically correct Chib’s solution. As shown in Table 7 by the divergence with Radford Neal’s (1999) computations of the marginals, which happen to be quite close to the approximation proposed by the author. (As an aside, the paper mentions poor performances of Chib’s method when centred at the posterior mean, but this is a setting where the posterior mean is meaningless because of the permutation invariance. As another, I do not understand how the RMSE can be computed in this real data situation.) The comparison is limited to Chib’s method and a few versions of arithmetic and harmonic means. Missing nested sampling (Skilling, 2006; Chopin and X, 2011), and attuned importance sampling as in Berkoff et al. (2003), Marin, Mengersen and X (2005), and the most recent Lee and X (2016) in Bayesian Analysis.