## sliced Wasserstein estimation of mixtures

Posted in Books, pictures, R, Statistics with tags , , , , , , on November 28, 2017 by xi'an

A paper by Soheil Kolouri and co-authors was arXived last week about using Wasserstein distance for inference on multivariate Gaussian mixtures. The basic concept is that the parameter is estimated by minimising the p-Wasserstein distance to the empirical distribution, smoothed by a Normal kernel. As the general Wasserstein distance is quite costly to compute, the approach relies on a sliced version, which means computing the Wasserstein distance between one-dimensional projections of the distributions. Optimising over the directions is an additional computational constraint.

“To fit a finite GMM to the observed data, one is required to answer the following questions: 1) how to estimate the number of mixture components needed to represent the data, and 2) how to estimate the parameters of the mixture components.”

The paper contains a most puzzling comment opposing maximum likelihood estimation to minimum Wasserstein distance estimation on the basis that the later would not suffer from multimodality. This sounds incorrect as the multimodality of a mixture model (likelihood) stems from the lack of identifiability of the parameters. If all permutations of these parameters induce exactly the same distribution, they all stand at the same distance from the data distribution, whatever the distance is. Furthermore, the above tartan-like picture clashes with the representation of the log-likelihood of a Normal mixture, as exemplified by the picture below based on a 150 sample with means 0 and 2, same unit variance, and weights 0.3 and 0.7, which shows a smooth if bimodal structure:And for the same dataset, my attempt at producing a Wasserstein “energy landscape” does return a multimodal structure (this is the surface of minus the logarithm of the 2-Wasserstein distance):“Jin et al. proved that with random initialization, the EM algorithm will converge to a bad critical point with high probability.”

This statement is most curious in that the “probability” in the assessment must depend on the choice of the random initialisation, hence on a sort of prior distribution that is not explicited in the paper. Which remains blissfully unaware of Bayesian approaches.

Another [minor mode] puzzling statement is that the p-Wasserstein distance is defined on the space of probability measures with finite p-th moment, which does not make much sense when what matters is rather the finiteness of the expectation of the distance d(X,Y) raised to the power p. A lot of the maths details either do not make sense or seem superfluous.

## relabelling in Bayesian mixtures by pivotal units

Posted in Statistics with tags , , , , on September 14, 2017 by xi'an

Yet another paper on relabelling for mixtures, when one would think everything and more has already be said and written on the topic… This one appeared in Statistics and Computing last August and I only became aware of it through ResearchGate which sent me an unsolicited email that this paper quoted one of my own papers. As well as Bayesian Essentials.

The current paper by Egidi, Pappadà, Pauli and Torelli starts from the remark that the similarity matrix of the probabilities for pairs of observations to be in the same component is invariant to label switching. A property we also used in our 2000 JASA paper. But here the authors assume it is possible to find pivots, that is, as many observations as there are components such that any pair of them is never in the same component with posterior probability one. These pivots are then used for the relabelling, as they define a preferential relabelling at each iteration. Now, this is not always possible since there are presumably iterations with empty components and there is rarely a zero probability that enough pairs never meet. The resolution of this quandary is then to remove the iterations for which this happens, a subsampling that changes the nature of the MCMC chain and may jeopardise its Markovian validation. The authors however suggest using alternative and computationally cheaper solutions to identify the pivots. (Which confuses me as to which solution they adopt.)

The next part of the paper compares this approach with seven other solutions found in the literature, from Matthew Stephens’ (2000) to our permutation reordering. Which does pretty well in terms of MSE in the simulation study (see the massive Table 3) while being much cheaper to implement than the proposed pivotal relabelling (Table 4). And which, contrary to the authors’ objection, does not require the precise computation of the MAP since, as indicated in our paper, the relative maximum based on the MCMC iterations can be used as a proxy. I am thus less than convinced at the improvement brought by this alternative…

## twilight zone [of statistics]

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , , , on February 26, 2016 by xi'an

“I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs.” L. Wasserman

Larry Wasserman once remarked that finite mixtures were like the twilight zone of statistics, thanks to the numerous idiosyncrasies associated with such models. And George Casella had similar strong reservations about mixture estimation. Avi Feller and co-authors [including Natesh Pillai] have just arXived a paper on this topic, exhibiting shocking (!) properties of the MLE! Their core example is a mixture of two normal distributions with known common variance and known weight different from 0.5, which ensures identifiability. This is a favourite example of mine that we used for instance in our book Introducing Monte Carlo methods with R. If only because we can plot the likelihood and posterior surfaces. (Warning: I wrote those notes on an earlier version of the paper, so mileage may vary in terms of accuracy!)

The “shocking” discovery in the paper is that the MLE is wrong as often as not in selecting the sign of the difference Δ between both means, with an additional accumulation point at zero. The global mode may thus be in the wrong place for small enough sample sizes. And even for larger sizes: when the difference between the means is small the likelihood is likely to be unimodal with a mode quite close to zero. (An interesting remark is that the likelihood derivative is always zero at Δ=0 when considering the special case of both means equal to -Δ and to πΔ/(1-π), respectively, which implies that the overall mean of the mixture is equal to zero. A potential connection with our reparameterisation paper, maybe?)

The alternative proposed by Avi and his co-authors is to proceed through moments, i.e., to revert to Pearson (1892). There are however difficulties with this approach, first and foremost the non-uniqueness of the moment equations used to estimate Δ. For instance, the second cumulant equation chosen by the authors is not always defined as opposed to the third cumulant equation (why not using this third cumulant then). Which does not always produce the right sign… But, in a strange twist, the authors turn those deficiencies into signals for both pathologies (wrong sign and “pile-up” at zero).

“…the grid bootstrap yields an exact p-value for any valid test statistic.”

The most importance issue in this framework being in estimating the parameters, the authors opt for an approach based on tests, which is definitely surprising given the well-known deficiencies of standard tests in mixtures. The test chosen here is a Wald test with a statistic equal to the χ² version of the first cumulant differences. I am surprised that the χ² approximation works in such an unfriendly setting. And I do not understand how the grid is used, unless a certain degree of approximation is accepted, which takes us back to the “dark ages” of imposing a minimal distance Δ to achieve consistency, as in Ghosh and Sen (1985).

“..our concern about sign error is trivial in the Bayesian setting: the global mode is simply a poor summary of a multi-modal posterior. More broadly, the weak identification issues we highlight in this paper are not necessarily relevant to a strict Bayesian.”

A priori, I do not think pathologies of the MLE always transfer to Bayes estimators, unless one uses the MAP as an [poor] estimator. But using the MAP is not necessary since posterior means are meaningful in this identified setting, where label switching should not occur. However, running the same experiments with a Gaussian prior on both means and using the posterior mean as my estimator, I did obtain the same pathology of Bayes estimates [also produced in the supplementary material] not concentrating on the true value of the difference, but putting weight on the opposite value and at zero. Using a less standard prior inspired by David Rossell’s talk on non-local priors two weeks ago, which avoids a neighbourhood of zero, I did not get a much different picture as illustrated below:

Overall, I remain somewhat uncertain as to what to conclude from this pathological behaviour. When both means are close enough, the sign of the difference is often estimated wrongly. But that could simply mean that the means are not significantly different, for that sample size…

## mixtures as exponential families

Posted in Kids, Statistics with tags , , , , , , , on December 8, 2015 by xi'an

Something I had not realised earlier and that came to me when answering a question on X validated about the scale parameter of a Gamma distribution. Following an earlier characterisation by Dennis Lindley, Ferguson has written a famous paper characterising location, scale and location-scale families within exponential families. For instance, a one-parameter location exponential family is necessarily the logarithm of a power of a Gamma distribution. What I found surprising is the equivalent for one-parameter scale exponential families: they are necessarily mixtures of positive and negative powers of Gamma distributions. This is surprising because a mixture does not seem to fit within the exponential family representation… I first thought Ferguson was using a different type of mixtures. Or of exponential family. But, after checking the details, it appears that the mixture involves a component on ℜ⁺ and another component on ℜ⁻ with a potential third component as a Dirac mass at zero. Hence, it only nominally writes as a mixture and does not offer the same challenges as a regular mixture. Not label switching. No latent variable. Having mutually exclusive supports solves all those problems and even allows for an indicator function to permeate through the exponential function… (Recall that the special mixture processed in Rubio and Steel also enjoys this feature.)

## Overfitting Bayesian mixture models with an unknown number of components

Posted in Statistics with tags , , , , , , , , on March 4, 2015 by xi'an

During my Czech vacations, Zoé van Havre, Nicole White, Judith Rousseau, and Kerrie Mengersen1 posted on arXiv a paper on overfitting mixture models to estimate the number of components. This is directly related with Judith and Kerrie’s 2011 paper and with Zoé’s PhD topic. The paper also returns to the vexing (?) issue of label switching! I very much like the paper and not only because the author are good friends!, but also because it brings a solution to an approach I briefly attempted with Marie-Anne Gruet in the early 1990’s, just before finding about the reversible jump MCMC algorithm of Peter Green at a workshop in Luminy and considering we were not going to “beat the competition”! Hence not publishing the output of our over-fitted Gibbs samplers that were nicely emptying extra components… It also brings a rebuke about a later assertion of mine’s at an ICMS workshop on mixtures, where I defended the notion that over-fitted mixtures could not be detected, a notion that was severely disputed by David McKay…

What is so fantastic in Rousseau and Mengersen (2011) is that a simple constraint on the Dirichlet prior on the mixture weights suffices to guarantee that asymptotically superfluous components will empty out and signal they are truly superfluous! The authors here cumulate the over-fitted mixture with a tempering strategy, which seems somewhat redundant, the number of extra components being a sort of temperature, but eliminates the need for fragile RJMCMC steps. Label switching is obviously even more of an issue with a larger number of components and identifying empty components seems to require a lack of label switching for some components to remain empty!

When reading through the paper, I came upon the condition that only the priors of the weights are allowed to vary between temperatures. Distinguishing the weights from the other parameters does make perfect sense, as some representations of a mixture work without those weights. Still I feel a bit uncertain about the fixed prior constraint, even though I can see the rationale in not allowing for complete freedom in picking those priors. More fundamentally, I am less and less happy with independent identical or exchangeable priors on the components.

Our own recent experience with almost zero weights mixtures (and with Judith, Kaniav, and Kerrie) suggests not using solely a Gibbs sampler there as it shows poor mixing. And even poorer label switching. The current paper does not seem to meet the same difficulties, maybe thanks to (prior) tempering.

The paper proposes a strategy called Zswitch to resolve label switching, which amounts to identify a MAP for each possible number of components and a subsequent relabelling. Even though I do not entirely understand the way the permutation is constructed. I wonder in particular at the cost of the relabelling.

## trans-dimensional nested sampling and a few planets

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , on March 2, 2015 by xi'an

This morning, in the train to Dauphine (train that was even more delayed than usual!), I read a recent arXival of Brendon Brewer and Courtney Donovan. Entitled Fast Bayesian inference for exoplanet discovery in radial velocity data, the paper suggests to associate Matthew Stephens’ (2000)  birth-and-death MCMC approach with nested sampling to infer about the number N of exoplanets in an exoplanetary system. The paper is somewhat sparse in its description of the suggested approach, but states that the birth-date moves involves adding a planet with parameters simulated from the prior and removing a planet at random, both being accepted under a likelihood constraint associated with nested sampling. I actually wonder if this actually is the birth-date version of Peter Green’s (1995) RJMCMC rather than the continuous time birth-and-death process version of Matthew…

“The traditional approach to inferring N also contradicts fundamental ideas in Bayesian computation. Imagine we are trying to compute the posterior distribution for a parameter a in the presence of a nuisance parameter b. This is usually solved by exploring the joint posterior for a and b, and then only looking at the generated values of a. Nobody would suggest the wasteful alternative of using a discrete grid of possible a values and doing an entire Nested Sampling run for each, to get the marginal likelihood as a function of a.”

This criticism is receivable when there is a huge number of possible values of N, even though I see no fundamental contradiction with my ideas about Bayesian computation. However, it is more debatable when there are a few possible values for N, given that the exploration of the augmented space by a RJMCMC algorithm is often very inefficient, in particular when the proposed parameters are generated from the prior. The more when nested sampling is involved and simulations are run under the likelihood constraint! In the astronomy examples given in the paper, N never exceeds 15… Furthermore, by merging all N’s together, it is unclear how the evidences associated with the various values of N can be computed. At least, those are not reported in the paper.

The paper also omits to provide the likelihood function so I do not completely understand where “label switching” occurs therein. My first impression is that this is not a mixture model. However if the observed signal (from an exoplanetary system) is the sum of N signals corresponding to N planets, this makes more sense.

## relabelling mixtures (#2)

Posted in Statistics, Travel, University life with tags , , , , , , on February 5, 2015 by xi'an

Following the previous post, I went and had  a (long) look at Puolamäki and Kaski’s paper. I must acknowledge that, despite having several runs through the paper, I still have trouble with the approach… From what I understand, the authors use a Bernoulli mixture pseudo-model to reallocate the observations to components.  That is, given an MCMC output with simulated allocations variables (a.k.a., hidden or latent variables), they create a (TxK)xn matrix of component binary indicators e.g., for a three component mixture,

0 1 0 0 1 0…
1 0 0 0 0 0…
0 0 1 1 0 1…
0 1 0 0 1 1…

and estimate a probability to be in component j for each of the n observations, according to the (pseudo-)likelihood

$\prod_{r=1}^R \sum_{j=1}^K \prod_{i=1}^n \beta_{i,j}^{z_{i,r}}(1-\beta_{i,j})^{1-z_{i,r}}$

It took me a few days, between morning runs and those wee hours when I cannot get back to sleep (!), to make some sense of this Bernoulli modelling. The allocation vectors are used together to estimate the probabilities of being “in” component j together. However the data—which is the outcome of an MCMC simulation and de facto does not originate from that Bernoulli mixture—does not seem appropriate, both because it is produced by an MCMC simulation and is made of blocks of highly correlated rows [which sum up to one]. The Bernoulli likelihood above also defines a new model, with many more parameters than in the original mixture model. And I fail to see why perfect, partial or inexistent label switching [in the MCMC sequence] is not going to impact the estimation of the Bernoulli mixture. And why an argument based on a fixed parameter value (Theorem 3) extends to an MCMC outcome where parameters themselves are subjected to some degree of label switching. Bemused, I remain…