Archive for finite mixtures

simulation fodder for future exams

Posted in Books, Kids, R, Statistics on February 20, 2019 by xi'an

Here are two nice exercises for a future simulation exam, seen and solved on X validated. The first one is about simulating a Gibbs sampler associated with the joint target


defined over IR² for a≥0 (or possibly a>-1). The conditionals are identical and non-standard, but a simple bound on the conditional density is the corresponding standard double exponential density, which makes for a straightforward accept-reject implementation. However it is also feasible to break the full conditional into three parts, depending on the respective positions of x, y, and 0, and to obtain easily invertible cdfs on the three intervals.

The second exercise is about simulating from the cdf


F(x) = 1 - exp{-ax - bx^(p+1)/(p+1)},  x ≥ 0,

which can be numerically inverted. It is however more fun to call for an accept-reject algorithm by bounding the density with a ½ ½ mixture of an Exponential Exp(a) and of the 1/(p+1)-th power of an Exponential Exp(b/(p+1)). Since no extra constant appears in the solution, I suspect the (p+1) in b/(p+1) was introduced on purpose. As seen in the above fit for 10⁶ simulations (and a=1, b=2, p=3), there is no deviation from the target!

There is nonetheless an even simpler and rather elegant resolution to the exercise: since the tail function (1-F(x)) appears as the product of two tail functions, exp(-ax) and the other one, the cdf is the distribution of the minimum of two random variates, one with the Exp(a) distribution and the other one being the 1/(p+1)-th power of an Exponential Exp(b/(p+1)) variate. Which of course returns a very similar histogram fit.
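For the record, here is a minimal Python sketch of both resolutions of the second exercise, assuming the tail function is exp{-ax - bx^(p+1)/(p+1)} as implied by the product-of-tails argument, with Exp(·) parameterised by its rate. The function names are mine and purely illustrative. The accept-reject bound follows from (a + bx^p) exp{-ax - bx^(p+1)/(p+1)} ≤ a exp{-ax} + bx^p exp{-bx^(p+1)/(p+1)}, that is, twice the ½ ½ mixture density.

```python
import math
import random


def tail(x, a, b, p):
    """Tail function 1 - F(x) = exp(-a x - b x^(p+1)/(p+1)) for x >= 0."""
    return math.exp(-a * x - b * x ** (p + 1) / (p + 1))


def density(x, a, b, p):
    """Density obtained by differentiating F: (a + b x^p) times the tail."""
    return (a + b * x ** p) * tail(x, a, b, p)


def draw_power_exp(b, p, rng):
    """1/(p+1)-th power of an Exp(rate b/(p+1)) variate.

    Its tail function is exp(-b x^(p+1)/(p+1)).
    """
    return rng.expovariate(b / (p + 1)) ** (1.0 / (p + 1))


def sample_min(a, b, p, rng):
    """Minimum of an Exp(a) variate and of the power-transformed variate."""
    return min(rng.expovariate(a), draw_power_exp(b, p, rng))


def sample_accept_reject(a, b, p, rng):
    """Accept-reject with the 1/2-1/2 mixture proposal and bound M = 2."""
    while True:
        if rng.random() < 0.5:
            x = rng.expovariate(a)
        else:
            x = draw_power_exp(b, p, rng)
        mixture = 0.5 * a * math.exp(-a * x) + 0.5 * b * x ** p * math.exp(
            -b * x ** (p + 1) / (p + 1)
        )
        # accept with probability density / (2 * mixture), always <= 1
        if rng.random() * 2.0 * mixture <= density(x, a, b, p):
            return x


if __name__ == "__main__":
    rng = random.Random(2019)
    a, b, p = 1.0, 2.0, 3
    n = 10 ** 5
    by_min = [sample_min(a, b, p, rng) for _ in range(n)]
    by_ar = [sample_accept_reject(a, b, p, rng) for _ in range(n)]
    x0 = 0.5
    print("target cdf at 0.5:", 1.0 - tail(x0, a, b, p))
    print("minimum version  :", sum(x <= x0 for x in by_min) / n)
    print("accept-reject    :", sum(x <= x0 for x in by_ar) / n)
```

With the bound M = 2, the accept-reject version wastes about one proposal out of two, while the minimum version uses exactly two Exponential variates per simulation and no rejection.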

alternatives to EM

Posted in Books, Statistics on January 30, 2019 by xi'an

In an arXived preprint submitted to Computational Statistics & Data Analysis, Chan, Han, and Lim study alternatives to EM for latent class models. That is, mixtures of products of Multinomials. (First occurrence of an indicator function being called the “Iverson bracket function”!) The introduction is fairly extensive given this most studied model. The criticisms of EM laid by the authors are that (a) it does not produce an evaluation of the estimation error, which does not sound correct; (b) the convergence is slow, which is also rather misleading as my [low dimensional] experience with mixtures is that it gets very quickly, and apparently linearly, to the vicinity of one of the modes. The argument in favour of alternative non-linear optimisation approaches is that they can achieve quadratic convergence. One solution is a projected Quasi-Newton method, based on a quadratic approximation to the target. With some additional intricacies that make the claim of being “way easier than EM algorithm” somewhat specious. The second approach proposed in the paper is sequential quadratic programming, which incorporates the Lagrange multiplier in the target. While the different simulations in the paper show that EM may indeed call for a much larger number of iterations, the obtained likelihoods are all comparable.
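As a point of comparison, here is a bare-bones sketch of the EM baseline for a latent class model, that is, a mixture of products of Multinomials. This is my own illustrative version in pure Python, not the authors' code: the function name, the random initialisation, and the fixed number of iterations are all assumptions. It returns the trace of the observed log-likelihood, which EM can only increase from one iteration to the next.

```python
import math
import random


def em_latent_class(data, n_classes, n_categories, n_iter=50, seed=0):
    """EM for a latent class model (illustrative sketch).

    data: list of tuples, data[i][j] in {0, ..., n_categories-1} being the
    category observed for item j on individual i.
    Returns (weights, probs, loglik_trace) with probs[k][j][c] the
    probability of category c for item j within class k.
    """
    rng = random.Random(seed)
    n_items = len(data[0])
    weights = [1.0 / n_classes] * n_classes
    probs = []
    for _ in range(n_classes):
        rows = []
        for _ in range(n_items):
            raw = [rng.random() + 0.5 for _ in range(n_categories)]
            total = sum(raw)
            rows.append([r / total for r in raw])
        probs.append(rows)

    loglik_trace = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities for each individual
        resp = []
        loglik = 0.0
        for x in data:
            joint = [
                weights[k] * math.prod(probs[k][j][x[j]] for j in range(n_items))
                for k in range(n_classes)
            ]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([v / total for v in joint])
        loglik_trace.append(loglik)

        # M-step: weighted relative frequencies
        for k in range(n_classes):
            mass = sum(r[k] for r in resp)
            weights[k] = mass / len(data)
            if mass > 0.0:
                for j in range(n_items):
                    for c in range(n_categories):
                        num = sum(r[k] for r, x in zip(resp, data) if x[j] == c)
                        probs[k][j][c] = num / mass
    return weights, probs, loglik_trace


if __name__ == "__main__":
    rng = random.Random(42)
    sample = []
    for _ in range(300):
        success = 0.8 if rng.random() < 0.4 else 0.2
        sample.append(tuple(int(rng.random() < success) for _ in range(5)))
    w, th, trace = em_latent_class(sample, n_classes=2, n_categories=2)
    print("weights:", w)
    print("log-likelihood, first and last iterations:", trace[0], trace[-1])
```

Plotting the trace against the iteration index gives a quick check on how fast EM reaches the vicinity of a mode for a given dataset and starting point.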

Big Bayes goes South

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on December 5, 2018 by xi'an

At the Big [Data] Bayes conference this week [which I found quite exciting despite a few last minute cancellations by speakers] there were a lot of clustering talks, including one by Amy Herring (Duke), using a notion of centering that should soon appear on arXiv. By Peter Müller (UT, Austin) towards handling large datasets. Based on a predictive recursion that takes one value at a time, unsurprisingly similar to the update of Dirichlet process mixtures. (Inspired by a 1998 paper by Michael Newton and co-authors.) The recursion doubles in size at each observation, requiring culling of negligible components. Order matters? Links with Malsiner-Walli et al. (2017) mixtures of mixtures. Also talks by Antonio Lijoi and Igor Pruenster (Bocconi, Milano) on completely random measures that are used in creating clusters. And by Sylvia Frühwirth-Schnatter (WU Wien) on creating clusters for the Austrian labor market of the impact of company closure. And by Gregor Kastner (WU Wien) on multivariate factor stochastic models, with a video of a large covariance matrix evolving over time and catching economic crises. And by David Dunson (Duke) on distance clustering. Reflecting like myself on the definitely ill-defined nature of the [clustering] object. As the sample size increases, spurious clusters appear. (Which reminded me of a disagreement I had had with David McKay at an ICMS conference on mixtures twenty years ago.) Making me realise I missed the recent JASA paper by Miller and Dunson on that perspective.

Some further snapshots (with short comments visible by hovering on the picture) of a very high quality meeting [says one of the organisers!]. Following suggestions from several participants, it would be great to hold another meeting at CIRM in the near future.

JSM 2018 [#4]

Posted in Mountains, Statistics, Travel, University life on August 3, 2018 by xi'an

A last ½ day of sessions at JSM2018 in an almost deserted conference centre, with a first session set together by Mario Peruggia and a second on Advances in Bayesian Nonparametric Modeling and Computation for Complex Data. Here are the slides of my talk this morning in the Bayesian mixture estimation session, which I updated last night (Slideshare most absurdly does not let you update versions!).

Since I missed the COPSS Award ceremony for a barbecue with friends on Locarno Beach, I only discovered this morning that the winner this year is Richard Samworth, from Cambridge University, who eminently deserves this recognition, if only because of his contributions to journal editing, as I can attest from my years with JRSS B. Congrats to him as well as to Bin Yu and Susan Murphy for their E.L. Scott and R.A. Fisher Awards! I also found out from an email to JSM participants that the next edition is in Denver, Colorado, which I visited only once in 1993 on a trip to Fort Collins visiting Kerrie Mengersen and Richard Tweedie. Given the proximity to the Rockies, I am thinking of submitting an invited session on ABC issues, which were not particularly well covered by this edition of JSM. (Feel free to contact me if you are interested in joining the session.)

sliced Wasserstein estimation of mixtures

Posted in Books, pictures, R, Statistics on November 28, 2017 by xi'an

A paper by Soheil Kolouri and co-authors was arXived last week about using Wasserstein distance for inference on multivariate Gaussian mixtures. The basic concept is that the parameter is estimated by minimising the p-Wasserstein distance to the empirical distribution, smoothed by a Normal kernel. As the general Wasserstein distance is quite costly to compute, the approach relies on a sliced version, which means computing the Wasserstein distance between one-dimensional projections of the distributions. Optimising over the directions is an additional computational constraint.
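To fix ideas on the slicing, here is a small sketch of a Monte Carlo sliced p-Wasserstein distance between two samples of equal size, in pure Python. It is only an illustration under my own simplifying assumptions and not the procedure of the paper: directions are drawn uniformly on the sphere and averaged over (rather than optimised), there is no Normal kernel smoothing, and the function names are made up. It relies on the fact that, in dimension one, the p-Wasserstein distance between two empirical distributions with the same number of atoms is obtained by matching order statistics.

```python
import math
import random


def wasserstein_1d(u, v, p=2):
    """p-Wasserstein distance between two equal-size one-dimensional samples."""
    if len(u) != len(v):
        raise ValueError("this sketch only handles samples of equal size")
    su, sv = sorted(u), sorted(v)
    return (sum(abs(a - b) ** p for a, b in zip(su, sv)) / len(su)) ** (1.0 / p)


def random_direction(dim, rng):
    """Uniform direction on the unit sphere via a normalised Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def sliced_wasserstein(xs, ys, n_dir=100, p=2, seed=0):
    """Monte Carlo average of 1-d Wasserstein distances over random projections."""
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_dir):
        d = random_direction(dim, rng)
        px = [sum(a * b for a, b in zip(x, d)) for x in xs]
        py = [sum(a * b for a, b in zip(y, d)) for y in ys]
        total += wasserstein_1d(px, py, p) ** p
    return (total / n_dir) ** (1.0 / p)


if __name__ == "__main__":
    rng = random.Random(1)
    sample_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    sample_b = [(rng.gauss(2, 1), rng.gauss(0, 1)) for _ in range(200)]
    print("sliced 2-Wasserstein distance:", sliced_wasserstein(sample_a, sample_b))
```

Each projection only costs a sort, which is where the computational gain over the general Wasserstein distance comes from.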

“To fit a finite GMM to the observed data, one is required to answer the following questions: 1) how to estimate the number of mixture components needed to represent the data, and 2) how to estimate the parameters of the mixture components.”

The paper contains a most puzzling comment opposing maximum likelihood estimation to minimum Wasserstein distance estimation on the basis that the latter would not suffer from multimodality. This sounds incorrect as the multimodality of a mixture model (likelihood) stems from the lack of identifiability of the parameters. If all permutations of these parameters induce exactly the same distribution, they all stand at the same distance from the data distribution, whatever the distance is. Furthermore, the above tartan-like picture clashes with the representation of the log-likelihood of a Normal mixture, as exemplified by the picture below based on a sample of size 150 with means 0 and 2, same unit variance, and weights 0.3 and 0.7, which shows a smooth if bimodal structure:

And for the same dataset, my attempt at producing a Wasserstein “energy landscape” does return a multimodal structure (this is the surface of minus the logarithm of the 2-Wasserstein distance):

“Jin et al. proved that with random initialization, the EM algorithm will converge to a bad critical point with high probability.”

This statement is most curious in that the “probability” in the assessment must depend on the choice of the random initialisation, hence on a sort of prior distribution that is not made explicit in the paper. Which remains blissfully unaware of Bayesian approaches.

Another [minor mode] puzzling statement is that the p-Wasserstein distance is defined on the space of probability measures with finite p-th moment, which does not make much sense when what matters is rather the finiteness of the expectation of the distance d(X,Y) raised to the power p. A lot of the maths details either do not make sense or seem superfluous.

relabelling in Bayesian mixtures by pivotal units

Posted in Statistics on September 14, 2017 by xi'an

Yet another paper on relabelling for mixtures, when one would think everything and more has already been said and written on the topic… This one appeared in Statistics and Computing last August and I only became aware of it through ResearchGate, which sent me an unsolicited email that this paper quoted one of my own papers. As well as Bayesian Essentials.

The current paper by Egidi, Pappadà, Pauli and Torelli starts from the remark that the similarity matrix of the probabilities for pairs of observations to be in the same component is invariant to label switching. A property we also used in our 2000 JASA paper. But here the authors assume it is possible to find pivots, that is, as many observations as there are components such that any pair of them is never in the same component with posterior probability one. These pivots are then used for the relabelling, as they define a preferential relabelling at each iteration. Now, this is not always possible since there are presumably iterations with empty components and there is rarely a zero probability that enough pairs never meet. The resolution of this quandary is then to remove the iterations for which this happens, a subsampling that changes the nature of the MCMC chain and may jeopardise its Markovian validation. The authors however suggest using alternative and computationally cheaper solutions to identify the pivots. (Which confuses me as to which solution they adopt.)
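The invariance property is easily checked on a toy example. The sketch below, in pure Python with made-up function names, computes the similarity matrix from a list of MCMC allocation vectors, applies an arbitrary relabelling at each iteration, and verifies that the matrix is unchanged. The pivot check is my literal reading of the definition above (pairs that never share a component over the recorded iterations), not the authors' search procedure.

```python
import random
from itertools import combinations


def similarity_matrix(allocations):
    """Frequency with which each pair of observations shares a component.

    allocations: list of MCMC iterations, each a list of component labels,
    one label per observation.
    """
    n = len(allocations[0])
    counts = [[0] * n for _ in range(n)]
    for z in allocations:
        for i in range(n):
            for j in range(n):
                if z[i] == z[j]:
                    counts[i][j] += 1
    n_draws = len(allocations)
    return [[c / n_draws for c in row] for row in counts]


def relabel(allocations, n_components, seed=0):
    """Apply an independent random permutation of the labels at each iteration."""
    rng = random.Random(seed)
    out = []
    for z in allocations:
        perm = list(range(n_components))
        rng.shuffle(perm)
        out.append([perm[k] for k in z])
    return out


def is_pivot_set(sim, indices):
    """True when no pair among `indices` is ever found in the same component."""
    return all(sim[i][j] == 0.0 for i, j in combinations(indices, 2))


if __name__ == "__main__":
    draws = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 0]]
    sim = similarity_matrix(draws)
    print(sim)
    print("invariant under relabelling:", similarity_matrix(relabel(draws, 2)) == sim)
    print("observations 0 and 2 as pivots:", is_pivot_set(sim, [0, 2]))
```

On actual MCMC output, exact zeros in the similarity matrix become rarer as the number of iterations grows, which is the difficulty raised above.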

The next part of the paper compares this approach with seven other solutions found in the literature, from Matthew Stephens’ (2000) to our permutation reordering. Which does pretty well in terms of MSE in the simulation study (see the massive Table 3) while being much cheaper to implement than the proposed pivotal relabelling (Table 4). And which, contrary to the authors’ objection, does not require the precise computation of the MAP since, as indicated in our paper, the relative maximum based on the MCMC iterations can be used as a proxy. I am thus less than convinced by the improvement brought by this alternative…

repulsive mixtures

Posted in Books, Statistics on April 10, 2017 by xi'an

Fangzheng Xie and Yanxun Xu arXived today a paper on Bayesian repulsive modelling for mixtures. Not that Bayesian modelling is repulsive in any psychological sense, but rather that the components of the mixture are repulsive one against another. The device towards this repulsiveness is to add a penalty term to the original prior such that close means are penalised. (In the spirit of the sugar loaf with water drops represented on the cover of Bayesian Choice that we used in our pinball sampler, repulsiveness being there on the particles of a simulated sample and not on components.) Which means a prior assumption that close covariance matrices are of lesser importance. A question I have is why empty components are not excluded as well, but this does not make too much sense in the Dirichlet process formulation of the current paper. And in the finite mixture version the Dirichlet prior on the weights has coefficients less than one.

The paper establishes consistency results for such repulsive priors, both for estimating the distribution itself and the number of components, K, under a collection of assumptions on the distribution, prior, and repulsiveness factors. While I have no mathematical issue with such results, I always wonder at their relevance for a given finite sample from a finite mixture in that they give an impression that the number of components is a perfectly estimable quantity, which it is not (in my opinion!) because of the fluid nature of mixture components and therefore the inevitable impact of prior modelling. (As Larry Wasserman would pound in, mixtures like tequila are evil and should likewise be avoided!)

The implementation of this modelling goes through a “block-collapsed” Gibbs sampler that exploits the latent variable representation (as in our early mixture paper with Jean Diebolt). The paper includes the Old Faithful data as an illustration (for which a submission of ours was recently rejected for using too old datasets). And uses the logarithm of the conditional predictive ordinate as an assessment tool, which is a posterior predictive estimated by MCMC, using the data a second time for the fit.
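For completeness, the standard MCMC estimate of the conditional predictive ordinate of an observation is the harmonic mean of its likelihood values over the posterior draws. Here is a short sketch of that estimate on the log scale, with function names of my own choosing and the log-densities assumed to be already computed at each MCMC iteration. Whether the paper uses exactly this estimator is an assumption on my part.

```python
import math


def log_cpo(log_dens_draws):
    """Log conditional predictive ordinate of one observation.

    log_dens_draws: values of log f(y_i | theta_t) over the MCMC draws.
    Computes log CPO_i = log T - logsumexp(-log f(y_i | theta_t)),
    i.e. the log of the harmonic mean of the likelihood values.
    """
    neg = [-value for value in log_dens_draws]
    shift = max(neg)
    log_sum = shift + math.log(sum(math.exp(v - shift) for v in neg))
    return math.log(len(neg)) - log_sum


def log_pseudo_marginal_likelihood(log_dens):
    """Sum of the log CPOs, with log_dens[t][i] = log f(y_i | theta_t)."""
    n_obs = len(log_dens[0])
    return sum(log_cpo([draw[i] for draw in log_dens]) for i in range(n_obs))


if __name__ == "__main__":
    # two draws with likelihood values 0.5 and 0.25: harmonic mean is 1/3
    print(math.exp(log_cpo([math.log(0.5), math.log(0.25)])))
```

Being a harmonic mean, this estimate is dominated by the draws with the smallest likelihood values and can therefore be quite unstable.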