We have been working towards a revision of our reparameterisation paper for quite a while now and too advantage of Kate Lee visiting Paris this fortnight to make a final round: we have now arXived (and submitted) the new version. The major change against the earlier version is the extension of the approach to a large class of models that include infinitely divisible distributions, compound Gaussian, Poisson, and exponential distributions, and completely monotonic densities. The concept remains identical: change the parameterisation of a mixture from a component-wise decomposition to a construct made of the first moment(s) of the distribution and of component-wise objects constrained by the moment equation(s). There is of course a bijection between both parameterisations, but the constraints appearing in the latter produce compact parameter spaces for which (different) uniform priors can be proposed. While the resulting posteriors are no longer conjugate, even conditional on the latent variables, standard Metropolis algorithms can be implemented to produce Monte Carlo approximations of these posteriors.
Archive for mixtures of distributions
Following my earlier post on Kinderman’s and Monahan’s (1977) ratio-of-uniform method, I must confess I remain quite puzzled by the approach. Or rather by its consequences. When looking at the set A of (u,v)’s in R⁺×X such that 0≤u²≤ƒ(v/u), as discussed in the previous post, it can be represented by its parameterised boundary
u(x)=√ƒ(x),v(x)=x√ƒ(x) x in X
Similarly, since the simulation from ƒ(v/u) can also be derived [check Luc Devroye’s Non-uniform random variate generation in the exercise section 7.3] from a uniform on the set B of (u,v)’s in R⁺×X such that 0≤u≤ƒ(v+u), on the set C of (u,v)’s in R⁺×X such that 0≤u³≤ƒ(v/√u)², or on the set D of (u,v)’s in R⁺×X such that 0≤u²≤ƒ(v/u), which is actually exactly the same as A [and presumably many other versions!, for which I would like to guess the generic rule of construction], there are many sets on which one can consider running simulations. And one to pick for optimality?! Here are the three sets for a mixture of two normal densities:
For instance, assuming slice sampling is feasible on every one of those three sets, which one is the most efficient? While I have no clear answer to this question, I found on Sunday night that a generic family of transforms is indexed by a differentiable monotone function h over the positive half-line, with the uniform distribution being taken over the set
when the primitive G of g is the inverse of h, i.e., G(h(x))=x. [Here are the slides I gave at the Warwick reading group on Devroye’s book last week:]
Today, I went to Milano for 13 hours to give a seminar at l’Università Bocconi. Where I thus gave a talk on Testing via mixtures (using the same slides as at ISBA last Spring). It was the first time I was in Milano (and thus at Bocconi) for more than a transfer to MCMski or to Pavia and it was great to walk through the city. And of course to meet and share with many friends there. While I glimpsed the end of the sunrise on the Italian Alps (near Monte Rosa?!), I was too late on my way back for the sunset.
A. Mootoovaloo, B. Bassett, and M. Kunz just arXived a paper on the computation of Bayes factors by the Savage-Dickey representation through a supermodel (or encompassing model). (I wonder why Savage-Dickey is so popular in astronomy and cosmology statistical papers and not so much elsewhere.) Recall that the trick is to write the Bayes factor in favour of the encompasssing model as the ratio of the posterior and of the prior for the tested parameter (thus eliminating nuisance or common parameters) at its null value,
Modulo some continuity constraints on the prior density, and the assumption that the conditional prior on nuisance parameter is the same under the null model and the encompassing model [given the null value φ⁰]. If this sounds confusing or even shocking from a mathematical perspective, check the numerous previous entries on this topic on the ‘Og!
The supermodel created by the authors is a mixture of the original models, as in our paper, and… hold the presses!, it is a mixture of the likelihood functions, as in Phil O’Neill’s and Theodore Kypraios’ paper. Which is not mentioned in the current paper and should obviously be. In the current representation, the posterior distribution on the mixture weight α is a linear function of α involving both evidences, α(m¹-m²)+m², times the artificial prior on α. The resulting estimator of the Bayes factor thus shares features with bridge sampling, reversible jump, and the importance sampling version of nested sampling we developed in our Biometrika paper. In addition to O’Neill and Kypraios’s solution.
The following quote is inaccurate since the MCMC algorithm needs simulating the parameters of the compared models in realistic settings, hence representing the multidimensional integrals by Monte Carlo versions.
“Though we have a clever way of avoiding multidimensional integrals to calculate the Bayesian Evidence, this new method requires very efficient sampling and for a small number of dimensions is not faster than individual nested sampling runs.”
I actually wonder at the sheer rationale of running an intensive MCMC sampler in such a setting, when the weight α is completely artificial. It is only used to jump from one model to the next, which sound quite inefficient when compared with simulating from both models separately and independently. This approach can also be seen as a special case of Carlin’s and Chib’s (1995) alternative to reversible jump. Using instead the Savage-Dickey representation is of course infeasible. Which makes the overall reference to this method rather inappropriate in my opinion. Further, the examples processed in the paper all involve (natural) embedded models where the original Savage-Dickey approach applies. Creating an additional model to apply a pseudo-Savage-Dickey representation does not sound very compelling…
Incidentally, the paper also includes a discussion of a weird notion, the likelihood of the Bayes factor, B¹², which is plotted as a distribution in B¹², most strangely. The only other place I met this notion is in Murray Aitkin’s book. Something’s unclear there or in my head!
“One of the fundamental choices when using the supermodel approach is how to deal with common parameters to the two models.”
This is an interesting question, although maybe not so relevant for the Bayes factor issue where it should not matter. However, as in our paper, multiplying the number of parameters in the encompassing model may hinder convergence of the MCMC chain or reduce the precision of the approximation of the Bayes factor. Again, from a Bayes factor perspective, this does not matter [while it does in our perspective].
“This formulation reveals an interesting connection between multiple hypothesis testing and mixture modelling with the class labels corresponding to the accepted hypotheses in each test.”
After my seminar at Monash University last Friday, David Dowe pointed out to me the recent work by Enes Makalic and Daniel Schmidt on minimum description length (MDL) methods for multiple testing as somewhat related to our testing by mixture paper. Work which appeared in the proceedings of the 4th Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-11), that took place in Helsinki, Finland, in 2011. Minimal encoding length approaches lead to choosing the model that enjoys the smallest coding length. Connected with, e.g., Rissannen‘s approach. The extension in this paper consists in considering K hypotheses at once on a collection of m datasets (the multiple then bears on the datasets rather than on the hypotheses). And to associate an hypothesis index to each dataset. When the objective function is the sum of (generalised) penalised likelihoods [as in BIC], it leads to selecting the “minimal length” model for each dataset. But the authors introduce weights or probabilities for each of the K hypotheses, which indeed then amounts to a mixture-like representation on the exponentiated codelengths. Which estimation by optimal coding was first proposed by Chris Wallace in his book. This approach eliminates the model parameters at an earlier stage, e.g. by maximum likelihood estimation, to return a quantity that only depends on the model index and the data. In fine, the purpose of the method differs from ours in that the former aims at identifying an appropriate hypothesis for each group of observations, rather than ranking those hypotheses for the entire dataset by considering the posterior distribution of the weights in the later. The mixture has somehow more of a substance in the first case, where separating the datasets into groups is part of the inference.
As we were working on the Handbook of mixture analysis with Sylvia Früwirth-Schnatter and Gilles Celeux today, near Saint-Germain des Près, I realised that there was a mistake in our 1990 mixture paper with Jean Diebolt [published in 1994], in that when we are proposing to use improper “Jeffreys” priors under the restriction that no component of the Gaussian mixture is “empty”, meaning that there are at least two observations generated from each component, the likelihood needs to be renormalised to be a density for the sample. This normalisation constant only depends on the weights of the mixture, which means that, when simulating from the full conditional distribution of the weights, there should be an extra-acceptance step to account for this correction. Of course, the term is essentially equal to one for a large enough sample but this remains a mistake nonetheless! It is funny that it remained undetected for so long in my most cited paper. Checking on Larry’s 1999 paper exploring the idea of excluding terms from the likelihood to allow for improper priors, I did not spot him using a correction either.
Sida Wang, Arun Tejasvi, and Chaganty Percy Liang have just arXived a paper about using the method of moments to estimate mixtures of distributions. Method that was introduced (?) by Pearson in 1894 for a Gaussian mixture and crab data. And studied in fair details by Bruce Lindsay and his co-authors, including his book, which makes it the more surprising that Bruce’s work is not mentioned at all in the paper. In particular the 1989 Annals of Statistics paper which connects the number of components with the rank of a moment matrix in exponential family and which made a strong impression on me at the time, just when I was starting to work on mixtures. The current paper addresses more specifically the combinatoric difficulty of solving the moment equation. The solution proceeds via a relaxed convex optimisation problem involving a moment matrix, the relaxation removing the rank condition that identifies the parameters of the mixture. While I am no expert in the resolution of the associated eigenvalue problem (Algorithm 1), I wonder at (i) the existence and convergence of a solution when using empirical moments. And (ii) the impact of the choice of the moment equations, on both existence and efficiency of the moment method. It is clearly not invariant by reparameterisation, hence parameterisation matters. It is even unclear to me how many terms should be used in the resolution: if a single dimension is acceptable, determining this dimension may prove a complex issue.