Archive for mixtures of distributions

parameter space for mixture models

Posted in Statistics, University life on March 24, 2017 by xi'an

“The paper defines a new solution to the problem of defining a suitable parameter space for mixture models.”

When I received the table of contents of the incoming Statistics & Computing and saw a paper by V. Maroufy and P. Marriott about the above, I was quite excited about a new approach to mixture parameterisation. Especially after our recent reposting of the weakly informative reparameterisation paper. Alas, after reading the paper, I fail to see the (statistical) point of the whole exercise.

Starting from the basic fact that mixtures face many identifiability issues, not only invariance by component permutation, but the possibility to add spurious components as well, the authors move to an entirely different galaxy by defining mixtures of so-called local mixtures. Developed by one of the authors. The notion is just incomprehensible to me: the object is a weighted sum of the basic component of the original mixture, e.g., a Normal density, and of k of its derivatives wrt its mean, a sort of parameterised Taylor expansion. Which implies the parameter is unidimensional, incidentally. The weights of this strange mixture are furthermore constrained by the positivity of the resulting mixture, a constraint that seems impossible to satisfy in the Normal case when the number of derivatives is odd. And hard to analyse in any case since possibly negative components do not enjoy an interpretation as a probability density. In exponential families, the local mixture is the original exponential family density multiplied by a polynomial. The current paper moves one step further [from the reasonable] by considering mixtures [in the standard sense] of such objects. Which components are parameterised by their mean parameter and a collection of weights. The authors then restrict the mean parameters to belong to a finite and fixed set, whose elements are coerced by a maximum error rate on any compound distribution derived from this exponential family structure. The remainder of the paper discusses the choice of the mean parameters and of an EM algorithm to estimate the parameters, with a confusing lower bound on the mixture weights that impacts the estimation of the weights. And no mention made of the positivity constraint. I remain completely bemused by the paper and its purpose: I do not even fathom how this qualifies as a mixture.
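
For concreteness, my reading of the local mixture construction, in my own notation and hence only a sketch rather than the authors' exact definition, is

\[ g_k(x;\mu,\lambda) \;=\; f(x;\mu) \;+\; \sum_{j=1}^{k} \lambda_j\,\frac{\partial^j}{\partial\mu^j}\,f(x;\mu), \qquad \lambda\in\Lambda_\mu=\Bigl\{\lambda:\ g_k(x;\mu,\lambda)\ge 0\ \text{for all }x\Bigr\}, \]

with f(·;μ) e.g. a Normal density with mean μ, which makes explicit both the unidimensional nature of μ and the positivity constraint on the weights λ.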

a response by Ly, Verhagen, and Wagenmakers

Posted in Statistics on March 9, 2017 by xi'an

Following my demise [of the Bayes factor], Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers wrote a very detailed response. Which I just saw the other day while in Banff. (If not in Schiphol, which would have been more appropriate!)

“In this rejoinder we argue that Robert’s (2016) alternative view on testing has more in common with Jeffreys’s Bayes factor than he suggests, as they share the same ‘‘shortcomings’’.”

Rather unsurprisingly (!), the authors agree with my position on the dangers of ignoring decisional aspects when using the Bayes factor. A point of dissension is the resolution of the Jeffreys[-Lindley-Bartlett] paradox. One consequence derived by Alexander and co-authors is that priors should change between testing and estimating. Because the parameters have a different meaning under the null and under the alternative, a point I agree with in that these parameters are indexed by the model [index!]. But with which I disagree when it is used to argue that the same parameter (e.g., a mean under model M¹) should have two priors when moving from testing to estimation. To state that the priors within the marginal likelihoods “are not designed to yield posteriors that are good for estimation” (p.45) amounts to wishful thinking. I also do not find a strong justification within the paper or the response for choosing an improper prior on the nuisance parameter, e.g. σ, with the same constant. Another a posteriori validation in my opinion. However, I agree with the conclusion that the Jeffreys paradox prohibits the use of an improper prior on the parameter being tested (or of the test itself). A second point made by the authors is that Jeffreys’ Bayes factor is information consistent, which is correct but does not solve my quandary with the lack of precise calibration of the object, namely that alternatives abound in a non-informative situation.
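
As a reminder of why the paradox bars improper or arbitrarily diffuse priors on the tested parameter, here is the textbook Normal illustration (standard material, not taken from the response): with x̄ the mean of n observations from N(θ,σ²), σ² known, and testing H₀: θ=0 against θ∼N(0,τ²) under the alternative,

\[ B_{01}(\tau)\;=\;\sqrt{1+\frac{n\tau^2}{\sigma^2}}\;\exp\Bigl\{-\frac{n^2\tau^2\,\bar{x}^2}{2\sigma^2(\sigma^2+n\tau^2)}\Bigr\}\;\longrightarrow\;\infty\qquad\text{as }\tau\to\infty, \]

whatever the value of x̄, so that letting the prior scale grow (or taking the improper limit) mechanically favours the null.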

“…the work by Kamary et al. (2014) impressively introduces an alternative view on testing, an algorithmic resolution, and a theoretical justification.”

The second part of the comments is highly supportive of our mixture approach and I obviously appreciate very much this support! Especially if we ever manage to turn the paper into a discussion paper! The authors also draw a connection with Harold Jeffreys’ distinction between testing and estimation, based upon Laplace’s succession rule. Unbearably slow succession law. Which is well-taken if somewhat specious since this is a testing framework where a single observation can send the Bayes factor to zero or +∞. (I further enjoyed the connection of the Poisson-versus-Negative Binomial test with Jeffreys’ call for common parameters. And the supportive comments on our recent mixture reparameterisation paper with Kaniav Kamary and Kate Lee.) The other point that the Bayes factor is more sensitive to the choice of the prior (beware the tails!) can be viewed as a plus for mixture estimation, as acknowledged there. (The final paragraph about the faster convergence of the weight α is not strongly…)

weakly informative reparameterisations for location-scale mixtures

Posted in Books, pictures, R, Statistics, University life on January 19, 2017 by xi'an

We have been working towards a revision of our reparameterisation paper for quite a while now and took advantage of Kate Lee visiting Paris this fortnight to make a final round: we have now arXived (and submitted) the new version. The major change from the earlier version is the extension of the approach to a large class of models that include infinitely divisible distributions, compound Gaussian, Poisson, and exponential distributions, and completely monotonic densities. The concept remains identical: change the parameterisation of a mixture from a component-wise decomposition to a construct made of the first moment(s) of the distribution and of component-wise objects constrained by the moment equation(s). There is of course a bijection between both parameterisations, but the constraints appearing in the latter produce compact parameter spaces for which (different) uniform priors can be proposed. While the resulting posteriors are no longer conjugate, even conditional on the latent variables, standard Metropolis algorithms can be implemented to produce Monte Carlo approximations of these posteriors.
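
As a toy illustration of the general idea, and not the actual construction or constraints of the paper, one can reparameterise a two-component Gaussian location mixture through its overall mean and a single offset, and run a plain random-walk Metropolis sampler in the new parameterisation; the sketch below uses made-up vague priors and a logit transform for the weight:

## toy sketch: p N(mu1,1) + (1-p) N(mu2,1) reparameterised by its overall
## mean mu = p mu1 + (1-p) mu2 and an offset delta, with
## mu1 = mu + (1-p) delta and mu2 = mu - p delta
set.seed(1)
x <- c(rnorm(150, -1), rnorm(50, 2))                       # simulated data
logpost <- function(theta, x) {
  mu <- theta[1]; delta <- theta[2]; p <- plogis(theta[3]) # weight mapped back to (0,1)
  mu1 <- mu + (1 - p) * delta
  mu2 <- mu - p * delta
  sum(log(p * dnorm(x, mu1) + (1 - p) * dnorm(x, mu2))) +  # mixture log-likelihood
    dnorm(mu, 0, 10, log = TRUE) + dnorm(delta, 0, 10, log = TRUE) +
    dlogis(theta[3], log = TRUE)                           # uniform prior on p after the logit change
}
niter <- 5e3
chain <- matrix(NA, niter, 3)
theta <- c(mean(x), 1, 0)                                  # crude starting value
for (t in 1:niter) {
  prop <- theta + rnorm(3, sd = .1)                        # random-walk proposal
  if (log(runif(1)) < logpost(prop, x) - logpost(theta, x)) theta <- prop
  chain[t, ] <- theta
}

The actual paper derives compact parameter spaces from the moment constraints and sets (different) uniform priors on those, which this crude sketch does not attempt.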

ratio-of-uniforms [#2]

Posted in Books, R, Statistics on October 31, 2016 by xi'an

Following my earlier post on Kinderman’s and Monahan’s (1977) ratio-of-uniform method, I must confess I remain quite puzzled by the approach. Or rather by its consequences. When looking at the set A of (u,v)’s in R⁺×X such that 0≤u²≤ƒ(v/u), as discussed in the previous post, it can be represented by its parameterised boundary

u(x)=√ƒ(x), v(x)=x√ƒ(x),    x in X

Similarly, since the simulation from ƒ(v/u) can also be derived [check Luc Devroye’s Non-uniform random variate generation, exercise section 7.3] from a uniform on the set B of (u,v)’s in R⁺×X such that 0≤u≤ƒ(v+u), on the set C of (u,v)’s in R⁺×X such that 0≤u³≤ƒ(v/√u)², or on the set D of (u,v)’s in R⁺×X such that 0≤u²≤ƒ(v/u), which is actually exactly the same as A [and presumably many other versions, for which I would like to guess the generic rule of construction], there are many sets on which one can consider running simulations. And which one to pick for optimality?! Here are the three sets for a mixture of two normal densities:
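
As a small complement, here is a minimal R sketch of mine (not from the post) for simulating from such a two-normal mixture by uniform sampling over a bounding box of the first set A, the box being given by the extrema of the boundary functions u(x)=√ƒ(x) and v(x)=x√ƒ(x) above:

f <- function(x) .5 * dnorm(x, -2) + .5 * dnorm(x, 2)       # two-normal mixture target
umax <- sqrt(optimize(f, c(-10, 10), maximum = TRUE)$objective)
vfun <- function(x) x * sqrt(f(x))
vmin <- optimize(vfun, c(-10, 10))$objective
vmax <- optimize(vfun, c(-10, 10), maximum = TRUE)$objective
N <- 1e5
u <- runif(N, 0, umax); v <- runif(N, vmin, vmax)           # uniform draws over the bounding box
keep <- u^2 <= f(v / u)                                     # keep the (u,v)'s falling within A
sims <- (v / u)[keep]                                       # accepted ratios are draws from ƒ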

For instance, assuming slice sampling is feasible on every one of those three sets, which one is the most efficient? While I have no clear answer to this question, I found on Sunday night that a generic family of transforms is indexed by a differentiable monotone function h over the positive half-line, with the uniform distribution being taken over the set

H={(u,v); 0≤u≤h(ƒ(v/g(u)))}

when the primitive G of g is the inverse of h, i.e., G(h(x))=x. [Here are the slides I gave at the Warwick reading group on Devroye’s book last week:]
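
And a quick numerical check of this generic rule, again my own toy experiment rather than anything from the slides: take h(y)=y^(1/3), so that its inverse G(u)=u³ is the primitive of g(u)=3u², and simulate uniformly over H by accept-reject on a bounding box; the ratios x=v/g(u) should then again be distributed from ƒ:

f <- function(x) .5 * dnorm(x, -2) + .5 * dnorm(x, 2)       # same two-normal mixture target
h <- function(y) y^(1/3); g <- function(u) 3 * u^2          # G(u)=u^3 is indeed the inverse of h
umax <- h(optimize(f, c(-10, 10), maximum = TRUE)$objective)
vbnd <- function(x) x * g(h(f(x)))                          # extreme v's on the boundary of H
vmin <- optimize(vbnd, c(-10, 10))$objective
vmax <- optimize(vbnd, c(-10, 10), maximum = TRUE)$objective
N <- 1e5
u <- runif(N, 0, umax); v <- runif(N, vmin, vmax)           # uniform draws over the bounding box
keep <- u <= h(f(v / g(u)))                                 # keep the (u,v)'s falling within H
check <- (v / g(u))[keep]                                   # again draws from ƒ, e.g. hist(check)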

A Milano [not jatp]

Posted in Kids, Mountains, pictures, Statistics, Travel, University life, Wines on October 7, 2016 by xi'an

Today, I went to Milano for 13 hours to give a seminar at l’Università Bocconi. Where I thus gave a talk on Testing via mixtures (using the same slides as at ISBA last Spring). It was the first time I was in Milano (and thus at Bocconi) for more than a transfer to MCMski or to Pavia and it was great to walk through the city. And of course to meet and share with many friends there. While I glimpsed the end of the sunrise on the Italian Alps (near Monte Rosa?!), I was too late on my way back for the sunset.

Savage-Dickey supermodels

Posted in Books, Mountains, pictures, Statistics, Travel, University life on September 13, 2016 by xi'an

[Combination picture: Aymara indigenous women (L-R) Domitila Alana, 42, Bertha Vedia, 48, Lidia Huayllas, 48, and Dora Magueno, 50, posing for a photograph at the Huayna Potosi mountain, Bolivia, April 6, 2016. (c.) REUTERS/David Mercado]

A. Mootoovaloo, B. Bassett, and M. Kunz just arXived a paper on the computation of Bayes factors by the Savage-Dickey representation through a supermodel (or encompassing model). (I wonder why Savage-Dickey is so popular in astronomy and cosmology statistical papers and not so much elsewhere.) Recall that the trick is to write the Bayes factor of the null model against the encompassing model as the ratio of the posterior and of the prior for the tested parameter (thus eliminating nuisance or common parameters) at its null value,

B⁰¹=π(φ⁰|x)/π(φ⁰).

Modulo some continuity constraints on the prior density, and the assumption that the conditional prior on the nuisance parameter is the same under the null model and the encompassing model [given the null value φ⁰]. If this sounds confusing or even shocking from a mathematical perspective, check the numerous previous entries on this topic on the ‘Og!
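
For the record, this matching condition and the resulting representation are usually written as (in my notation, with η the nuisance parameter)

\[ \pi_1(\eta\mid\varphi=\varphi^0)=\pi_0(\eta)\qquad\Longrightarrow\qquad B_{01}=\frac{m_0(x)}{m_1(x)}=\frac{\pi_1(\varphi^0\mid x)}{\pi_1(\varphi^0)}, \]

which is the identity displayed above.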

The supermodel created by the authors is a mixture of the original models, as in our paper, and… hold the presses!, it is a mixture of the likelihood functions, as in Phil O’Neill’s and Theodore Kypraios’ paper. Which is not mentioned in the current paper and should obviously be. In the current representation, the posterior distribution on the mixture weight α is a linear function of α involving both evidences, α(m¹-m²)+m², times the artificial prior on α. The resulting estimator of the Bayes factor thus shares features with bridge sampling, reversible jump, and the importance sampling version of nested sampling we developed in our Biometrika paper. In addition to O’Neill and Kypraios’s solution.
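
Spelling out this linearity, as I understand the representation (so a reconstruction rather than a quote from the paper): if the supermodel likelihood is the convex combination of the two likelihoods with weight α, integrating out the respective model parameters gives

\[ \pi(\alpha\mid x)\;\propto\;\bigl\{\alpha\,m_1(x)+(1-\alpha)\,m_2(x)\bigr\}\,\pi(\alpha)\;=\;\bigl\{\alpha\,(m_1(x)-m_2(x))+m_2(x)\bigr\}\,\pi(\alpha), \]

so the evidences m₁ and m₂, and hence the Bayes factor, only enter the posterior on α through this linear term.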

The following quote is inaccurate since, in realistic settings, the MCMC algorithm needs to simulate the parameters of the compared models, hence to represent the multidimensional integrals by Monte Carlo versions.

“Though we have a clever way of avoiding multidimensional integrals to calculate the Bayesian Evidence, this new method requires very efficient sampling and for a small number of dimensions is not faster than individual nested sampling runs.”

I actually wonder at the sheer rationale of running an intensive MCMC sampler in such a setting, when the weight α is completely artificial. It is only used to jump from one model to the next, which sounds quite inefficient when compared with simulating from both models separately and independently. This approach can also be seen as a special case of Carlin’s and Chib’s (1995) alternative to reversible jump. Using instead the Savage-Dickey representation is of course infeasible. Which makes the overall reference to this method rather inappropriate in my opinion. Further, the examples processed in the paper all involve (natural) embedded models where the original Savage-Dickey approach applies. Creating an additional model to apply a pseudo-Savage-Dickey representation does not sound very compelling…

Incidentally, the paper also includes a discussion of a weird notion, the likelihood of the Bayes factor, B¹², which is plotted as a distribution in B¹², most strangely. The only other place I met this notion is in Murray Aitkin’s book. Something’s unclear there or in my head!

“One of the fundamental choices when using the supermodel approach is how to deal with common parameters to the two models.”

This is an interesting question, although maybe not so relevant for the Bayes factor issue where it should not matter. However, as in our paper, multiplying the number of parameters in the encompassing model may hinder convergence of the MCMC chain or reduce the precision of the approximation of the Bayes factor. Again, from a Bayes factor perspective, this does not matter [while it does in our perspective].

MDL multiple hypothesis testing

Posted in Books, pictures, Statistics, Travel, University life on September 1, 2016 by xi'an

“This formulation reveals an interesting connection between multiple hypothesis testing and mixture modelling with the class labels corresponding to the accepted hypotheses in each test.”

After my seminar at Monash University last Friday, David Dowe pointed out to me the recent work by Enes Makalic and Daniel Schmidt on minimum description length (MDL) methods for multiple testing as somewhat related to our testing by mixture paper. Work which appeared in the proceedings of the 4th Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-11), that took place in Helsinki, Finland, in 2011. Minimal encoding length approaches lead to choosing the model that enjoys the smallest coding length. Connected with, e.g., Rissanen’s approach. The extension in this paper consists in considering K hypotheses at once on a collection of m datasets (the multiple then bears on the datasets rather than on the hypotheses). And in associating an hypothesis index with each dataset. When the objective function is the sum of (generalised) penalised likelihoods [as in BIC], it leads to selecting the “minimal length” model for each dataset. But the authors introduce weights or probabilities for each of the K hypotheses, which indeed then amounts to a mixture-like representation on the exponentiated codelengths. An estimation by optimal coding that was first proposed by Chris Wallace in his book. This approach eliminates the model parameters at an earlier stage, e.g. by maximum likelihood estimation, to return a quantity that only depends on the model index and the data. In fine, the purpose of the method differs from ours in that the former aims at identifying an appropriate hypothesis for each group of observations, rather than ranking those hypotheses for the entire dataset by considering the posterior distribution of the weights in the latter. The mixture thus has somehow more substance in the first case, where separating the datasets into groups is part of the inference.
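
To make the connection concrete, here is a toy caricature of mine, with BIC-style penalties standing in for the authors’ exact MDL codelengths: m datasets, two hypotheses per dataset, and either a hard minimal-length assignment or mixture weights estimated on the exponentiated codelengths:

## toy caricature: H1: N(0,1) versus H2: N(mu,1) with mu estimated, codelengths
## approximated by penalised (BIC-like) negative log-likelihoods
set.seed(2)
m <- 50; n <- 30
mus <- ifelse(runif(m) < .3, 1.5, 0)                 # about 30% of the datasets are shifted
data <- lapply(mus, function(mu) rnorm(n, mu))
codelen <- function(x) {
  c(H1 = -sum(dnorm(x, 0, 1, log = TRUE)),                              # no free parameter
    H2 = -sum(dnorm(x, mean(x), 1, log = TRUE)) + .5 * log(length(x)))  # one free parameter
}
L <- t(sapply(data, codelen))                        # m x 2 matrix of codelengths
hard <- apply(L, 1, which.min)                       # minimal-length hypothesis per dataset
## mixture-weight version: EM on the weights, exp(-codelength) acting as likelihood
w <- rep(.5, 2)
for (it in 1:100) {
  r <- exp(-(L - apply(L, 1, min))) * rep(w, each = m)
  r <- r / rowSums(r)                                # responsibilities of each hypothesis
  w <- colMeans(r)                                   # updated hypothesis weights
}

The weights w then act as hypothesis probabilities across the datasets, while the hard assignment corresponds to the per-dataset minimal-length selection.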