Archive for mixtures of distributions

1500 nuances of gan [gan gan style]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on February 16, 2018 by xi'an

I recently realised that there is a currently very popular trend in machine learning called GAN [for generative adversarial networks] that strongly connects with ABC, at least in that it relies mostly on the availability of a generative model, i.e., a probability model that can be generated as in x=G(ϵ;θ), to draw inference about θ [or predictions]. For instance, there was a GANs tutorial at NIPS 2016 by Ian Goodfellow and many talks on the topic at recent NIPS, the 1500 in the title referring to the citations of the GAN paper by Goodfellow et al. (2014). (The name adversarial comes from opposing true model to generative model in the inference. )

If you remember Jeffreys‘s famous pique about classical tests as being based on improbable events that did not happen, GAN, like ABC,  is sort of the opposite in that it generates events until the one that was observed happens. More precisely, by generating pseudo-samples and switching parameters θ until these samples get as confused as possible between the data generating (“true”) distribution and the generative one. (In its original incarnation, GAN is indeed an optimisation scheme in θ.) A basic presentation of GAN is that it constructs a function D(x,ϕ) that represents the probability that x came from the true model p versus the generative model, ϕ being the parameter of a neural network trained to this effect, aimed at minimising in ϕ a two-term objective function

E[log D(x,ϕ)]+E[log(1D(G(ϵ;θ),ϕ))]

where the first expectation is taken under the true model and the second one under the generative model.

“The discriminator tries to best distinguish samples away from the generator. The generator tries to produce samples that are indistinguishable by the discriminator.” Edward

One ABC perception of this technique is that the confusion rate

E[log(1D(G(ϵ;θ),ϕ))]

is a form of distance between the data and the generative model. Which expectation can be approximated by repeated simulations from this generative model. Which suggests an extension from the optimisation approach to a ABCyesian version by selecting the smallest distances across a range of θ‘s simulated from the prior.

This notion relates to solution using classification tools as density ratio estimation, connecting for instance to Gutmann and Hyvärinen (2012). And ultimately with Geyer’s 1992 normalising constant estimator.

Another link between ABC and networks also came out during that trip. Proposed by Bishop (1994), mixture density networks (MDN) are mixture representations of the posterior [with component parameters functions of the data] trained on the prior predictive through a neural network. These MDNs can be trained on the ABC learning table [based on a specific if redundant choice of summary statistics] and used as substitutes to the posterior distribution, which brings an interesting alternative to Simon Wood’s synthetic likelihood. In a paper I missed Papamakarios and Murray suggest replacing regular ABC with this version…

LaTeX issues from Vienna

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on September 21, 2017 by xi'an

When working on the final stage of our edited handbook on mixtures, in Vienna, I came across unexpected practical difficulties! One was that by working on Dropbox with Windows users, files and directories names suddenly switched from upper case to lower cases letters !, making hard-wired paths to figures and subsections void in the numerous LaTeX files used for the book. And forcing us to change to lower cases everywhere. Having not worked under Windows since George Casella gave me my first laptop in the mid 90’s!, I am amazed that this inability to handle both upper and lower names is still an issue. And that Dropbox replicates it. (And that some people see that as a plus.)

The other LaTeX issue that took a while to solve was that we opted for one chapter one bibliography, rather than having a single bibliography at the end of the book, mainly because CRC Press asked for this feature in order to sell chapters individually… This was my first encounter with this issue and I found the solutions to produce individual bibliographies incredibly heavy handed, whether through chapterbib or bibunits, since one has to bibtex one .aux file for each chapter. Even with a one line bash command,

for f in bu*aux; do bibtex `basename $f .aux`; done

this is annoying in the extreme!

zurück nach Wien

Posted in pictures, Running, Statistics, Travel, University life, Wines with tags , , , , , , , , on September 16, 2017 by xi'an

Today, I am travelling to Vienna for a few days, primarily for assessing a grant renewal for a research consortium federating most Austrian research groups on a topic for which Austria is a world-leader. (Sorry for being cryptic but I am unsure how much I can disclose about this assessment!) And taking advantage on being in Vienna, for a two-day editing session with Sylvia Früwirth-Schnatter and Gilles Celeux on our Handbook of mixtures analysis project. Which started a few years ago with another meeting in Vienna. And taking further advantage on being in Vienna, for an evening at the Volksoper, conveniently playing Die Zauberflöte!

Jeffreys priors for mixtures [or not]

Posted in Books, Statistics, University life with tags , , , , , on July 25, 2017 by xi'an

Clara Grazian and I have just arXived [and submitted] a paper on the properties of Jeffreys priors for mixtures of distributions. (An earlier version had not been deemed of sufficient interest by Bayesian Analysis.) In this paper, we consider the formal Jeffreys prior for a mixture of Gaussian distributions and examine whether or not it leads to a proper posterior with a sufficient number of observations.  In general, it does not and hence cannot be used as a reference prior. While this is a negative result (and this is why Bayesian Analysis did not deem it of sufficient importance), I find it definitely relevant because it shows that the default reference prior [in the sense that the Jeffreys prior is the primary choice in nonparametric settings] does not operate in this wide class of distributions. What is surprising is that the use of a Jeffreys-like prior on a global location-scale parameter (as in our 1996 paper with Kerrie Mengersen or our recent work with Kaniav Kamary and Kate Lee) remains legit if proper priors are used on all the other parameters. (This may be yet another illustration of the tequilla-like toxicity of mixtures!)

Francisco Rubio and Mark Steel already exhibited this difficulty of the Jeffreys prior for mixtures of densities with disjoint supports [which reveals the mixture latent variable and hence turns the problem into something different]. Which relates to another point of interest in the paper, derived from a 1988 [Valencià Conference!] paper by José Bernardo and Javier Giròn, where they show the posterior associated with a Jeffreys prior on a mixture is proper when (a) only estimating the weights p and (b) using densities with disjoint supports. José and Javier use in this paper an astounding argument that I had not seen before and which took me a while to ingest and accept. Namely, the Jeffreys prior on a observed model with latent variables is bounded from above by the Jeffreys prior on the corresponding completed model. Hence if the later leads to a proper posterior for the observed data, so does the former. Very smooth, indeed!!!

Actually, we still support the use of the Jeffreys prior but only for the mixture mixtures, because it has the property supported by Judith and Kerrie of a conservative prior about the number of components. Obviously, we cannot advocate its use over all the parameters of the mixture since it then leads to an improper posterior.

exciting week[s]

Posted in Mountains, pictures, Running, Statistics with tags , , , , , , , , , , , , , , on June 27, 2017 by xi'an

The past week was quite exciting, despite the heat wave that hit Paris and kept me from sleeping and running! First, I made a two-day visit to Jean-Michel Marin in Montpellier, where we discussed the potential Peer Community In Computational Statistics (PCI Comput Stats) with the people behind PCI Evol Biol at INRA, Hopefully taking shape in the coming months! And went one evening through a few vineyards in Saint Christol with Jean-Michel and Arnaud. Including a long chat with the owner of Domaine Coste Moynier. [Whose domain includes the above parcel with views of Pic Saint-Loup.] And last but not least! some work planning about approximate MCMC.

On top of this, we submitted our paper on ABC with Wasserstein distances [to be arXived in an extended version in the coming weeks], our revised paper on ABC consistency thanks to highly constructive and comments from the editorial board, which induced a much improved version in my opinion, and we received a very positive return from JCGS for our paper on weak priors for mixtures! Next week should be exciting as well, with BNP 11 taking place in downtown Paris, at École Normale!!!

parameter space for mixture models

Posted in Statistics, University life with tags , , , on March 24, 2017 by xi'an

“The paper defines a new solution to the problem of defining a suitable parameter space for mixture models.”

When I received the table of contents of the incoming Statistics & Computing and saw a paper by V. Maroufy and P. Marriott about the above, I was quite excited about a new approach to mixture parameterisation. Especially after our recent reposting of the weakly informative reparameterisation paper. Alas, after reading the paper, I fail to see the (statistical) point of the whole exercise.

Starting from the basic fact that mixtures face many identifiability issues, not only invariance by component permutation, but the possibility to add spurious components as well, the authors move to an entirely different galaxy by defining mixtures of so-called local mixtures. Developed by one of the authors. The notion is just incomprehensible for me: the object is a weighted sum of the basic component of the original mixture, e.g., a Normal density, and of k of its derivatives wrt its mean, a sort of parameterised Taylor expansion. Which implies the parameter is unidimensional, incidentally. The weights of this strange mixture are furthermore constrained by the positivity of the resulting mixture, a constraint that seems impossible to satisfy in the Normal case when the number of derivatives is odd. And hard to analyse in any case since possibly negative components do not enjoy an interpretation as a probability density. In exponential families, the local mixture is the original exponential family density multiplied by a polynomial. The current paper moves one step further [from the reasonable] by considering mixtures [in the standard sense] of such objects. Which components are parameterised by their mean parameter and a collection of weights. The authors then restrict the mean parameters to belong to a finite and fixed set, which elements are coerced by a maximum error rate on any compound distribution derived from this exponential family structure. The remainder of the paper discusses of the choice of the mean parameters and of an EM algorithm to estimate the parameters, with a confusing lower bound on the mixture weights that impacts the estimation of the weights. And no mention made of the positivity constraint. I remain completely bemused by the paper and its purpose: I do not even fathom how this qualifies as a mixture.

a response by Ly, Verhagen, and Wagenmakers

Posted in Statistics with tags , , , , , , , , on March 9, 2017 by xi'an

Following my demise [of the Bayes factor], Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers wrote a very detailed response. Which I just saw the other day while in Banff. (If not in Schiphol, which would have been more appropriate!)

“In this rejoinder we argue that Robert’s (2016) alternative view on testing has more in common with Jeffreys’s Bayes factor than he suggests, as they share the same ‘‘shortcomings’’.”

Rather unsurprisingly (!), the authors agree with my position on the dangers to ignore decisional aspects when using the Bayes factor. A point of dissension is the resolution of the Jeffreys[-Lindley-Bartlett] paradox. One consequence derived by Alexander and co-authors is that priors should change between testing and estimating. Because the parameters have a different meaning under the null and under the alternative, a point I agree with in that these parameters are indexed by the model [index!]. But with which I disagree when arguing that the same parameter (e.g., a mean under model M¹) should have two priors when moving from testing to estimation. To state that the priors within the marginal likelihoods “are not designed to yield posteriors that are good for estimation” (p.45) amounts to wishful thinking. I also do not find a strong justification within the paper or the response about choosing an improper prior on the nuisance parameter, e.g. σ, with the same constant. Another a posteriori validation in my opinion. However, I agree with the conclusion that the Jeffreys paradox prohibits the use of an improper prior on the parameter being tested (or of the test itself). A second point made by the authors is that Jeffreys’ Bayes factor is information consistent, which is correct but does not solved my quandary with the lack of precise calibration of the object, namely that alternatives abound in a non-informative situation.

“…the work by Kamary et al. (2014) impressively introduces an alternative view on testing, an algorithmic resolution, and a theoretical justification.”

The second part of the comments is highly supportive of our mixture approach and I obviously appreciate very much this support! Especially if we ever manage to turn the paper into a discussion paper! The authors also draw a connection with Harold Jeffreys’ distinction between testing and estimation, based upon Laplace’s succession rule. Unbearably slow succession law. Which is well-taken if somewhat specious since this is a testing framework where a single observation can send the Bayes factor to zero or +∞. (I further enjoyed the connection of the Poisson-versus-Negative Binomial test with Jeffreys’ call for common parameters. And the supportive comments on our recent mixture reparameterisation paper with Kaniav Kamari and Kate Lee.) The other point that the Bayes factor is more sensitive to the choice of the prior (beware the tails!) can be viewed as a plus for mixture estimation, as acknowledged there. (The final paragraph about the faster convergence of the weight α is not strongly