Archive for Wasserstein distance

Implicit maximum likelihood estimates

Posted in Statistics on October 9, 2018 by xi'an

An ‘Og’s reader pointed me to this paper by Li and Malik, which made it to arXiv after not making it to NIPS. While the NIPS reviews were not particularly informative and strongly discordant, the authors point out in the comments that these reviews are available for the sake of promoting discussion. (As made clear in earlier posts, I am quite supportive of this attitude! Disclaimer: I was not involved in the evaluation of this paper, neither for NIPS nor for any other conference or journal!!)

Although the paper does not seem to mention ABC in the setting of implicit likelihoods and generative models, it does refer to the early (1984) paper by Peter Diggle and Richard Gratton that is often seen as the ancestor of ABC methods. The authors point out numerous issues with the solutions proposed for parameter estimation in such implicit models. For instance, for GANs, they signal that “minimizing the Jensen-Shannon divergence or the Wasserstein distance between the empirical data distribution and the model distribution does not necessarily minimize the same between the true data distribution and the model distribution.” (Not to mention the particular difficulty with Bayesian GANs.)

Their own solution is the implicit maximum likelihood estimator, which picks the value of the parameter θ bringing a simulated sample the closest to the observed sample. Closest in the sense of the Euclidean distance between both samples. Or rather of the minimum of that distance over several simulated samples. (The modelling seems to imply the availability of n>1 observed samples.) They advocate a stochastic gradient descent approach for finding the optimal parameter θ, which presupposes that the dependence between θ and the simulated samples is somewhat differentiable. (And this does not account for taking a minimum, which would make differentiation close to impossible.) The paper then meanders into a lengthy discussion as to whether or not maximising the likelihood makes sense, with a rather naïve view on why using the empirical distribution in a Kullback-Leibler divergence does not make sense! What does not make sense, in my opinion, is considering the finite-sample approximation to the Kullback-Leibler divergence with the true distribution.
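Just to fix ideas on what this distance-based estimator amounts to in its crudest form, here is a minimal R sketch of mine (a grid search over a toy Normal mean model, rather than the stochastic gradient approach described above), picking the value of θ whose simulated samples come closest to the observed one in Euclidean distance, with the minimum taken over several simulated samples:

    set.seed(42)
    n <- 100
    xobs <- rnorm(n, mean = 1.7)     # observed sample, "true" mean 1.7

    # Euclidean distance between an observed and a simulated sample
    dsamp <- function(x, y) sqrt(sum((x - y)^2))

    # for each candidate θ, keep the smallest distance over m simulated samples
    thetas <- seq(0, 3, by = 0.01)
    m <- 10
    dmin <- sapply(thetas, function(th)
      min(replicate(m, dsamp(rnorm(n, mean = th), xobs))))
    thetas[which.min(dmin)]          # rough estimate of the mean, up to simulation noise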

sliced Wasserstein estimation of mixtures

Posted in Books, pictures, R, Statistics on November 28, 2017 by xi'an

A paper by Soheil Kolouri and co-authors, arXived last week, is about using the Wasserstein distance for inference on multivariate Gaussian mixtures. The basic concept is that the parameter is estimated by minimising the p-Wasserstein distance to the empirical distribution, smoothed by a Normal kernel. As the general Wasserstein distance is quite costly to compute, the approach relies on a sliced version, which means computing the Wasserstein distance between one-dimensional projections of the two distributions. Optimising over the projection directions adds a further computational constraint.
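For readers unfamiliar with the sliced version, here is a bare-bones R approximation of mine (random projection directions rather than the optimisation over directions mentioned above, and no kernel smoothing), exploiting the fact that the one-dimensional Wasserstein distance between two samples of the same size only involves their order statistics:

    # Monte Carlo approximation of the sliced p-Wasserstein distance
    # between two n x d samples x and y, averaging over L random directions
    sliced_wasserstein <- function(x, y, p = 2, L = 100) {
      d <- ncol(x)
      wpp <- replicate(L, {
        u <- rnorm(d); u <- u / sqrt(sum(u^2))    # uniform direction on the sphere
        px <- sort(x %*% u); py <- sort(y %*% u)  # sorted one-dimensional projections
        mean(abs(px - py)^p)                      # p-th power of the 1-d W_p distance
      })
      mean(wpp)^(1 / p)
    }

    # e.g., two samples of size 200 from the same bivariate Gaussian mixture
    x <- rbind(matrix(rnorm(200), 100, 2), matrix(rnorm(200, 3), 100, 2))
    y <- rbind(matrix(rnorm(200), 100, 2), matrix(rnorm(200, 3), 100, 2))
    sliced_wasserstein(x, y)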

“To fit a finite GMM to the observed data, one is required to answer the following questions: 1) how to estimate the number of mixture components needed to represent the data, and 2) how to estimate the parameters of the mixture components.”

The paper contains a most puzzling comment opposing maximum likelihood estimation to minimum Wasserstein distance estimation on the basis that the latter would not suffer from multimodality. This sounds incorrect, as the multimodality of the likelihood of a mixture model stems from the lack of identifiability of the parameters: if all permutations of these parameters induce exactly the same distribution, they all stand at the same distance from the data distribution, whatever that distance is. Furthermore, the above tartan-like picture clashes with the representation of the log-likelihood of a Normal mixture, as exemplified by the picture below, based on a sample of size 150 from a mixture with means 0 and 2, common unit variance, and weights 0.3 and 0.7, which shows a smooth if bimodal structure. And for the same dataset, my attempt at producing a Wasserstein “energy landscape” (the surface of minus the logarithm of the 2-Wasserstein distance) does return a multimodal structure.

“Jin et al. proved that with random initialization, the EM algorithm will converge to a bad critical point with high probability.”

This statement is most curious in that the “probability” in the assessment must depend on the choice of the random initialisation, hence on a sort of prior distribution that is not made explicit in the paper. Which remains blissfully unaware of Bayesian approaches.

Another [minor mode] puzzling statement is that the p-Wasserstein distance is defined on the space of probability measures with finite p-th moment, which does not make much sense when what matters is rather the finiteness of the expectation of the distance d(X,Y) raised to the power p. A lot of the maths details either do not make sense or seem superfluous.
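For the record, the p-Wasserstein distance between two probability measures μ and ν on a metric space (X,d) is usually written as

$$W_p(\mu,\nu)=\Big(\inf_{\gamma\in\Gamma(\mu,\nu)}\int_{X\times X} d(x,y)^p\,\gamma(\mathrm{d}x,\mathrm{d}y)\Big)^{1/p},$$

where Γ(μ,ν) denotes the set of all couplings of μ and ν, that is, of joint distributions with these two marginals.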

Langevin on a wrong bend

Posted in Books, Statistics on October 19, 2017 by xi'an

Arnak Dalalyan and Avetik Karagulyan (CREST) arXived a paper the other week on a focussed study of the Langevin algorithm [not MALA] when the gradient of the target is incorrect. With the following improvements [quoting non-verbatim from the paper]:

  1. a varying-step Langevin algorithm that reduces the number of iterations needed for a given Wasserstein precision, compared with recent results by, e.g., Alain Durmus and Éric Moulines;
  2. an extension of convergence results for error-prone evaluations of the gradient of the target (i.e., the gradient is replaced with a noisy version, under some moment assumptions that do not include unbiasedness);
  3. a new second-order sampling algorithm termed LMCO’, with improved convergence properties.

What is particularly interesting to me in this setting is the use in all these papers of a discretised Langevin diffusion (a.k.a. a random walk with a drift induced by the gradient of the log-target) without the original Metropolis correction. The results rely on an assumption of [strong?] log-concavity of the target, with “user-friendly” bounds on the Wasserstein distance depending on the constants appearing in this log-concavity constraint. And so does the adaptive step size. (In the case of the noisy version, the bias and the variance of the noise also matter. As pointed out by the authors, this keeps the results applicable to scaling MCMC for large samples, beyond pseudo-marginal situations.)
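For concreteness, here is a minimal R sketch of mine of this uncorrected (unadjusted) Langevin algorithm, with a standard Gaussian target, a constant hand-picked step size, and an optional Gaussian perturbation of the gradient standing in for the error-prone case of item 2 above:

    # unadjusted Langevin step: theta' = theta + h * grad log pi(theta) + sqrt(2h) * N(0, I)
    ula <- function(niter, h, theta0, gradlogpi, noise_sd = 0) {
      d <- length(theta0)
      theta <- matrix(NA, niter, d)
      theta[1, ] <- theta0
      for (t in 2:niter) {
        g <- gradlogpi(theta[t - 1, ]) + rnorm(d, sd = noise_sd)  # possibly noisy gradient
        theta[t, ] <- theta[t - 1, ] + h * g + sqrt(2 * h) * rnorm(d)
      }
      theta
    }

    # standard bivariate Gaussian target, exact versus noisy gradient
    out  <- ula(1e4, h = 0.1, theta0 = rep(3, 2), gradlogpi = function(x) -x)
    outn <- ula(1e4, h = 0.1, theta0 = rep(3, 2), gradlogpi = function(x) -x, noise_sd = 0.5)
    colMeans(out); apply(out, 2, var)  # close to 0 and 1, up to the discretisation bias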

“…this, at first sight very disappointing behavior of the LMC algorithm is, in fact, continuously connected to the exponential convergence of the gradient descent.”

The paper concludes with an interesting parallel drawn between Langevin algorithms and gradient descent algorithms, since the convergence rates turn out to be the same.

g-and-k [or -h] distributions

Posted in Statistics on July 17, 2017 by xi'an

Dennis Prangle released last week an R package called gk, along with an associated arXived paper, for running inference on the g-and-k and g-and-h quantile distributions. As should be clear from my earlier review of Karian and Dudewicz's book on quantile distributions, I am not particularly fond of those distributions, whose construction seems very artificial to me, as it is mostly based on the production of a closed-form quantile function. But I agree they provide a neat benchmark for ABC methods, if nothing else. However, as recently pointed out in our Wasserstein paper with Espen Bernton, Pierre Jacob and Mathieu Gerber, and explained in a post of Pierre's on Statisfaction, the pdf can be easily constructed by numerical means, hence allows for an MCMC resolution, which is also a point made by Dennis in his paper. Using the closed-form derivation of the Normal form of the distribution [i.e., applied to Φ(x)], so that numerical differentiation is not necessary.
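For concreteness, here is a bare-bones R version of mine (not the gk package's implementation) of the g-and-k quantile function with the conventional c=0.8 constant, of simulation by inversion, and of the numerical construction of the pdf alluded to above:

    # g-and-k quantile function: A location, B>0 scale, g skewness, k>-1/2 kurtosis
    qgandk <- function(u, A, B, g, k, c = 0.8) {
      z <- qnorm(u)
      A + B * (1 + c * tanh(g * z / 2)) * (1 + z^2)^k * z
    }

    # simulation is immediate by applying the quantile function to uniforms
    x <- qgandk(runif(1e4), A = 3, B = 1, g = 2, k = 0.5)

    # numerical pdf at a point x0: invert the quantile function for u,
    # then use f(x0) = 1 / Q'(u) with a finite-difference derivative
    dgandk <- function(x0, A, B, g, k, eps = 1e-6) {
      u  <- uniroot(function(u) qgandk(u, A, B, g, k) - x0, c(eps, 1 - eps))$root
      dQ <- (qgandk(u + eps, A, B, g, k) - qgandk(u - eps, A, B, g, k)) / (2 * eps)
      1 / dQ
    }
    dgandk(3, A = 3, B = 1, g = 2, k = 0.5)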

at the Isaac Newton Institute [talks]

Posted in Statistics on July 7, 2017 by xi'an

Here are the slides I edited this week [from previous talks by Pierre and Espen] for the INI Workshop on scalable inference, in connection with our recently completed and submitted paper on ABC with Wasserstein distances.

MCM 2017

Posted in Statistics on July 3, 2017 by xi'an

And thus I am back in Montréal for three days, for MCM 2017, held at HEC Montréal on the campus of the Université de Montréal. My talk is predictably about ABC (what else?!), gathering diverse threads from different talks and papers.

exciting week[s]

Posted in Mountains, pictures, Running, Statistics on June 27, 2017 by xi'an

The past week was quite exciting, despite the heat wave that hit Paris and kept me from sleeping and running! First, I made a two-day visit to Jean-Michel Marin in Montpellier, where we discussed the potential Peer Community In Computational Statistics (PCI Comput Stats) with the people behind PCI Evol Biol at INRA. Hopefully taking shape in the coming months! We also went one evening through a few vineyards in Saint Christol with Jean-Michel and Arnaud, including a long chat with the owner of Domaine Coste Moynier. [Whose domain includes the above parcel with views of Pic Saint-Loup.] And, last but not least, some work planning about approximate MCMC.

On top of this, we submitted our paper on ABC with Wasserstein distances [to be arXived in an extended version in the coming weeks], as well as our revised paper on ABC consistency, thanks to highly constructive comments from the editorial board, which induced a much improved version in my opinion, and we received a very positive response from JCGS for our paper on weak priors for mixtures! Next week should be exciting as well, with BNP 11 taking place in downtown Paris, at École Normale!!!