Archive for Wasserstein distance

ABC in Svalbard [#2]

Posted in Books, Mountains, pictures, R, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , on April 14, 2021 by xi'an

The second day of the ABC wwworkshop got a better start than yesterday [for me] as I managed to bike to Dauphine early enough to watch the end of Gael’s talk and Matias Quiroz’ in full on the Australian side (of zoom). With an interesting take on using frequency-domain (pseudo-)likelihoods in complex models. Followed by two talks by David Frazier from Monash and Chris Drovandi from Brisbane on BSL, the first on misspecification with a finer analysis as to why synthetic likelihood may prove worse: the Mahalanobis distance behind it may get very small and the predictive distribution of the distance may become multimodal. Also pointing out the poor coverage of both ABC and BSL credible intervals. And Chris gave a wide-ranging coverage of summary-free likelihood-free approaches, with examples where they were faring well against some summary-based solutions. Olivier from Grenoble [with a co-author from Monash, keeping up the Australian theme] discussed dimension reductions which could possibly lead to better summary statistics, albeit unrelated with ABC!

Riccardo Corradin considered this most Milanese problem of all problems (!), namely how to draw inference on completely random distributions. The clustering involved in this inference being costly, the authors using our Wasserstein ABC approach on the partitions, with a further link to our ABC-Gibbs algorithm (which Grégoire had just presented) for the tolerance selection. Marko Järvenpää presented an approach related with a just-published paper in Bayesian Analysis. with a notion of noisy likelihood modelled as a Gaussian process. Towards avoiding evaluating the (greedy) likelihood too often, as in the earlier Korrakitara et al. (2014). And coining the term of Bayesian Metropolis-Hastings sampler (as the regular Metropolis (Rosenbluth) is frequentist)! And Pedro Rodrigues discussed using normalising flows in poorly identified (or inverse) models. Raising the issue of validating this approximation to the posterior and connecting with earlier talks.

The afternoon session was a reply of the earliest talks from the Australian mirrors. Clara Grazian gave the first talk yesterday on using and improving a copula-based ABC, introducing empirical likelihood, Gaussian processes and splines. Leading to a question as to whether or not the copula family could be chosen by ABC tools. David Nott raised the issue of conflicting summary statistics. Illustrated by a Poisson example where using the pair made by the empirical mean and the empirical variance  as summary: while the empirical mean is sufficient, conditioning on both leads to a different ABC outcome. Which indirectly relates to a work in progress in our Dauphine group. Anthony Ebert discussed the difficulty of handling state space model parameters with ABC. In an ABCSMC² version, the likelihood is integrated out by a particle filter approximation but leading to difficulties with the associated algorithm, which I somewhat associate with the discrete nature of the approximation, possibly incorrectly. Jacob Priddle’s talked about a whitening version of Bayesian synthetic likelihood. By arguing that the variance of the Monte Carlo approximation to the moments of the Normal synthetic likelihood is much improved when assuming that the components of the summary statistic are independent. I am somewhat puzzled by the proposal, though, in that the whitening matrix need be estimated as well.

Thanks to all colleagues and friends involved in building and running the mirrors and making some exchanges possible despite the distances and time differences! Looking forward a genuine ABC meeting in a reasonable future, and who knows?!, reuniting in Svalbard for real! (The temperature in Longyearbyen today was -14⁰, if this makes some feel better about missing the trip!!!) Rather than starting a new series of “ABC not in…”

likelihood-free and summary-free?

Posted in Books, Mountains, pictures, Statistics, Travel with tags , , , , , , , , , , , , , on March 30, 2021 by xi'an

My friends and coauthors Chris Drovandi and David Frazier have recently arXived a paper entitled A comparison of likelihood-free methods with and without summary statistics. In which they indeed compare these two perspectives on approximate Bayesian methods like ABC and Bayesian synthetic likelihoods.

“A criticism of summary statistic based approaches is that their choice is often ad hoc and there will generally be an  inherent loss of information.”

In ABC methods, the recourse to a summary statistic is often advocated as a “necessary evil” against the greater evil of the curse of dimension, paradoxically providing a faster convergence of the ABC approximation (Fearnhead & Liu, 2018). The authors propose a somewhat generic selection of summary statistics based on [my undergrad mentors!] Gouriéroux’s and Monfort’s indirect inference, using a mixture of Gaussians as their auxiliary model. Summary-free solutions, as in our Wasserstein papers, rely on distances between distributions, hence are functional distances, that can be seen as dimension-free as well (or criticised as infinite dimensional). Chris and David consider energy distances (which sound very much like standard distances, except for averaging over all permutations), maximum mean discrepancy as in Gretton et al. (2012), Cramèr-von Mises distances, and Kullback-Leibler divergences estimated via one-nearest-neighbour formulas, for a univariate sample. I am not aware of any degree of theoretical exploration of these functional approaches towards the precise speed of convergence of the ABC approximation…

“We found that at least one of the full data approaches was competitive with or outperforms ABC with summary statistics across all examples.”

The main part of the paper, besides a survey of the existing solutions, is to compare the performances of these over a few chosen (univariate) examples, with the exact posterior as the golden standard. In the g & k model, the Pima Indian benchmark of ABC studies!, Cramèr does somewhat better. While it does much worse in an M/G/1 example (where Wasserstein does better, and similarly for a stereological extremes example of Bortot et al., 2007). An ordering inversed again for a toad movement model I had not seen before. While the usual provision applies, namely that this is a simulation study on unidimensional data and a small number of parameters, the design of the four comparison experiments is very careful, eliminating versions that are either too costly or too divergence, although this could be potentially criticised for being unrealistic (i.e., when the true posterior is unknown). The computing time is roughly the same across methods, which essentially remove the call to kernel based approximations of the likelihood. Another point of interest is that the distance methods are significantly impacted by transforms on the data, which should not be so for intrinsic distances! Demonstrating the distances are not intrinsic…


Posted in Statistics with tags , , , , , , , on March 24, 2021 by xi'an

An AISTATS 2021 paper by Masahiro Fujisawa,Takeshi Teshima, Issei Sato and Masashi Sugiyama (RIKEN, Tokyo) just appeared on arXiv.  (AISTATS 2021 is again virtual this year.)

“ABC can be sensitive to outliers if a data discrepancy measure is chosen inappropriately (…) In this paper, we propose a novel outlier-robust and computationally-efficient discrepancy measure based on the γ-divergence”

The focus is on measure of robustness for ABC distances as those can be lethal if insufficient summarisation is used. (Note that a referenced paper by Erlis Ruli, Nicola Sartori and Laura Ventura from Padova appeared last year on robust ABC.) The current approach mixes the γ-divergence of Fujisawa and Eguchi, with a k-nearest neighbour density estimator. Which may not prove too costly, of order O(n log n), but also may be a poor if robust approximation, even if it provides an asymptotic unbiasedness and almost surely convergent approximation. These properties are those established in the paper, which only demonstrates convergence in the sample size n to an ABC approximation with the true γ-divergence but with a fixed tolerance ε, when the most recent results are rather concerned with the rates of convergence of ε(n) to zero. (An extensive simulation section compares this approach with several ABC alternatives, incl. ours using the Wasserstein distance. If I read the comparison graphs properly, it does not look as if there is a huge discrepancy between the two approaches under no contamination.) Incidentally, the paper contains a substantial survey section and has a massive reference list, if missing the publication more than a year earlier of our Wasserstein paper in Series B.

your GAN is secretly an energy-based model

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on January 5, 2021 by xi'an

As I was reading this NeurIPS 2020 paper by Che et al., and trying to make sense of it, I came across a citation to our paper Casella, Robert and Wells (2004) on a generalized accept-reject sampling scheme where the proposal changes at each simulation that sounds surprising if appreciated! But after checking this paper also appears as the first reference on the Wikipedia page for rejection sampling, which makes me wonder if many actually read it. (On the side, we mostly wrote this paper on a drive from Baltimore to Ithaca, after JSM 1999.)

“We provide more evidence that it is beneficial to sample from the energy-based model defined both by the generator and the discriminator instead of from the generator only.”

The paper seems to propose a post-processing of the generator output by a GAN, generating from the mixture of both generator and discriminator, via a (unscented) Langevin algorithm. The core idea is that, if p(.) is the true data generating process, g(.) the estimated generator and d(.) the discriminator, then

p(x) ≈ p⁰(x)∝g(x) exp(d(x))

(The approximation would be exact the discriminator optimal.) The authors work with the latent z’s, in the GAN meaning that generating pseudo-data x from g means taking a deterministic transform of z, x=G(z). When considering the above p⁰, a generation from p⁰ can be seen as accept-reject with acceptance probability proportional to exp[d{G(z)}]. (On the side, Lemma 1 is the standard validation for accept-reject sampling schemes.)

Reading this paper made me realised how much the field had evolved since my previous GAN related read. With directions like Metropolis-Hastings GANs and Wasserstein GANs. (And I noticed a “broader impact” section past the conclusion section about possible misuses with societal consequences, which is a new requirement for NeurIPS publications.)

maximal couplings of the Metropolis-Hastings algorithm

Posted in Statistics, University life with tags , , , , , , , , , on November 17, 2020 by xi'an

As a sequel to their JRSS B paper, John O’Leary, Guanyang Wang, and [my friend, co-author and former student!] Pierre E. Jacob have recently posted a follow-up paper on maximal coupling for Metropolis-Hastings algorithms, where maximal is to be understood in terms of the largest possible probability for the coupled chains to be equal, according to the bound set by the coupling inequality. It made me realise that there is a heap of very recent works in this area.

A question that came up when reading the paper with our PhD students is whether or not the coupled chains stay identical after meeting once. When facing two different targets this seems inevitable and indeed Lemma 2 seems to show that no. A strong lemma that does not [need to] state what happens outside the diagonal Δ.

One of the essential tricks is to optimise several kinds of maximal coupling, incl. one for the Bernoullesque choice of moving, as given on p.3.

Algorithm 1 came as a novelty to me as it first seemed (to me!) the two chains may never meet, but this was before I read the small prints of the transition (proposal) kernel being maximally coupled with itself. While Algorithm 2 may be the earliest example of Metropolis-Hastings coupling I have seen, namely in 1999 in Crete, in connection with a talk by Laird Breyer and Gareth Roberts at a workshop of our ESSS network. As explained by the authors, this solution is not always a maximal coupling for the reason that

min(q¹.q²) min(α¹,α²) ≤ min(q¹α¹,q²α²)

(with q for the transition kernel and α for the acceptance probability). Lemma 1 is interesting in that it describes the probability to un-meet (!) as the surface between one of the move densities and the minimum of the two.

The first solution is to couple by plain Accept-Reject with the first chain being the proposed value and if rejected [i.e. not in C] to generate from the remainder or residual of the second target, in a form of completion of acceptance-rejection (accept when above rather than below, i.e. in A or A’). This can be shown to be a maximal coupling. Another coupling using reflection residuals works better but requires some spherical structure in the kernel. A further coupling on the acceptance of the Metropolis-Hastings move seems to bring an extra degree of improvement.

In the introduction, the alternatives about the acceptance probability α(·,·), e.g. Metropolis-Hastings versus Barker, are mentioned but would it make a difference to the preferred maximal coupling when using one or the other?

A further comment is that, in larger dimensions, I mean larger than one!, a Gibbsic form of coupling could be considered. In which case it would certainly decrease the coupling probability but may still speed up the overall convergence by coupling more often. See “maximality is sometimes less important than other properties of a coupling, such as the contraction behavior when a meeting does not occur.” (p.8)

As a final pun, I noted that Vaserstein is not a typo, as Leonid Vaseršteĭn is a Russian-American mathematician, currently at Penn State.