**D**ennis Prangle released last week an R package called gk and an associated arXived paper for running inference on the g-and-k and g-and-h quantile distributions. As should be clear from an earlier review of Karian’s and Dudewicz’s book on quantile distributions, I am not particularly fond of those distributions, whose construction seems very artificial to me, as mostly based on the production of a closed-form quantile function. But I agree they provide a neat benchmark for ABC methods, if nothing else. However, as recently pointed out in our Wasserstein paper with Espen Bernton, Pierre Jacob and Mathieu Gerber, and explained in a post of Pierre’s on Statisfaction, the pdf can be easily constructed by numerical means, hence allows for an MCMC resolution, which is also a point made by Dennis in his paper. Dennis uses the closed-form derivative of the Normal form of the distribution [i.e., applied to Φ(x)], so that numerical differentiation is not necessary.
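To make the "pdf by numerical means" point concrete, here is a minimal Python sketch (not the code of the gk package, which is in R). It assumes the usual g-and-k parametrisation, with quantile function A + B (1 + c tanh(gz/2)) (1 + z²)^k z for z = Φ⁻¹(u) and the conventional c = 0.8, and it assumes parameter values for which this map is increasing in z. The function names are mine. The density at x is then φ(z) divided by the closed-form derivative in z, where z solves the quantile equation, found here by plain bisection.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
C = 0.8  # conventional value of the constant c (an assumption of this sketch)


def gk_z_to_x(z, a, b, g, k, c=C):
    """g-and-k transform of a standard Normal variate z."""
    return a + b * (1 + c * math.tanh(g * z / 2)) * (1 + z * z) ** k * z


def gk_dx_dz(z, b, g, k, c=C):
    """Closed-form derivative of gk_z_to_x with respect to z."""
    skew = 1 + c * math.tanh(g * z / 2)
    term_kurtosis = skew * (1 + (2 * k + 1) * z * z) / (1 + z * z)
    term_skewness = c * g * z / (2 * math.cosh(g * z / 2) ** 2)
    return b * (1 + z * z) ** k * (term_kurtosis + term_skewness)


def gk_quantile(u, a, b, g, k):
    """Quantile function, i.e. the transform applied to the Normal quantile."""
    return gk_z_to_x(STD_NORMAL.inv_cdf(u), a, b, g, k)


def gk_pdf(x, a, b, g, k, lo=-40.0, hi=40.0, tol=1e-12):
    """Density at x: invert the transform by bisection, then change variables."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gk_z_to_x(mid, a, b, g, k) < x:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    z = 0.5 * (lo + hi)
    return STD_NORMAL.pdf(z) / gk_dx_dz(z, b, g, k)
```

With g = k = 0 the transform is linear and `gk_pdf` reduces to the Normal(A, B²) density, which gives a quick sanity check. Any MCMC sampler can then call `gk_pdf` (or its logarithm) as the likelihood contribution of each observation.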

## Archive for Wasserstein distance

## g-and-k [or -h] distributions

Posted in Statistics with tags ABC, ABC de Sevilla, benchmark, Dennis Prangle, g-and-k distributions, MCMC, numerical derivation, numerical integration, quantile distribution, Wasserstein distance on July 17, 2017 by xi'an

## at the Isaac Newton Institute [talks]

Posted in Statistics with tags ABC algorithm, dynamic model, empirical likelihood, INI, Isaac Newton Institute, non-i.i.d. data, summary statistics, Wasserstein distance on July 7, 2017 by xi'an

**H**ere are the slides I edited this week [from previous talks by Pierre and Espen] for the INI Workshop on scalable inference, in connection with our recently completed and submitted paper on ABC with Wasserstein distances:

## MCM 2017

Posted in Statistics with tags ABC, ABC algorithm, ABC consistency, Bayesian model choice, curse of dimensionality, Hilbert curve, MCM 2017, Montréal, population genetics, Québec, random forests, summary statistics, Wasserstein distance on July 3, 2017 by xi'an

## exciting week[s]

Posted in Mountains, pictures, Running, Statistics with tags ABC, ABC validation, École Normale Supérieure, Bayesian nonparametrics, BNP11, Domaine Coste Moynier, Grés de Montpellier, mixtures of distributions, PCI Comput Stats, PCI Evol Biol, Peer Community, Pic Saint Loup, Saint Christol, Université de Montpellier, Wasserstein distance on June 27, 2017 by xi'an

**T**he past week was quite exciting, despite the heat wave that hit Paris and kept me from sleeping and running! First, I made a two-day visit to Jean-Michel Marin in Montpellier, where we discussed the potential Peer Community In Computational Statistics (*PCI Comput Stats*) with the people behind PCI Evol Biol at INRA, hopefully taking shape in the coming months! And went one evening through a few vineyards in Saint Christol with Jean-Michel and Arnaud. Including a long chat with the owner of Domaine Coste Moynier. *[Whose domain includes the above parcel with views of Pic Saint-Loup.]* And last but not least, some work planning about approximate MCMC.

On top of this, we submitted our paper on ABC with Wasserstein distances [to be arXived in an extended version in the coming weeks] and our revised paper on ABC consistency, thanks to highly constructive comments from the editorial board, which induced a much improved version in my opinion, and we received a very positive return from JCGS for our paper on weak priors for mixtures! Next week should be exciting as well, with BNP 11 taking place in downtown Paris, at École Normale!!!

## automated ABC summary combination

Posted in Books, pictures, Statistics, University life with tags ABC, José Miguel Bernardo, Lasso, posterior distribution, semi-automatic ABC, summary statistics, University of Oxford, Wasserstein distance on March 16, 2017 by xi'an

**J**onathan Harrison and Ruth Baker (Oxford University) arXived this morning a paper on the optimal combination of summaries for ABC, in the sense of deriving the proper weights in a Euclidean distance involving all the available summaries. The idea is to find the weights that lead to the maximal distance between prior and posterior, in a way reminiscent of Bernardo’s (1979) maximal information principle. Plus a sparsity penalty à la Lasso. The associated algorithm is sequential in that the weights are updated at each iteration. The paper does not get into theoretical justifications but considers instead several examples with limited numbers of both parameters and summary statistics. Which may highlight the limitations of the approach, in that handling (and eliminating) a large number of parameters may prove impossible this way, when compared with optimisation methods like random forests. Or summary-free distances between empirical distributions like the Wasserstein distance.
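For readers who want to see the two kinds of distances side by side, here is a small Python sketch. It only illustrates the objects involved, not the weight-selection algorithm of the paper. The convention that the weights multiply the squared differences is an assumption of mine, and the Wasserstein function is restricted to univariate samples of equal size, where the distance reduces to comparing order statistics.

```python
import math


def weighted_euclidean(s_obs, s_sim, weights):
    """Weighted Euclidean distance between observed and simulated summaries.

    A zero weight removes the corresponding summary from the comparison.
    """
    return math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, s_obs, s_sim))
    )


def wasserstein_1d(x, y, p=1):
    """p-Wasserstein distance between two univariate samples of equal size.

    In one dimension the optimal coupling matches sorted values, so no
    summary statistic is needed.
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes samples of equal size")
    xs, ys = sorted(x), sorted(y)
    return (sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)) ** (1 / p)
```

In an ABC rejection step, either function can play the role of the distance between the observed and the simulated data, the first one through summaries and the second one through the raw samples.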

## X divergence for approximate inference

Posted in Statistics with tags adaptive importance sampling, divergence, expectation-propagation, Kullback-Leibler divergence, Pima Indians, population Monte Carlo, variational Bayes methods, Wasserstein distance on March 14, 2017 by xi'an

**D**ieng et al. arXived this morning a new version of their paper on using the Χ divergence for variational inference. The Χ divergence essentially is the expectation of the squared ratio of the target distribution over the approximation, under the approximation. It is somewhat related to Expectation Propagation (EP), which aims at the Kullback-Leibler divergence between the target distribution and the approximation, under the target. And to variational Bayes, which is the same thing just the opposite way! The authors also point out a link to our [adaptive] population Monte Carlo paper of 2008. (I wonder at a possible version through Wasserstein distance.)
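In symbols, and with notation that is mine rather than the paper's (p for the target posterior, q for the approximation, θ for the parameter), the three criteria compared above can be written as

$$
D_{\chi^2}(p\,\|\,q) = \mathbb{E}_{q}\!\left[\left(\frac{p(\theta\mid x)}{q(\theta)}\right)^{2}\right] - 1,
\qquad
\mathrm{KL}(p\,\|\,q) = \mathbb{E}_{p}\!\left[\log\frac{p(\theta\mid x)}{q(\theta)}\right],
\qquad
\mathrm{KL}(q\,\|\,p) = \mathbb{E}_{q}\!\left[\log\frac{q(\theta)}{p(\theta\mid x)}\right],
$$

where the first is the Χ divergence, the second the direction targeted by EP, and the third the direction minimised by variational Bayes.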

Some of the arguments in favour of this new version of variational Bayes approximations are that (a) the support of the approximation over-estimates the posterior support; (b) it produces over-dispersed versions; (c) it relates to a well-defined and global objective function; (d) it allows for a sandwich inequality on the model evidence; (e) the function of the [approximation] parameter to be minimised is under the approximation, rather than under the target. The latter allows for a gradient-based optimisation. While one of the applications is on a Bayesian probit model applied to the Pima Indian women dataset [and will thus make James and Nicolas cringe!], the experimental assessment shows lower error rates for this and other benchmarks. Which in my opinion does not tell much about the original Bayesian approach.
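The sandwich inequality in (d) can be checked exactly on a toy example. The Python sketch below is mine, not the authors' code: it takes a latent variable with three values so that all expectations are finite sums, and the numbers in `joint` and `q` are arbitrary illustrative choices. The lower bound is the usual ELBO and the upper bound is (1/n) log E_q[(p(x,z)/q(z))^n] with n = 2, which dominates the log-evidence by Jensen's inequality.

```python
import math


def elbo(joint, q):
    """Evidence lower bound: E_q[log p(x, z) - log q(z)]."""
    return sum(qz * math.log(pz / qz) for pz, qz in zip(joint, q))


def chi_upper_bound(joint, q, n=2):
    """Upper bound (1/n) log E_q[(p(x, z) / q(z))^n], valid for n >= 1."""
    return math.log(sum(qz * (pz / qz) ** n for pz, qz in zip(joint, q))) / n


# Hypothetical joint values p(x, z) at the observed x, for z = 0, 1, 2.
joint = [0.10, 0.05, 0.02]
# Hypothetical approximation q(z), deliberately not proportional to joint.
q = [0.5, 0.3, 0.2]

log_evidence = math.log(sum(joint))
lower = elbo(joint, q)
upper = chi_upper_bound(joint, q)
```

Both bounds collapse onto the log-evidence when q is exactly proportional to the joint, that is, when the approximation equals the posterior.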

## ABC with kernelised regression

Posted in Mountains, pictures, Statistics, Travel, University life with tags 17w5025, ABC, Approximate Bayesian computation, Banff, dimension reduction, Fourier transform, ICML, reproducing kernel Hilbert space, ridge regression, RKHS, summary statistics, Wasserstein distance on February 22, 2017 by xi'an

**T**he exact title of the paper by Jovana Mitrovic, Dino Sejdinovic, and Yee Whye Teh is DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression. It appeared last year in the proceedings of ICML. The idea is to build ABC summaries by way of reproducing kernel Hilbert spaces (RKHS). Regressing such embeddings to the “optimal” choice of summary statistics by kernel ridge regression. With a possibility to derive summary statistics for quantities of interest rather than for the entire parameter vector. The use of RKHS reminds me of Arthur Gretton’s approach to ABC, although I see no mention made of that work in the current paper.

In the RKHS pseudo-linear formulation, the prediction of a parameter value given a sample attached to this value looks like a ridge estimator in classical linear estimation. (I thus wonder at why one would stop at the ridge stage instead of getting the full Bayes treatment!) Things get a bit more involved in the case of parameters (and observations) of interest, as the modelling requires two RKHS, because of the conditioning on the nuisance observations. Or rather three RKHS. Since those involve a maximum mean discrepancy between probability distributions, which defines in turn a sort of intrinsic norm, I also wonder at a Wasserstein version of this approach.
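As a reminder of what the maximum mean discrepancy computes, here is a minimal Python sketch for univariate samples. The Gaussian kernel and its bandwidth are my choices for illustration, not necessarily those of the paper, and the estimator is the simple biased (V-statistic) version that keeps the diagonal terms.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two real numbers."""
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples.

    This is the squared RKHS distance between the two empirical kernel
    mean embeddings, expanded into three averages of kernel evaluations.
    """
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)
```

The cost is quadratic in the sample sizes, which is part of why the computing budget matters for large samples.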

What I find hard to understand in the paper is how a large-dimension large-size sample can be managed by such methods with no visible loss of information and no explosion of the computing budget. The authors mention Fourier features, which does not ring a bell for me, but I wonder how this operates in a general setting, i.e., outside the iid case. The examples do not seem to go into enough detail for me to understand how this massive dimension reduction operates (and they remain at a moderate level in terms of numbers of parameters). I was hoping Jovana Mitrovic could present her work here at the 17w5025 workshop but she sadly could not make it to Banff for lack of funding!
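For what it is worth, one standard construction going under the name of random Fourier features can be sketched in a few lines of Python. Whether the paper uses exactly this variant is an assumption on my part, and the function names below are hypothetical. The idea is that a Gaussian kernel is approximated by the inner product of finite-dimensional random cosine features, so that a whole iid sample is compressed into the average of its feature vectors, at a cost linear in the sample size.

```python
import math
import random


def make_fourier_features(num_features, bandwidth=1.0, seed=0):
    """Return a map x -> random cosine features approximating a Gaussian kernel."""
    rng = random.Random(seed)
    # Frequencies drawn from the spectral density of the Gaussian kernel.
    freqs = [rng.gauss(0.0, 1.0 / bandwidth) for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def features(x):
        return [scale * math.cos(w * x + b) for w, b in zip(freqs, phases)]

    return features


def mean_embedding(sample, features):
    """Average feature vector of a sample: a fixed-size summary of the data."""
    n = len(sample)
    columns = zip(*(features(x) for x in sample))
    return [sum(col) / n for col in columns]


def dot(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))
```

The inner product `dot(features(x), features(y))` approximates exp(-(x - y)² / (2 bandwidth²)), with an error shrinking as the number of features grows. The averaging step relies on exchangeability of the observations, which is where the question about non-iid settings bites.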