Archive for Charlie Geyer

likelihood-free inference by ratio estimation

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on September 9, 2019 by xi'an

“This approach for posterior estimation with generative models mirrors the approach of Gutmann and Hyvärinen (2012) for the estimation of unnormalised models. The main difference is that here we classify between two simulated data sets while Gutmann and Hyvärinen (2012) classified between the observed data and simulated reference data.”

A 2018 arXiv posting by Owen Thomas et al. (including my colleague at Warwick, Rito Dutta, CoI warning!) about estimating the likelihood (and the posterior) when it is intractable. Likelihood-free but not ABC, since the ratio of the likelihood to the marginal is estimated in a non- or semi-parametric (and biased) way. Following Geyer’s fabulous 1994 estimate of an unknown normalising constant via logistic regression, the current paper, which I read in preparation for my discussion in the ABC optimal design in Salzburg, uses probabilistic classification and an exponential family representation of the ratio. Opposing data from the density and data from the marginal, assuming both can be readily produced. The logistic regression minimising the asymptotic classification error is the logistic transform of the log-ratio. For a finite (double) sample, this minimisation thus leads to an empirical version of the ratio. Or to a smooth version if the log-ratio is represented as a linear combination of summary statistics, turning the approximation into an exponential family, which is a clever way to close the loop towards ABC notions. And synthetic likelihood. Although with a difference in estimating the exponential family parameters β(θ) by minimising the classification error, parameters that are indeed conditional on the parameter θ. Actually the paper introduces a further penalisation or regularisation term on those parameters β(θ), which could have been processed by Bayesian Lasso instead. This step is essentially driving the selection of the summaries, except that it is for each value of the parameter θ, at the expense of a cross-validation step. This is quite an original approach, as far as I can tell, but I wonder at the link with more standard density estimation methods, in particular in terms of the precision of the resulting estimate (and the speed of convergence with the sample size, if convergence there is).
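To make the classification trick concrete, here is a minimal R sketch on a toy problem that is not taken from the paper: x|θ ~ N(θ,1) with a N(0,2²) prior on θ, so that the marginal is N(0,5) and the exact log-ratio is quadratic in x. The summaries (x, x²), the sample size, the value θ=1 and the unpenalised `glm` fit are all assumptions made for illustration; the regularised and cross-validated fit of the paper is not reproduced.

```r
# Toy illustration (assumed setup, not the paper's experiments):
# estimate log p(x | theta) - log p(x) by logistic regression.
set.seed(42)
n     <- 5000
theta <- 1                                       # parameter value of interest

x_lik       <- rnorm(n, mean = theta, sd = 1)    # draws from p(x | theta)
prior_draws <- rnorm(n, mean = 0, sd = 2)        # theta' ~ prior
x_marg      <- rnorm(n, mean = prior_draws, sd = 1)  # draws from the marginal p(x)

# Label likelihood draws 1 and marginal draws 0, then classify using the
# summary statistics (x, x^2); equal class sizes make the linear predictor
# an estimate of the log-ratio itself.
dat <- data.frame(x = c(x_lik, x_marg), z = rep(c(1, 0), each = n))
fit <- glm(z ~ x + I(x^2), family = binomial, data = dat)

log_ratio_hat <- function(x_new) {
  as.numeric(predict(fit, newdata = data.frame(x = x_new), type = "link"))
}

# Exact log-ratio, available only because the toy model is Gaussian.
log_ratio_true <- function(x_new) {
  dnorm(x_new, theta, 1, log = TRUE) - dnorm(x_new, 0, sqrt(5), log = TRUE)
}

print(c(estimate = log_ratio_hat(1.5), exact = log_ratio_true(1.5)))
```

Adding the log-prior density at θ to the estimated log-ratio evaluated at the observed data gives an estimate of the log-posterior at θ, and the fit has to be repeated for each θ of interest.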

conditional noise contrastive estimation

Posted in Books, pictures, University life on August 13, 2019 by xi'an

At ICML last year, Ciwan Ceylan and Michael Gutmann presented a new version of noise contrastive estimation to deal with intractable constants. While noise contrastive estimation relies upon a second independent sample to contrast with the observed sample, this approach uses instead a perturbed or noisy version of the original sample, for instance a Normal generation centred at the original datapoint. And eliminates the annoying constant by breaking the (original and noisy) samples into two groups. The probability of belonging to one group or the other then does not depend on the constant, which is a very effective trick. And can be optimised with respect to the parameters of the model of interest. Recovering the score matching function of Hyvärinen (2005). While this is in line with earlier papers by Gutmann and Hyvärinen, this line of reasoning (starting with Charlie Geyer’s logistic regression) never ceases to amaze me!
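A rough R sketch of the idea, under stated assumptions: the unnormalised model is φ(x;λ)=exp(−λx²/2), each datapoint gets a single noisy copy with symmetric Normal noise (so that the noise densities cancel in the ratio), and the loss is taken to be the logistic loss log(1+exp(−G)) with G=log φ(x;λ)−log φ(y;λ). The toy model, the noise scale and the variable names are illustrative choices, not the authors' code.

```r
# Conditional noise contrastive estimation on a toy unnormalised Gaussian.
# Assumed setup: data from N(0, 1/lambda0), model phi(x; lambda) = exp(-lambda x^2 / 2).
set.seed(1)
lambda0 <- 2
n       <- 5000
x <- rnorm(n, sd = 1 / sqrt(lambda0))   # observed sample
y <- x + rnorm(n, sd = 0.5)             # noisy copy centred at each datapoint

# Log of the unnormalised density: the normalising constant never appears.
log_phi <- function(u, lambda) -lambda * u^2 / 2

# Logistic loss for telling the original from its noisy copy. Only the
# ratio phi(x) / phi(y) enters, so the unknown constant cancels.
cnce_loss <- function(lambda) {
  G <- log_phi(x, lambda) - log_phi(y, lambda)
  mean(log1p(exp(-G)))
}

fit        <- optimize(cnce_loss, interval = c(0.01, 20))
lambda_hat <- fit$minimum
print(c(estimate = lambda_hat, truth = lambda0))
```

The loss is convex in λ for this model, which is why a one-dimensional `optimize` call is enough here; a richer model would need a gradient-based optimiser.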

1500 nuances of gan [gan gan style]

Posted in Books, Statistics, University life on February 16, 2018 by xi'an

I recently realised that there is a currently very popular trend in machine learning called GAN [for generative adversarial networks] that strongly connects with ABC, at least in that it relies mostly on the availability of a generative model, i.e., a probability model that can be generated as in x=G(ϵ;θ), to draw inference about θ [or predictions]. For instance, there was a GANs tutorial at NIPS 2016 by Ian Goodfellow and many talks on the topic at recent NIPS, the 1500 in the title referring to the citations of the GAN paper by Goodfellow et al. (2014). (The name adversarial comes from opposing the true model to the generative model in the inference.)

If you remember Jeffreys’s famous pique about classical tests as being based on improbable events that did not happen, GAN, like ABC, is sort of the opposite in that it generates events until the one that was observed happens. More precisely, by generating pseudo-samples and switching parameters θ until these samples get as confused as possible between the data generating (“true”) distribution and the generative one. (In its original incarnation, GAN is indeed an optimisation scheme in θ.) A basic presentation of GAN is that it constructs a function D(x,ϕ) that represents the probability that x came from the true model p versus the generative model, ϕ being the parameter of a neural network trained to this effect, aimed at maximising in ϕ a two-term objective function

E[log D(x,ϕ)] + E[log(1−D(G(ϵ;θ),ϕ))]

where the first expectation is taken under the true model and the second one under the generative model.
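As a worked step, here is the standard result from Goodfellow et al. (2014) for an unrestricted discriminator, written with g_θ for the density of G(ϵ;θ) (a notation introduced here for convenience). For fixed θ, the discriminator that best separates the two models, and the resulting value of the second term, are

```latex
D^\star(x) = \frac{p(x)}{p(x) + g_\theta(x)},
\qquad
\mathbb{E}_{g_\theta}\!\left[\log\bigl(1 - D^\star(x)\bigr)\right]
  = \mathrm{KL}\!\left(g_\theta \,\Big\|\, \tfrac{1}{2}\,(p + g_\theta)\right) - \log 2 .
```

The second term therefore lies between −log 2, reached exactly when g_θ = p and the discriminator is fully confused, and 0, approached when the two models barely overlap.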

“The discriminator tries to best distinguish samples away from the generator. The generator tries to produce samples that are indistinguishable by the discriminator.” Edward

One ABC perception of this technique is that the confusion rate

E[log(1−D(G(ϵ;θ),ϕ))]

is a form of distance between the data and the generative model. Which expectation can be approximated by repeated simulations from this generative model. Which suggests an extension from the optimisation approach to an ABCyesian version by selecting the smallest distances across a range of θ’s simulated from the prior.
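A hedged R sketch of what such an ABC version could look like, with every ingredient an assumption made for illustration: a N(θ,1) generative model, a uniform prior on (−4,6), a plain logistic regression in x standing in for the neural network discriminator, and the in-sample average of log(1−D) over the simulated sample (shifted by log 2 so that full confusion maps to zero) as the distance.

```r
# ABC with a classifier-based distance (toy sketch, assumed setup).
set.seed(7)
n     <- 200
x_obs <- rnorm(n, mean = 1, sd = 1)   # stand-in for the observed data

# Distance for one theta: train a logistic discriminator between observed
# (z = 1) and simulated (z = 0) samples, then average log(1 - D) over the
# simulated sample. Values near 0 mean the classifier is confused.
confusion_distance <- function(theta) {
  x_sim <- rnorm(n, mean = theta, sd = 1)            # x = G(eps; theta)
  dat   <- data.frame(x = c(x_obs, x_sim), z = rep(c(1, 0), each = n))
  # Far-off theta values give (quasi-)separated samples; glm then warns
  # about fitted probabilities of 0 or 1, which is harmless here.
  fit   <- suppressWarnings(glm(z ~ x, family = binomial, data = dat))
  d_sim <- fitted(fit)[dat$z == 0]
  d_sim <- pmin(d_sim, 1 - 1e-8)                     # guard against log(0)
  mean(log(1 - d_sim)) + log(2)
}

n_prior     <- 300
theta_prior <- runif(n_prior, min = -4, max = 6)     # draws from the prior
distances   <- vapply(theta_prior, confusion_distance, numeric(1))

# Keep the 10% of prior draws with the smallest distances.
theta_abc <- theta_prior[order(distances)[1:30]]
print(summary(theta_abc))
```

The 10% acceptance rate plays the role of the ABC tolerance and is as arbitrary here as in any other ABC implementation.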

This notion relates to solutions using classification tools for density ratio estimation, connecting for instance to Gutmann and Hyvärinen (2012). And ultimately to Geyer’s 1992 normalising constant estimator.

Another link between ABC and networks also came out during that trip. Proposed by Bishop (1994), mixture density networks (MDN) are mixture representations of the posterior [with component parameters functions of the data] trained on the prior predictive through a neural network. These MDNs can be trained on the ABC learning table [based on a specific if redundant choice of summary statistics] and used as substitutes for the posterior distribution, which brings an interesting alternative to Simon Wood’s synthetic likelihood. In a paper I missed, Papamakarios and Murray suggest replacing regular ABC with this version…

importance sampling with multiple MCMC sequences

Posted in Mountains, pictures, Statistics, Travel, University life on October 2, 2015 by xi'an

Vivek Roy, Aixin Tan and James Flegal arXived a new paper, Estimating standard errors for importance sampling estimators with multiple Markov chains, where they obtain a central limit theorem and hence standard error estimates when using several MCMC chains to simulate from a mixture distribution as an importance sampling function. Just before I boarded my plane from Amsterdam to Calgary, which gave me the opportunity to read it completely (along with half a dozen other papers, since it is a long flight!). I first thought it was connecting to our AMIS algorithm (on the convergence of which Vivek spent a few frustrating weeks when he visited me at the end of his PhD), because of the mixture structure. This is actually altogether different, in that the mixture is made of unnormalised densities, complex enough to act as an importance sampler, and that, due to this complexity, the components can only be simulated via separate MCMC algorithms. Behind this characterisation lurks the challenging problem of estimating multiple normalising constants. The paper adopts the resolution by reverse logistic regression advocated in Charlie Geyer’s famous 1994 unpublished technical report. Besides the technical difficulties in establishing a CLT in this convoluted setup, the notion of mixing importance sampling and different Markov chains is quite appealing, especially in the domain of “tall” data and of splitting the likelihood in several or even many bits, since the mixture contains most of the information provided by the true posterior and can be corrected by an importance sampling step. In this very setting, I also think more adaptive schemes could be found to determine (estimate?!) the optimal weights of the mixture components.
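For readers unfamiliar with reverse logistic regression, here is a minimal two-component R sketch. It assumes two unnormalised Gaussian densities with known constants (so that the answer can be checked) and i.i.d. draws standing in for the MCMC output; neither assumption comes from the paper, where the chains are dependent and the components far more complex.

```r
# Reverse logistic regression for a ratio of normalising constants
# (toy sketch with i.i.d. draws replacing the MCMC chains).
set.seed(3)
n <- 5000
log_h1 <- function(x) -x^2 / 2   # unnormalised N(0, 1):   c1 = sqrt(2 * pi)
log_h2 <- function(x) -x^2 / 8   # unnormalised N(0, 2^2): c2 = 2 * sqrt(2 * pi)

x1 <- rnorm(n, sd = 1)           # sample from h1 / c1
x2 <- rnorm(n, sd = 2)           # sample from h2 / c2
x  <- c(x1, x2)
z  <- rep(c(0, 1), each = n)     # which component each draw came from

# Pretend the labels are unknown and regress them on the pooled draws:
# P(z = 1 | x) = logistic(eta + log h2(x) - log h1(x)), where
# eta = log(c1 / c2) + log(n2 / n1) is the only free parameter.
off <- log_h2(x) - log_h1(x)
fit <- glm(z ~ 1, family = binomial, offset = off)
eta <- coef(fit)[["(Intercept)"]]

log_c_ratio_hat <- -eta          # estimate of log(c2 / c1), since n1 == n2
print(c(estimate = exp(log_c_ratio_hat), truth = 2))
```

With unequal sample sizes the term log(n2/n1) has to be subtracted from the intercept before changing its sign, and with more than two components the binomial fit becomes a multinomial one.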

quantile functions: mileage may vary

Posted in Books, R, Statistics on May 12, 2015 by xi'an

When experimenting with various quantile functions in R, I was shocked [ok this is a bit excessive, let us say surprised] by how widely the execution times would vary. To the point of blaming a completely different feature of R. Borrowing from Charlie Geyer’s webpage on the topic of probability distributions in R, here is a table of timings for some standard distributions. I ran

u=runif(1e7)
system.time(x<-qcauchy(u))

choosing an arbitrary parameter whenever needed.

Distribution  Function  Time (s)
Cauchy        qcauchy    2.2
Chi-Square    qchisq    43.8
Exponential   qexp       0.95
F             qf        34.2
Gamma         qgamma    37.2
Logistic      qlogis     1.7
Log Normal    qlnorm     2.2
Normal        qnorm      1.4
Student t     qt        31.7
Uniform       qunif      0.86
Weibull       qweibull   2.9
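The whole table can be regenerated in one go with a sketch along the following lines. The parameter values (`df = 3`, `shape = 1.5`, and so on) are arbitrary illustrative choices rather than those behind the figures above, and the sample size is cut to 1e5 to keep the run short, so the absolute times will differ.

```r
# Time each quantile function on the same uniform sample.
n <- 1e5            # the table above used 1e7; reduced here for speed
u <- runif(n)

# One wrapper per distribution; parameter values are arbitrary.
quantile_calls <- list(
  qcauchy  = function(u) qcauchy(u),
  qchisq   = function(u) qchisq(u, df = 3),
  qexp     = function(u) qexp(u),
  qf       = function(u) qf(u, df1 = 3, df2 = 7),
  qgamma   = function(u) qgamma(u, shape = 1.5),
  qlogis   = function(u) qlogis(u),
  qlnorm   = function(u) qlnorm(u),
  qnorm    = function(u) qnorm(u),
  qt       = function(u) qt(u, df = 5),
  qunif    = function(u) qunif(u),
  qweibull = function(u) qweibull(u, shape = 2)
)

# Elapsed time in seconds for each call, slowest first.
timings <- vapply(quantile_calls,
                  function(f) system.time(f(u))[["elapsed"]],
                  numeric(1))
print(sort(timings, decreasing = TRUE))
```

Since a single `system.time` call is noisy, repeating each call a few times and taking the median would give a steadier ranking.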

Of course, it does not mean much in that all the slow distributions are parameterised ones (the Weibull being the exception among those, as it is parameterised and yet fast). Nonetheless, that a chi-square inversion takes 50 times longer than a uniform inversion remains puzzling as to why it is not coded more efficiently. In particular, I was wondering why the chi-square inversion was slower than the Gamma inversion. Rerunning both inversions showed that they are equivalent:

> u=runif(1e7)
> system.time(x<-qgamma(u,shape=1.5))
   user  system elapsed
 21.534   0.016  21.532
> system.time(x<-qchisq(u,df=3))
   user  system elapsed
 21.372   0.008  21.361

Which also shows how variable system.time can be.
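The matching timings are less surprising once one recalls that a chi-square distribution with df degrees of freedom is a Gamma distribution with shape df/2 and scale 2, so both calls amount to the same inversion up to a rescaling. The short check below verifies this identity numerically; that R computes `qchisq` through `qgamma` internally is an assumption here, not something the check proves.

```r
# Chi-square(df = 3) quantiles versus Gamma(shape = 1.5, scale = 2) quantiles.
set.seed(11)
u <- runif(1e4)

x_chisq <- qchisq(u, df = 3)
x_gamma <- qgamma(u, shape = 1.5, scale = 2)

# Equal up to numerical tolerance; equivalently, twice the quantiles of
# the default-rate Gamma(1.5) used in the timing comparison.
same_as_scaled_gamma  <- isTRUE(all.equal(x_chisq, x_gamma))
same_as_doubled_gamma <- isTRUE(all.equal(x_chisq, 2 * qgamma(u, shape = 1.5)))
print(c(same_as_scaled_gamma, same_as_doubled_gamma))
```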