Archive for likelihood-free methods

1500 nuances of gan [gan gan style]

Posted in Books, Statistics, University life on February 16, 2018 by xi'an

I recently realised that there is a currently very popular trend in machine learning called GAN [for generative adversarial networks] that strongly connects with ABC, at least in that it relies mostly on the availability of a generative model, i.e., a probability model from which realisations can be generated as x=G(ϵ;θ), to draw inference about θ [or predictions]. For instance, there was a GANs tutorial at NIPS 2016 by Ian Goodfellow and many talks on the topic at recent NIPS meetings, the 1500 in the title referring to the citations of the GAN paper by Goodfellow et al. (2014). (The name adversarial comes from opposing the true model to the generative model in the inference.)

If you remember Jeffreys’s famous pique about classical tests as being based on improbable events that did not happen, GAN, like ABC, is sort of the opposite in that it generates events until the one that was observed happens. More precisely, it proceeds by generating pseudo-samples and switching parameters θ until these samples get as confused as possible between the data generating (“true”) distribution and the generative one. (In its original incarnation, GAN is indeed an optimisation scheme in θ.) A basic presentation of GAN is that it constructs a function D(x,ϕ) that represents the probability that x came from the true model p versus the generative model, ϕ being the parameter of a neural network trained to this effect, aimed at maximising in ϕ a two-term objective function

E[log D(x,ϕ)] + E[log(1 − D(G(ϵ;θ),ϕ))]

where the first expectation is taken under the true model and the second one under the generative model.
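As a minimal sketch of how the two expectations can be replaced with Monte Carlo averages, the following toy code evaluates the objective for a fixed discriminator. Everything specific in it is an assumption made for illustration: the "true" model is taken as N(2,1), the generator as the location model G(ϵ;θ)=θ+ϵ, and D(x,ϕ) as a plain logistic function rather than a neural network.

```python
import math
import random

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def generator(eps, theta):
    # hypothetical location model x = G(eps; theta) = theta + eps
    return theta + eps

def discriminator(x, phi):
    # hypothetical logistic discriminator D(x, phi), clipped away from 0 and 1
    p = sigmoid(phi[0] + phi[1] * x)
    return min(max(p, 1e-12), 1.0 - 1e-12)

def gan_objective(real, fake, phi):
    # Monte Carlo version of E[log D(x,phi)] + E[log(1 - D(G(eps;theta),phi))]
    term_real = sum(math.log(discriminator(x, phi)) for x in real) / len(real)
    term_fake = sum(math.log(1.0 - discriminator(x, phi)) for x in fake) / len(fake)
    return term_real + term_fake

rng = random.Random(0)
# sample from the "true" model, assumed here to be N(2, 1)
real = [rng.gauss(2.0, 1.0) for _ in range(2000)]
# pseudo-sample from the generative model at theta = 0
fake = [generator(rng.gauss(0.0, 1.0), 0.0) for _ in range(2000)]

# D = 1/2 everywhere: the two samples are fully confused
blind = gan_objective(real, fake, (0.0, 0.0))
# logit 2x - 2 is the log density ratio of N(2,1) against N(0,1)
informed = gan_objective(real, fake, (-2.0, 2.0))
```

A discriminator that cannot tell the samples apart returns −2 log 2, which gives a reference value against which any other choice of ϕ can be compared.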

“The discriminator tries to best distinguish samples away from the generator. The generator tries to produce samples that are indistinguishable by the discriminator.” Edward

One ABC perception of this technique is that the confusion rate

E[log(1 − D(G(ϵ;θ),ϕ))]

is a form of distance between the data and the generative model. Which expectation can be approximated by repeated simulations from this generative model. Which suggests an extension from the optimisation approach to an ABCyesian version by selecting the smallest distances across a range of θ’s simulated from the prior.
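A rough sketch of this ABC reading, under assumptions of my own choosing: the model is N(θ,1), the prior is U(−5,5), and the neural network discriminator is swapped for a crude nearest-mean classifier whose in-sample accuracy measures how well the observed and simulated samples can be told apart. The parameters kept are those whose pseudo-samples are the most confused with the data.

```python
import random

def simulate(theta, n, rng):
    # hypothetical generative model: n draws from N(theta, 1)
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def classification_distance(obs, sim):
    # in-sample accuracy of a nearest-mean classifier separating obs from sim,
    # recentred so that 0 means full confusion and 0.5 full separation
    m_obs = sum(obs) / len(obs)
    m_sim = sum(sim) / len(sim)
    correct = sum(abs(x - m_obs) <= abs(x - m_sim) for x in obs)
    correct += sum(abs(x - m_sim) < abs(x - m_obs) for x in sim)
    return abs(correct / (len(obs) + len(sim)) - 0.5)

def abc_by_confusion(obs, n_prior, keep, rng):
    # simulate theta from the (assumed) U(-5, 5) prior and keep the
    # values with the smallest classification distances
    table = []
    for _ in range(n_prior):
        theta = rng.uniform(-5.0, 5.0)
        sim = simulate(theta, len(obs), rng)
        table.append((classification_distance(obs, sim), theta))
    table.sort()
    return [theta for _, theta in table[:keep]]

rng = random.Random(1)
observed = simulate(1.5, 200, rng)
accepted = abc_by_confusion(observed, n_prior=2000, keep=50, rng=rng)
posterior_mean = sum(accepted) / len(accepted)
```

The nearest-mean rule only detects a shift in location, so it is a placeholder for whatever classifier suits the model at hand.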

This notion relates to solutions using classification tools for density ratio estimation, connecting for instance to Gutmann and Hyvärinen (2012). And ultimately with Geyer’s 1992 normalising constant estimator.

Another link between ABC and networks also came out during that trip. Proposed by Bishop (1994), mixture density networks (MDN) are mixture representations of the posterior [with component parameters functions of the data] trained on the prior predictive through a neural network. These MDNs can be trained on the ABC learning table [based on a specific if redundant choice of summary statistics] and used as substitutes for the posterior distribution, which brings an interesting alternative to Simon Wood’s synthetic likelihood. In a paper I missed, Papamakarios and Murray suggest replacing regular ABC with this version…

ABC in Edinburgh

Posted in Books, Mountains, pictures, Statistics, Travel, University life, Wines on January 25, 2018 by xi'an

As mentioned earlier on the ‘Og, there will be a satellite workshop, ABC in Edinburgh, prior to the main ISBA meeting, taking place on Sunday, 24 June, 2018. The workshop is the latest item in a series of “ABC in…” workshops on approximate Bayesian computation (ABC) and likelihood-free inference, which started with ABC in Paris in 2009! In this iteration, contributed talks and contributed posters can be submitted prior to May 1. (And there is an [extra] registration fee of 50 euros. And the deadline for early registration at ISBA 2018 is March 31, at the rather sharp rate of £380 for ISBA members.)

The workshop is aimed at specialists and novices interested in statistical inference with complex models where exact computation of the likelihood function is not possible. The meeting will bring together researchers and practitioners in approximate Bayesian computation, likelihood-free inference, and related methods to discuss recent work on theoretical underpinnings, computational advances, and applications.

The invited speakers are

I am looking forward to the workshop, having already booked my accommodation in the good City of Edinburgh.

(Disclaimer: I am not part of the scientific committee this round.)


approximate likelihood

Posted in Books, Statistics on September 6, 2017 by xi'an

Today, I read a newly arXived paper by Stephen Gratton on a method called GLASS for General Likelihood Approximate Solution Scheme… The starting point is the same as with ABC or synthetic likelihood, namely a collection of summary statistics and an intractable likelihood. The author proposes to use as a substitute a maximum entropy solution based on these summary statistics and their assumed moments under the theoretical model. What is quite unclear in the paper is whether or not these assumed moments are available in closed form. Otherwise, it would appear as a variant of the synthetic likelihood [aka simulated moments] approach, meaning that the expectations of the summary statistics under the theoretical model and for a given value of the parameter are obtained through Monte Carlo approximations. (All the examples therein allow for closed form expressions.)
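For comparison, here is a minimal sketch of the synthetic likelihood route mentioned above, where the moments of the summaries are obtained by simulation and plugged into a Gaussian density. It is not an implementation of GLASS. The N(θ,1) model, the two summaries (sample mean and log standard deviation), and the grid over θ are all assumptions picked for illustration.

```python
import math
import random

def summaries(x):
    # two hypothetical summary statistics: sample mean and log standard deviation
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    return mean, 0.5 * math.log(var)

def synthetic_loglik(theta, s_obs, n, m, rng):
    # Monte Carlo moments of the summaries under the assumed N(theta, 1) model
    sims = [summaries([rng.gauss(theta, 1.0) for _ in range(n)]) for _ in range(m)]
    mu0 = sum(s[0] for s in sims) / m
    mu1 = sum(s[1] for s in sims) / m
    v00 = sum((s[0] - mu0) ** 2 for s in sims) / (m - 1)
    v11 = sum((s[1] - mu1) ** 2 for s in sims) / (m - 1)
    v01 = sum((s[0] - mu0) * (s[1] - mu1) for s in sims) / (m - 1)
    det = v00 * v11 - v01 ** 2
    d0, d1 = s_obs[0] - mu0, s_obs[1] - mu1
    # quadratic form using the explicit inverse of the 2x2 covariance matrix
    quad = (v11 * d0 ** 2 - 2.0 * v01 * d0 * d1 + v00 * d1 ** 2) / det
    # bivariate Gaussian log-density evaluated at the observed summaries
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

rng = random.Random(2)
data = [rng.gauss(1.0, 1.0) for _ in range(100)]
s_obs = summaries(data)
grid = [i / 4 for i in range(-4, 13)]  # theta from -1 to 3 by steps of 0.25
logliks = [synthetic_loglik(t, s_obs, n=100, m=200, rng=rng) for t in grid]
theta_hat = grid[logliks.index(max(logliks))]
```

When the moments are available in closed form, the simulation loop inside `synthetic_loglik` disappears, which is the distinction the paragraph above is after.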

model misspecification in ABC

Posted in Statistics on August 21, 2017 by xi'an

With David Frazier and Judith Rousseau, we just arXived a paper studying the impact of a misspecified model on the outcome of an ABC run. This is a question that naturally arises when using ABC, but that has not been directly covered in the literature apart from a recently arXived paper by James Ridgway [that was earlier this month commented on the ‘Og]. On the one hand, ABC can be seen as a robust method in that it focuses on the aspects of the assumed model that are translated by the [insufficient] summary statistics and their expectation. And nothing else. It is thus tolerant of departures from the hypothetical model that [almost] preserve those moments. On the other hand, ABC involves a degree of non-parametric estimation of the intractable likelihood, which may sound even more robust, except that the likelihood is estimated from pseudo-data simulated from the “wrong” model in case of misspecification.

In the paper, we examine how the pseudo-true value of the parameter [that is, the value of the parameter of the misspecified model that comes closest to the generating model in terms of Kullback-Leibler divergence] is asymptotically reached by some ABC algorithms like the ABC accept/reject approach and not by others like the popular linear regression [post-simulation] adjustment. Which surprisingly concentrates posterior mass on a completely different pseudo-true value. Exploiting our recent assessment of ABC convergence for well-specified models, we show the above convergence result for a tolerance sequence that decreases to the minimum possible distance [between the true expectation and the misspecified expectation] at a slow enough rate. Or that the sequence of acceptance probabilities goes to zero at the proper speed. In the case of the regression correction, the pseudo-true value is shifted by a quantity that does not converge to zero, because of the misspecification in the expectation of the summary statistics. This is not immensely surprising but we hence get a very different picture when compared with the well-specified case, where regression corrections bring improvement to the asymptotic behaviour of the ABC estimators. This discrepancy between two versions of ABC can be exploited to seek misspecification diagnoses, e.g. through the acceptance rate versus the tolerance level, or via a comparison of the ABC approximations to the posterior expectations of quantities of interest which should diverge at rate √n. In both cases, ABC reference tables/learning bases can be exploited to draw and calibrate a comparison with the well-specified case.
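To make the comparison concrete, here is a small sketch that runs both versions of ABC on the same reference table. The setting is a toy one assumed for illustration and not taken from the paper: the assumed model is N(θ,1), the data are generated with standard deviation 2, the prior is N(0,5²), and the summaries are the sample mean and variance. The regression step is an unweighted linear adjustment, θ − β'(s − s_obs), fitted on the accepted draws.

```python
import math
import random

def summaries(x):
    # sample mean and sample variance as (assumed) summary statistics
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    return mean, var

def abc_reject(s_obs, n, n_sims, keep, rng):
    # reference table under the assumed N(theta, 1) model, N(0, 5^2) prior
    table = []
    for _ in range(n_sims):
        theta = rng.gauss(0.0, 5.0)
        s = summaries([rng.gauss(theta, 1.0) for _ in range(n)])
        dist = math.hypot(s[0] - s_obs[0], s[1] - s_obs[1])
        table.append((dist, theta, s))
    table.sort(key=lambda row: row[0])
    return table[:keep]  # accept the draws closest to the observed summaries

def regression_adjust(accepted, s_obs):
    # least squares of theta on the two summaries, then theta - beta'(s - s_obs)
    k = len(accepted)
    tb = sum(r[1] for r in accepted) / k
    m0 = sum(r[2][0] for r in accepted) / k
    m1 = sum(r[2][1] for r in accepted) / k
    a00 = sum((r[2][0] - m0) ** 2 for r in accepted)
    a11 = sum((r[2][1] - m1) ** 2 for r in accepted)
    a01 = sum((r[2][0] - m0) * (r[2][1] - m1) for r in accepted)
    b0 = sum((r[2][0] - m0) * (r[1] - tb) for r in accepted)
    b1 = sum((r[2][1] - m1) * (r[1] - tb) for r in accepted)
    det = a00 * a11 - a01 ** 2
    beta0 = (a11 * b0 - a01 * b1) / det
    beta1 = (a00 * b1 - a01 * b0) / det
    return [r[1] - beta0 * (r[2][0] - s_obs[0]) - beta1 * (r[2][1] - s_obs[1])
            for r in accepted]

rng = random.Random(3)
n = 50
data = [rng.gauss(1.0, 2.0) for _ in range(n)]  # sd 2, not the assumed sd 1
s_obs = summaries(data)
accepted = abc_reject(s_obs, n, n_sims=5000, keep=100, rng=rng)
raw_mean = sum(r[1] for r in accepted) / len(accepted)
adjusted = regression_adjust(accepted, s_obs)
adj_mean = sum(adjusted) / len(adjusted)
```

Since the simulated variances never get near the observed one, the smallest accepted distance stays bounded away from zero, and the adjustment extrapolates the regression well outside the range of simulated summaries. Comparing `raw_mean` and `adj_mean` over increasing `n` is one way to look at the discrepancy discussed above, with no guarantee that this toy case reproduces the asymptotics of the paper.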

probably ABC [and provably robust]

Posted in Books, pictures, Statistics, Travel on August 8, 2017 by xi'an

Two weeks ago, James Ridgway (formerly CREST) arXived a paper on misspecification and ABC, a topic on which David Frazier, Judith Rousseau and I have been working for a while now [and soon to be arXived as well]. A paper that I re-read on a flight to Amsterdam [hence the above picture], written as a continuation of our earlier paper with David, Gael, and Judith. One specificity of the paper is to use an exponential distribution on the distance between the observed and simulated sample within the ABC distribution. Which reminds me of the resolution by Bissiri, Holmes, and Walker (2016) of the intractability of the likelihood function. James’ paper contains oracle inequalities between the ABC approximation and the genuine distribution of the summary statistics, like a bound on the distance between the expectations of the summary statistics under both models. Which writes down as a sum of a model bias, of two divergences between empirical and theoretical averages, of smoothness penalties, and of a prior impact term. And a similar bound on the expected distance to the oracle estimator of θ under the ABC distribution [with a Lipschitz type assumption also found in our paper]. Which first sounded weird [to me] as I would have expected the true posterior, until it dawned on me that the ABC distribution is the one used for the estimation [a passing strike of over-Bayesianism!]. While the oracle bound could have been used directly to discuss the rate of convergence of the exponential rate λ to zero [with the sample size n], James goes into the interesting alternative direction of setting a prior on λ, an idea that dates back to Olivier Catoni and Peter Grünwald. Or rather a pseudo-posterior on λ, a common occurrence in the PAC-Bayesian literature. In one of his results, James obtains a dependence of λ on the dimension m of the summary [as well as the root dependence on the sample size n], which seems to contradict our earlier independence result, until one realises this scale parameter is associated with a distance variable, itself scaled in m.
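To fix ideas on the exponential distribution on the distance, here is a sketch of an ABC distribution where prior draws are weighted by exp(−λd) instead of being accepted or rejected. It is my own toy rendering and not James' algorithm: the N(θ,1) model, the U(−5,5) prior, the sample mean as summary, and the fixed value of λ are all assumptions.

```python
import math
import random

def exp_kernel_abc(s_obs, lam, n_sims, n, rng):
    # prior draws weighted by exp(-lam * distance) instead of a hard accept/reject
    thetas, logw = [], []
    for _ in range(n_sims):
        theta = rng.uniform(-5.0, 5.0)                    # hypothetical prior
        sim = [rng.gauss(theta, 1.0) for _ in range(n)]   # hypothetical model
        dist = abs(sum(sim) / n - s_obs)                  # distance between sample means
        thetas.append(theta)
        logw.append(-lam * dist)
    top = max(logw)
    w = [math.exp(lw - top) for lw in logw]               # stabilised unnormalised weights
    total = sum(w)
    return thetas, [wi / total for wi in w]

rng = random.Random(4)
thetas, weights = exp_kernel_abc(s_obs=0.7, lam=20.0, n_sims=3000, n=50, rng=rng)
post_mean = sum(t * w for t, w in zip(thetas, weights))
ess = 1.0 / sum(w * w for w in weights)  # effective sample size of the weighted draws
```

Larger values of λ concentrate the weights on the closest simulations and lower the effective sample size, which is the trade-off a prior [or pseudo-posterior] on λ has to arbitrate.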

The paper also contains a non-parametric part, where the parameter θ is the unknown distribution of the data and the summary the data itself. Which is quite surprising as I did not deem it possible to handle non-parametrics with ABC. Especially in a misspecified setting (although I have trouble perceiving what this really means).

“We can use most of the Monte Carlo toolbox available in this context.”

The theoretical parts are a bit heavy on notations and hard to read [as a vacation morning read at least!]. They are followed by a Monte Carlo implementation using SMC-ABC. And pseudo-marginals [at least formally as I do not see how the specific features of pseudo-marginals are more than an augmented representation here]. And adaptive multiple pseudo-samples that reminded me of the Biometrika paper of Anthony Lee and Krys Latuszynski (Warwick). Therefore using indeed most of the toolbox!

machine learning-based approach to likelihood-free inference

Posted in Statistics on March 3, 2017 by xi'an

[Picture: polyptych painting within the TransCanada Pipeline Pavilion, Banff Centre, Banff, March 21, 2012]

At ABC’ory last week, Kyle Cranmer gave an extended talk on estimating the likelihood ratio by classification tools. Connected with a 2015 arXival. The idea is that the likelihood ratio is invariant by a transform s(.) that is monotonic with the likelihood ratio itself. It took me a few minutes (after the talk) to understand what this meant. Because it is a transform that actually depends on the parameter values in the denominator and the numerator of the ratio. For instance the ratio itself is a proper transform in the sense that the likelihood ratio based on the distribution of the likelihood ratio under both parameter values is the same as the original likelihood ratio. Or the (naïve Bayes) probability version of the likelihood ratio. Which reminds me of the invariance in Fearnhead and Prangle (2012) of the Bayes estimate given x and of the Bayes estimate given the Bayes estimate. I also feel there is a connection with Geyer’s logistic regression estimate of normalising constants mentioned several times on the ‘Og. (The paper mentions in the conclusion the connection with this problem.)

Now, back to the paper (which I read the night after the talk to get a global perspective on the approach), the ratio is of course unknown and the implementation therein is to estimate it by a classification method. Estimating thus the probability for a given x to be from one versus the other distribution. Once this estimate is produced, its distributions under both values of the parameter can be estimated by density estimation, hence an estimated likelihood ratio can be produced. With better prospects since this is a one-dimensional quantity. An objection to this derivation is that it intrinsically depends on the pair of parameters θ¹ and θ² used therein. Changing to another pair requires a new ratio, new simulations, and new density estimations. When moving to a continuous collection of parameter values, in a classical setting, the likelihood ratio involves two maxima, which can be formally represented in (3.3) as a maximum over a likelihood ratio based on the estimated densities of likelihood ratios, except that each evaluation of this ratio seems to require another simulation. (Which makes the comparison with ABC more complex than presented in the paper [p.18], since ABC’s major computational hurdle lies in the production of the reference table and to a lesser degree of the local regression, both items that can be recycled for any new dataset.) A smoothing step is then to include the pair of parameters θ¹ and θ² as further inputs of the classifier. There still remains the computational burden of simulating enough values of s(x) towards estimating its density for every new value of θ¹ and θ². And while the projection from x to s(x) does effectively reduce the dimension of the problem to one, the method still aims at estimating with some degree of precision the density of x, so cannot escape the curse of dimensionality. The sleight of hand resides in the classification step, since it is equivalent to estimating the likelihood ratio. I thus fail to understand how and why a poor classifier can then lead to a good approximation of the likelihood ratio “obtained by calibrating s(x)” (p.16). Where calibrating means estimating the density.
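The equivalence between classifying and estimating the ratio can be checked on a toy case. The sketch below is only an illustration under assumptions of mine, not the implementation in the paper: the two distributions are N(0,1) for θ¹ and N(1,1) for θ², the classifier is a logistic regression fitted by Newton steps, and both classes have the same size so that s(x)/(1−s(x)) estimates the likelihood ratio directly. The calibration step [density estimation of s(x)] is left out.

```python
import math
import random

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ys, n_iter=25):
    # Newton iterations for the logistic regression P(y = 1 | x) = sigmoid(a + b x)
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood in a
            g1 += (y - p) * x      # gradient of the log-likelihood in b
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 ** 2
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def ratio_estimate(x, a, b):
    # s(x) / (1 - s(x)) with s(x) = sigmoid(a + b x), valid for equal class sizes
    return math.exp(a + b * x)

rng = random.Random(5)
n = 2000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)] + [rng.gauss(1.0, 1.0) for _ in range(n)]
ys = [1] * n + [0] * n   # label 1 for draws under theta^1, 0 under theta^2
a, b = fit_logistic(xs, ys)

# exact log-ratio of N(0,1) against N(1,1) is 0.5 - x, i.e. (a, b) = (0.5, -1)
true_ratio_at_0 = math.exp(0.5)
est_ratio_at_0 = ratio_estimate(0.0, a, b)
```

In this one-dimensional Gaussian case the logistic family contains the exact ratio, so the classifier is as good as it can get. The concern raised above is about what happens when it is not.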

ABC in Les Diablerets

Posted in Statistics on February 14, 2017 by xi'an

Since I could not download the slides of my ABC course in Les Diablerets in one go, I broke them by chapters as follows. (Warning: there is very little novelty in those slides, except for the final part on consistency.)

Although I did not do it on purpose (!), starting with indirect inference and other methods inspired from econometrics induced some discussion in the first hour of the course with econometricians in the room. Including Elvezio Ronchetti.

I also regretted piling too much material in the alphabet soup, as it was too widespread for a new audience. And as I could not keep the coherence of the earlier parts by going through so many papers at once. Especially since I was a bit knackered after a day of skiing…

I managed to get to the final convergence chapter on the last day, even though I had to skip some of the earlier material. Which should be reorganised anyway as the parts between model choice with random forests and inference with random forests are not fully connected!