Archive for ABC

Bayesian goodness of fit

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , on April 10, 2018 by xi'an


Persi Diaconis and Guanyang Wang have just arXived an interesting reflection on the notion of Bayesian goodness of fit tests. Which is a notion that has always bothered me, in a rather positive sense (!), as

“I also have to confess at the outset to the zeal of a convert, a born again believer in stochastic methods. Last week, Dave Wright reminded me of the advice I had given a graduate student during my algebraic geometry days in the 70’s :`Good Grief, don’t waste your time studying statistics. It’s all cookbook nonsense.’ I take it back! …” David Mumford

The paper starts with a reference to David Mumford, whose paper with Wu and Zhou on exponential “maximum entropy” synthetic distributions is at the source (?) of this paper, and whose name appears in its very title: “A conversation for David Mumford”…, about his conversion from pure (algebraic) maths to applied maths. The issue of (Bayesian) goodness of fit is addressed, with card shuffling examples, the null hypothesis being that the permutation resulting from the shuffling is uniformly distributed if shuffling takes enough time. Interestingly, while the parameter space is compact as a distribution on a finite set, Lindley’s paradox still occurs, namely that the null (the permutation comes from a Uniform) is always accepted provided there is no repetition under a “flat prior”, which is the Dirichlet D(1,…,1) over all permutations. (In this finite setting an improper prior is definitely improper as it does not get proper after accounting for observations. Although I do not understand why the Jeffreys prior is not the Dirichlet(½,…,½) in this case…) When resorting to the exponential family of distributions entertained by Zhou, Wu and Mumford, including the uniform distribution as one of its members, Diaconis and Wang advocate the use of a conjugate prior (exponential family, right?!) to compute a Bayes factor that simplifies into a ratio of two intractable normalising constants. For which the authors suggest using importance sampling, thermodynamic integration, or the exchange algorithm. Except that they rely on the (dreaded) harmonic mean estimator for computing the Bayes factor in the following illustrative section! Due to the finite nature of the space, I presume this estimator still has a finite variance. (Remark 1 calls for convergence results on exchange algorithms, which can be found I think in the just as recent arXival by Christophe Andrieu and co-authors.) An interesting if rare feature of the example processed in the paper is that the sufficient statistic used for the permutation model can be directly simulated from a Multinomial distribution. This is rare as seen when considering the benchmark of Ising models, for which the summary and sufficient statistic cannot be directly simulated. (If only…!) In fine, while I enjoyed the paper a lot, I remain uncertain as to its bearings, since defining an objective alternative for the goodness-of-fit test becomes quickly challenging outside simple enough models.

approximate Bayesian inference under informative sampling

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , on March 30, 2018 by xi'an

In the first issue of this year Biometrika, I spotted a paper with the above title, written by Wang, Kim, and Yang, and thought it was a particular case of ABC. However, when I read it on a rare metro ride to Dauphine, thanks to my hurting knee!, I got increasingly disappointed as the contents had nothing to do with ABC. The purpose of the paper was to derive a consistent and convergent posterior distribution based on a estimator of the parameter θ that is… consistent and convergent under informative sampling. Using for instance a Normal approximation to the sampling distribution of this estimator. Or to the sampling distribution of the pseudo-score function, S(θ) [which pseudo-normality reminded me of Ron Gallant’s approximations and of my comments on them]. The paper then considers a generalisation to the case of estimating equations, U(θ), which may again enjoy a Normal asymptotic distribution. Involving an object that does not make direct Bayesian sense, namely the posterior of the parameter θ given U(θ)…. (The algorithm proposed to generate from this posterior (8) is also a mystery.) Since the approach requires consistent estimators to start with and aims at reproducing frequentist coverage properties, I am thus at a loss as to why this pseudo-Bayesian framework is adopted.

ABCDay [arXivals]

Posted in Books, Statistics, University life with tags , , , , , , on March 2, 2018 by xi'an

A bunch of ABC papers on arXiv yesterday, most of them linked to the incoming Handbook of ABC:

    1. Overview of Approximate Bayesian Computation S. A. Sisson, Y. Fan, M. A. Beaumont
    2. Kernel Recursive ABC: Point Estimation with Intractable Likelihood Takafumi Kajihara, Keisuke Yamazaki, Motonobu Kanagawa, Kenji Fukumizu
    3. High-dimensional ABC D. J. Nott, V. M.-H. Ong, Y. Fan, S. A. Sisson
    4. ABC Samplers Y. Fan, S. A. Sisson


ABCDE for approximate Bayesian conditional density estimation

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , on February 26, 2018 by xi'an

Another arXived paper I surprisingly (?) missed, by George Papamakarios and Iain Murray, on an ABCDE (my acronym!) substitute to ABC for generative models. The paper was reviewed [with reviews made available!] and accepted by NIPS 2016. (Most obviously, I was not one of the reviewers!)

“Conventional ABC algorithms such as the above suffer from three drawbacks. First, they only represent the parameter posterior as a set of (possibly weighted or correlated) samples [for which] it is not obvious how to perform some other computations using samples, such as combining posteriors from two separate analyses. Second, the parameter samples do not come from the correct Bayesian posterior (…) Third, as the ε-tolerance is reduced, it can become impractical to simulate the model enough times to match the observed data even once [when] simulations are expensive to perform”

The above criticisms are a wee bit overly harsh as, well…, Monte Carlo approximations remain a solution worth considering for all Bayesian purposes!, while the approximation [replacing the data with a ball] in ABC is replaced with an approximation of the true posterior as a mixture. Both requiring repeated [and likely expensive] simulations. The alternative is in iteratively simulating from pseudo-predictives towards learning better pseudo-posteriors, then used as new proposals at the next iteration modulo an importance sampling correction.  The approximation to the posterior chosen therein is a mixture density network, namely a mixture distribution with parameters obtained as neural networks based on the simulated pseudo-observations. Which the authors claim [p.4] requires no tuning. (Still, there are several aspects to tune, from the number of components to the hyper-parameter λ [p.11, eqn (35)], to the structure of the neural network [20 tanh? 50 tanh?], to the number of iterations, to the amount of X checking. As usual in NIPS papers, it is difficult to assess how arbitrary the choices made in the experiments are. Unless one starts experimenting with the codes provided.) All in all, I find the paper nonetheless exciting enough (!) to now start a summer student project on it in Dauphine and hope to check the performances of ABCDE on different models, as well as comparing this ABC implementation with a synthetic likelihood version.

 As an addendum, let me point out the very pertinent analysis of this paper by Dennis Prangle, 18 months ago!

1500 nuances of gan [gan gan style]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on February 16, 2018 by xi'an

I recently realised that there is a currently very popular trend in machine learning called GAN [for generative adversarial networks] that strongly connects with ABC, at least in that it relies mostly on the availability of a generative model, i.e., a probability model that can be generated as in x=G(ϵ;θ), to draw inference about θ [or predictions]. For instance, there was a GANs tutorial at NIPS 2016 by Ian Goodfellow and many talks on the topic at recent NIPS, the 1500 in the title referring to the citations of the GAN paper by Goodfellow et al. (2014). (The name adversarial comes from opposing true model to generative model in the inference. )

If you remember Jeffreys‘s famous pique about classical tests as being based on improbable events that did not happen, GAN, like ABC,  is sort of the opposite in that it generates events until the one that was observed happens. More precisely, by generating pseudo-samples and switching parameters θ until these samples get as confused as possible between the data generating (“true”) distribution and the generative one. (In its original incarnation, GAN is indeed an optimisation scheme in θ.) A basic presentation of GAN is that it constructs a function D(x,ϕ) that represents the probability that x came from the true model p versus the generative model, ϕ being the parameter of a neural network trained to this effect, aimed at minimising in ϕ a two-term objective function

E[log D(x,ϕ)]+E[log(1D(G(ϵ;θ),ϕ))]

where the first expectation is taken under the true model and the second one under the generative model.

“The discriminator tries to best distinguish samples away from the generator. The generator tries to produce samples that are indistinguishable by the discriminator.” Edward

One ABC perception of this technique is that the confusion rate


is a form of distance between the data and the generative model. Which expectation can be approximated by repeated simulations from this generative model. Which suggests an extension from the optimisation approach to a ABCyesian version by selecting the smallest distances across a range of θ‘s simulated from the prior.

This notion relates to solution using classification tools as density ratio estimation, connecting for instance to Gutmann and Hyvärinen (2012). And ultimately with Geyer’s 1992 normalising constant estimator.

Another link between ABC and networks also came out during that trip. Proposed by Bishop (1994), mixture density networks (MDN) are mixture representations of the posterior [with component parameters functions of the data] trained on the prior predictive through a neural network. These MDNs can be trained on the ABC learning table [based on a specific if redundant choice of summary statistics] and used as substitutes to the posterior distribution, which brings an interesting alternative to Simon Wood’s synthetic likelihood. In a paper I missed Papamakarios and Murray suggest replacing regular ABC with this version…

El asiedo [book review]

Posted in Books, pictures, Travel, Wines with tags , , , , , , , , , on January 13, 2018 by xi'an

Just finished this long book by Arturo Pérez-Reverte that I bought [in its French translation] after reading the fascinating Dos de Mayo about the rebellion of the people of Madrid against the Napoleonian occupants. This book, The Siege, is just fantastic, more literary than Dos de Mayo and a mix of different genres, from the military to the historical, to the criminal, to the chess, to the speculative, to the romantic novel..! There are a few major characters, a police investigator, a trading company head, a corsair, a French canon engineer, a guerilla, with a well-defined unique location, the city of Cádiz under [land] siege by the French troops, but with access to the sea thanks to the British Navy. The serial killer part is certainly not the best item in the plot [as often with serial killer stories!], as it slowly drifts to the supernatural, borrowing from Laplace and Condorcet to lead to perfect predictions of where and when French bombs will fall. The historical part also appears to be rather biased against the British forces, if this opinion page is to be believed, towards a nationalist narrative making the Spanish guerilla resistance bigger and stronger than it actually was. But I still read the story with fascination and it kept me awake past my usual bedtime for several nights as I could not let the story go!

ABC forecasts

Posted in Books, pictures, Statistics with tags , , , , , , , , on January 9, 2018 by xi'an

My friends and co-authors David Frazier, Gael Martin, Brendan McCabe, and Worapree Maneesoonthorn arXived a paper on ABC forecasting at the turn of the year. ABC prediction is a natural extension of ABC inference in that, provided the full conditional of a future observation given past data and parameters is available but the posterior is not, ABC simulations of the parameters induce an approximation of the predictive. The paper thus considers the impact of this extension on the precision of the predictions. And argues that it is possible that this approximation is preferable to running MCMC in some settings. A first interesting result is that using ABC and hence conditioning on an insufficient summary statistic has no asymptotic impact on the resulting prediction, provided Bayesian concentration of the corresponding posterior takes place as in our convergence paper under revision.

“…conditioning inference about θ on η(y) rather than y makes no difference to the probabilistic statements made about [future observations]”

The above result holds both in terms of convergence in total variation and for proper scoring rules. Even though there is always a loss in accuracy in using ABC. Now, one may think this is a direct consequence of our (and others) earlier convergence results, but numerical experiments on standard time series show the distinct feature that, while the [MCMC] posterior and ABC posterior distributions on the parameters clearly differ, the predictives are more or less identical! With a potential speed gain in using ABC, although comparing parallel ABC versus non-parallel MCMC is rather delicate. For instance, a preliminary parallel ABC could be run as a burnin’ step for parallel MCMC, since all chains would then be roughly in the stationary regime. Another interesting outcome of these experiments is a case when the summary statistics produces a non-consistent ABC posterior, but still leads to a very similar predictive, as shown on this graph.This unexpected accuracy in prediction may further be exploited in state space models, towards producing particle algorithms that are greatly accelerated. Of course, an easy objection to this acceleration is that the impact of the approximation is unknown and un-assessed. However, such an acceleration leaves room for multiple implementations, possibly with different sets of summaries, to check for consistency over replicates.