Archive for ABC

Bayesian intelligence in Warwick

Posted in pictures, Statistics, Travel, University life, Wines on February 18, 2019 by xi'an

This is an announcement for an exciting CRiSM Day in Warwick on 20 March 2019, with speakers

10:00-11:00 Xiao-Li Meng (Harvard): “Artificial Bayesian Monte Carlo Integration: A Practical Resolution to the Bayesian (Normalizing Constant) Paradox”

11:00-12:00 Julien Stoehr (Dauphine): “Gibbs sampling and ABC”

14:00-15:00 Arthur Ulysse Jacot-Guillarmod (École Polytechnique Fédérale de Lausanne): “Neural Tangent Kernel: Convergence and Generalization of Deep Neural Networks”

15:00-16:00 Antonietta Mira (Università della Svizzera italiana e Università degli studi dell’Insubria): “Bayesian identifications of the data intrinsic dimensions”

[whose abstracts are on the workshop webpage] and free attendance. The title for the workshop mentions Bayesian Intelligence: this obviously includes human intelligence and not just AI!

a pen for ABC

Posted in Books, pictures, Statistics, Travel, University life on February 13, 2019 by xi'an

Among the flurry of papers arXived around the ICML 2019 deadline, I read on my way back from Oxford a paper by Wiqvist et al. on learning summary statistics for ABC by neural nets. It points to another recent paper by Jiang et al. (2017, Statistica Sinica), which constructed a neural network for predicting each component of the parameter vector based on the input (raw) data, as an automated non-parametric regression of sorts. Creel (2017) does the same but with summary statistics. The current paper builds up from Jiang et al. (2017), by adding the constraint that exchangeability and partial exchangeability features should be reflected by the neural net prediction function. With applications to Markovian models. Due to a factorisation theorem for d-block invariant models, the authors impose partial exchangeability for order d Markov models by combining two neural networks that end up satisfying this factorisation. The concept is exemplified for one-dimensional g-and-k distributions, alpha-stable distributions, both of which are made of independent observations, and the AR(2) and MA(2) models, as in our 2012 ABC survey paper. Since the latter is not Markovian the authors experiment with different orders and reach the conclusion that an order of 10 is most appropriate, although this may be impacted by being able to handle the true likelihood.
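To make the two-network construction more concrete, here is a minimal sketch of how I understand such a partially exchangeable predictor of order d: an inner network is applied to every window of d+1 consecutive observations, the outputs are summed, and an outer network maps the first d observations plus that sum to the prediction. The layer sizes, the tanh activations, and the helper names (mlp, forward, pen_predict) are my own assumptions, and the weights are random rather than trained, so this only illustrates the invariance and is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Random weights for a small fully connected net (untrained, illustration only)."""
    return [(rng.normal(size=(m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Apply the net: tanh on hidden layers, linear output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

def pen_predict(x, d, inner, outer):
    """Order-d partially exchangeable prediction for a univariate series x."""
    n = len(x)
    # row j holds the window (x[j], ..., x[j+d]); shape (n - d, d + 1)
    windows = np.stack([x[i:n - d + i] for i in range(d + 1)], axis=1)
    pooled = forward(inner, windows).sum(axis=0)   # sum over windows: order is lost
    return forward(outer, np.concatenate([x[:d], pooled]))

d = 1
inner = mlp([d + 1, 16, 4])
outer = mlp([d + 4, 16, 2])          # 2 outputs, e.g. the two parameters of an AR(2)
x = np.array([1., 2., 3., 1., 5., 3., 7.])
# swap the blocks (1,2,3) and (1,5,3), which share their first and last value
x_switched = np.array([1., 5., 3., 1., 2., 3., 7.])
print(np.allclose(pen_predict(x, d, inner, outer),
                  pen_predict(x_switched, d, inner, outer)))
```

With d=0 the first argument of the outer network is empty and the construction reduces to a plain permutation-invariant (sum-pooled) network, which is the exchangeable case used for the iid examples.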

information maximising neural networks summaries

Posted in pictures, Statistics on February 6, 2019 by xi'an

After missing the blood moon eclipse last night, I had a meeting today at the Paris observatory (IAP), where we discussed an ABC proposal made by Tom Charnock, Guilhem Lavaux, and Benjamin Wandelt from this institute.

“We introduce a simulation-based machine learning technique that trains artificial neural networks to find non-linear functionals of data that maximise Fisher information : information maximising neural networks.” T. Charnock et al., 2018
The paper is centred on the determination of “optimal” summary statistics. With the goal of finding “transformation which maps the data to compressed summaries whilst conserving Fisher information [of the original data]”. Which sounds like looking for an efficient summary and hence impossible in non-exponential cases. As seen from the description in (2.1), the assumed distribution of the summary is Normal, with mean μ(θ) and covariance matrix C(θ) that are implicit transforms of the parameter θ. In that respect, the approach looks similar to the synthetic likelihood proposal of Wood (2010). From which an unusual form of Fisher information can be derived, as μ(θ)’C(θ)⁻¹μ(θ)… A neural net is trained to optimise this information criterion at a given (so-called fiducial) value of θ, in terms of a set of summaries of the same dimension as the data. Which means the information contained in the whole data (likelihood) is not necessarily recovered, linking with this comment from Edward Ionides (in a set of lectures at Wharton).
“Even summary statistics derived by careful scientific or statistical reasoning have been found surprisingly uninformative compared to the whole data likelihood in both scientific investigations (Shrestha et al., 2011) and simulation experiments (Fasiolo et al., 2016)” E. Ionides, slides, 2017
The maximal Fisher information obtained in this manner is then used in a subsequent ABC step as the natural metric for the distance between the observed and simulated data. (Begging the question as to why being maximal is necessarily optimal.) Another question is about the choice of the fiducial parameter, which choice should be tested by for instance iterating the algorithm a few steps. But having to run simulations for a single value of the parameter is certainly a great selling point!
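As a rough illustration of the criterion being maximised, here is a sketch that estimates a Gaussian Fisher information from simulations run at a single fiducial value. It assumes the covariance is frozen at the fiducial value, in which case the Fisher information of a Normal summary is the standard quadratic form of the gradient of the mean in the inverse covariance. The toy simulator (x ~ N(0, θ)), the hand-picked summary standing in for the neural net output, and the finite-difference step h are all my own assumptions and not taken from the paper.

```python
import numpy as np

def simulate_summaries(theta, n_sims, n_obs, seed):
    """Hypothetical toy simulator: x ~ N(0, theta), summary s = mean(x^2)."""
    rng = np.random.default_rng(seed)
    x = np.sqrt(theta) * rng.standard_normal((n_sims, n_obs))
    return (x ** 2).mean(axis=1, keepdims=True)      # shape (n_sims, 1)

def gaussian_fisher(theta0, h=0.01, n_sims=2000, n_obs=100, seed=1):
    """Estimate dmu' C^{-1} dmu at the fiducial value theta0 (scalar parameter)."""
    s0 = simulate_summaries(theta0, n_sims, n_obs, seed)
    C = np.atleast_2d(np.cov(s0, rowvar=False))      # covariance of the summary
    # central finite difference of the mean, reusing the seed (common random numbers)
    s_plus = simulate_summaries(theta0 + h, n_sims, n_obs, seed)
    s_minus = simulate_summaries(theta0 - h, n_sims, n_obs, seed)
    dmu = ((s_plus - s_minus).mean(axis=0) / (2 * h)).reshape(-1, 1)
    return (dmu.T @ np.linalg.solve(C, dmu)).item()

print(gaussian_fisher(1.0))
```

In this toy case the summary is sufficient, so the estimate should land near the exact Fisher information n/(2θ²) of the full sample; with a less fortunate summary the value drops, which is the loss the training step tries to minimise.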

Computational Bayesian Statistics [book review]

Posted in Books, Statistics on February 1, 2019 by xi'an

This Cambridge University Press book by M. Antónia Amaral Turkman, Carlos Daniel Paulino, and Peter Müller is an enlarged translation of a set of lecture notes in Portuguese. (Warning: I have known Peter Müller since his PhD years at Purdue University and cannot pretend to perfect objectivity. For one thing, Peter once brought me frozen-solid beer: revenge can also be served cold!) Which reminds me of my 1994 French edition of Méthodes de Monte Carlo par chaînes de Markov, considerably upgraded into Monte Carlo Statistical Methods (1998) thanks to the input of George Casella. (Re-warning: As an author of books on the same topic(s), I can even less pretend to objectivity.)

“The “great idea” behind the development of computational Bayesian statistics is the recognition that Bayesian inference can be implemented by way of simulation from the posterior distribution.”

The book is written from a strong, almost militant, subjective Bayesian perspective (as, e.g., when half-Bayesians are mentioned!). Subjective (and militant) as in Dennis Lindley’s writings, eminently quoted therein. As well as in Tony O’Hagan’s. Arguing that the sole notion of a Bayesian estimator is the entire posterior distribution. Unless one brings in a loss function. The book also discusses the Bayes factor in a critical manner, which is fine from my perspective. (Although the ban on improper priors makes its appearance in a very indirect way at the end of the last exercise of the first chapter.)

Somewhat at odds with the subjectivist stance of the previous chapter, the chapter on prior construction only considers non-informative and conjugate priors. Which, while understandable in an introductory book, is a wee bit disappointing. (When mentioning Jeffreys’ prior in multidimensional settings, the authors allude to using univariate Jeffreys’ rules for the marginal prior distributions, which is not a well-defined concept or else Bernardo’s and Berger’s reference priors would not have been considered.) The chapter also mentions the likelihood principle at the end of the last exercise, without a mention of the debate about its derivation by Birnbaum. Or Deborah Mayo’s recent reassessment of the strong likelihood principle. The following chapter is a sequence of illustrations in classical exponential family models, classical in that it is found in many Bayesian textbooks. (Except for the Poison model found in Exercise 3.3!)

Nothing to complain (!) about the introduction of Monte Carlo methods in the next chapter, especially about the notion of inference by Monte Carlo methods. And the illustration by Bayesian design. The chapter also introduces Rao-Blackwellisation [prior to introducing Gibbs sampling!]. And the simplest form of bridge sampling. (Resuscitating the weighted bootstrap of Gelfand and Smith (1990) may not be particularly urgent for an introduction to the topic.) There is furthermore a section on sequential Monte Carlo, including the Kalman filter and particle filters, in the spirit of Pitt and Shephard (1999). This chapter is thus rather ambitious in the amount of material covered with a mere 25 pages. Consensus Monte Carlo is even mentioned in the exercise section.

“This and other aspects that could be criticized should not prevent one from using this [Bayes factor] method in some contexts, with due caution.”

Chapter 5 turns back to inference with model assessment. Using Bayesian p-values for model assessment. (With a harmonic mean spotted in Example 5.1!, with no warning about the risks, except later in 5.3.2.) And model comparison. Presenting the whole collection of xIC information criteria, from AIC to WAIC, including a criticism of DIC. The chapter feels somewhat inconclusive but methinks this is the right feeling on the current state of the methodology for running inference about the model itself.

“Hint: There is a very easy answer.”

Chapter 6 is also a mostly standard introduction to Metropolis-Hastings algorithms and the Gibbs sampler. (The argument given later of a Metropolis-Hastings algorithm with acceptance probability one does not work.) The Gibbs section also mentions demarginalization as a [latent or auxiliary variable] way to simulate from complex distributions [as we do], but without defining the notion. It also references the precursor paper of Tanner and Wong (1987). The chapter further covers slice sampling and Hamiltonian Monte Carlo, the latter with sufficient details to lead to reproducible implementations. Followed by another standard section on convergence assessment, returning to the 1990’s feud of single versus multiple chain(s). The exercise section gets much larger than in earlier chapters with several pages dedicated to most problems. Including one on ABC, maybe not very helpful in this context!

“…dimension padding (…) is essentially all that is to be said about the reversible jump. The rest are details.”

The next chapter is (somewhat logically) the follow-up for trans-dimensional problems and marginal likelihood approximations. Including Chib’s (1995) method [with no warning about potential biases], the spike & slab approach of George and McCulloch (1993) that I remember reading in a café at the University of Wyoming!, the somewhat antiquated MC³ of Madigan and York (1995). And then the much more recent array of Bayesian lasso techniques. The trans-dimensional issues are covered by the pseudo-priors of Carlin and Chib (1995) and the reversible jump MCMC approach of Green (1995), the latter being much more widely employed in the literature, albeit difficult to tune [and even to comprehensively describe, as shown by the algorithmic representation in the book] and only recommended for a large number of models under comparison. Once again the exercise section is most detailed, with recent entries like the EM-like variable selection algorithm of Ročková and George (2014).

The book also includes a chapter on analytical approximations, which is also the case in ours [with George Casella] despite my reluctance to bring them next to exact (simulation) methods. The central object is the INLA methodology of Rue et al. (2009) [absent from our book for obvious calendar reasons, although Laplace and saddlepoint approximations are found there as well]. With a reasonable amount of details, although stopping short of implementable reproducibility. Variational Bayes also makes an appearance, mostly following the very recent Blei et al. (2017).

The gem and originality of the book are primarily to be found in the final and ninth chapter where four software packages are described, all with interfaces to R: OpenBUGS, JAGS, BayesX, and Stan, plus R-INLA which is processed in the second half of the chapter (because this is not a simulation method). As in the remainder of the book, the illustrations are related to medical applications. Worth mentioning is the reminder that BUGS came in parallel with Gelfand and Smith (1990) Gibbs sampler rather than as a consequence. Even though the formalisation of the Markov chain Monte Carlo principle by the latter helped in boosting the power of this software. (I also appreciated the mention made of Sylvia Richardson’s role in this story.) Since every package is illustrated in depth with relevant code and output, and even with the shortest possible description of its principle and modus vivendi, the chapter is 60 pages long [and missing a comparative conclusion]. Given my total ignorance of the very existence of the BayesX software, I am wondering at the relevance of its inclusion in this description rather than, say, other general R packages developed by authors of books such as Peter Rossi. The chapter also includes a description of CODA, with an R version developed by Martin Plummer [now a Warwick colleague].

In conclusion, this is a high-quality and all-inclusive introduction to Bayesian statistics and its computational aspects. By comparison, I find it much more ambitious and informative than Albert’s. If somewhat less pedagogical than the thicker book of Richard McElreath. (The repeated references to Paulino et al. (2018) in the text do not strike me as particularly useful given that this other book is written in Portuguese. Unless an English translation is in preparation.)

Disclaimer: this book was sent to me by CUP for endorsement and here is what I wrote in reply for a back-cover entry:

An introduction to computational Bayesian statistics cooked to perfection, with the right mix of ingredients, from the spirited defense of the Bayesian approach, to the description of the tools of the Bayesian trade, to a definitely broad and very much up-to-date presentation of Monte Carlo and Laplace approximation methods, to a helpful description of the most common software. And spiced up with critical perspectives on some common practices and a healthy focus on model assessment and model selection. Highly recommended on the menu of Bayesian textbooks!

And this review is likely to appear in CHANCE, in my book reviews column.

prepaid ABC

Posted in Books, pictures, Statistics, University life on January 16, 2019 by xi'an

Merijn Mestdagh, Stijn Verdonck, Kristof Meers, Tim Loossens, and Francis Tuerlinckx from the KU Leuven, some of whom I met during a visit to its Walloon counterpart in Louvain-la-Neuve, proposed and arXived a new likelihood-free approach based on saving simulations on a large scale for future users. Future users interested in the same model. The very same model. This makes the proposal quite puzzling as I have no idea as to when situations with exactly the same experimental conditions, up to the sample size, repeat over and over again. Or even just repeat once. (Some particular settings may accommodate different sample sizes and the same prepaid database, but others as in genetics clearly do not.) I am sufficiently puzzled to suspect I have missed the message of the paper.

“In various fields, statistical models of interest are analytically intractable. As a result, statistical inference is greatly hampered by computational constraints. However, given a model, different users with different data are likely to perform similar computations. Computations done by one user are potentially useful for other users with different data sets. We propose a pooling of resources across researchers to capitalize on this. More specifically, we preemptively chart out the entire space of possible model outcomes in a prepaid database. Using advanced interpolation techniques, any individual estimation problem can now be solved on the spot. The prepaid method can easily accommodate different priors as well as constraints on the parameters. We created prepaid databases for three challenging models and demonstrate how they can be distributed through an online parameter estimation service. Our method outperforms state-of-the-art estimation techniques in both speed (with a 23,000 to 100,000-fold speed up) and accuracy, and is able to handle previously quasi inestimable models.”

I foresee potential difficulties with this proposal, like compelling all future users to rely on the same summary statistics, on the same prior distributions (the “representative amount of parameter values”), and requiring a massive storage capacity. Plus relying at its early stage on the most rudimentary form of an ABC algorithm (although not acknowledged as such), namely the rejection one. When reading the description in the paper, the proposed method indeed selects the parameters (simulated from a prior or a grid) that are producing pseudo-observations that are closest to the actual observations (or their summaries s). The subsample thus constructed is used to derive a (local) non-parametric or machine-learning predictor s=f(θ). From which a point estimator is deduced by minimising in θ a deviance d(s⁰,f(θ)).
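The three steps as I read them (rejection on a stored table, local predictor s=f(θ), minimisation of the deviance) can be sketched as follows. Everything specific here is an assumption of mine for illustration: the toy Normal mean model, the table entries being summaries of large simulated samples, the straight-line predictor fitted by least squares in place of a non-parametric or machine-learning one, the squared-error deviance, and the grid search.

```python
import numpy as np

rng = np.random.default_rng(2)

def build_prepaid_table(n_table=5000, n_sim=2000):
    """Simulate once and store (theta, summary) pairs.

    Hypothetical toy model: x ~ N(theta, 1), summary = mean of n_sim draws,
    simulated directly from its N(theta, 1/n_sim) distribution.
    """
    theta = rng.uniform(-5.0, 5.0, size=n_table)
    s = theta + rng.standard_normal(n_table) / np.sqrt(n_sim)
    return theta, s

def prepaid_estimate(s_obs, theta, s, k=100):
    # rejection step: keep the k stored parameters whose summaries are closest to s_obs
    keep = np.argsort(np.abs(s - s_obs))[:k]
    th, ss = theta[keep], s[keep]
    # local predictor s = f(theta): here a straight line fitted by least squares
    slope, intercept = np.polyfit(th, ss, deg=1)
    # point estimate: minimise the squared deviance over a grid spanning the kept thetas
    grid = np.linspace(th.min(), th.max(), 1001)
    deviance = (s_obs - (intercept + slope * grid)) ** 2
    return grid[np.argmin(deviance)]

theta_table, s_table = build_prepaid_table()      # paid once
print(prepaid_estimate(1.3, theta_table, s_table))   # reused for each new dataset
```

The table is built once and only the last call is repeated per dataset, which is where the speed-up comes from, and also why the stored summaries and the range of θ are frozen for all later users.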

The paper does not expand much on the theoretical justifications of the approach (including the appendix that covers a formal situation where the prepaid grid conveniently covers the observed statistics). And thus does not explain on which basis confidence intervals should offer nominal coverage for the prepaid method. Instead, the paper runs comparisons with Simon Wood’s (2010) synthetic likelihood maximisation (Ricker model with three parameters), the rejection ABC algorithm (species dispersion trait model with four parameters), while the Leaky Competing Accumulator (with four parameters as well) seemingly enjoys no alternative. Which is strange since the first step of the prepaid algorithm is an ABC step, but I am unfamiliar with this model. Unsurprisingly, in all these cases, given that the simulation has been done prior to the computing time for the prepaid method and not for either synthetic likelihood or ABC, the former enjoys a massive advantage from the start.

“The prepaid method can be used for a very large number of observations, contrary to the synthetic likelihood or ABC methods. The use of very large simulated data sets allows investigation of large-sample properties of the estimator”

To return to the general proposal and my major reservation or misunderstanding, for different experiments, the (true or pseudo-true) value of the parameter will not be the same, I presume, and hence the region of interest [or grid] will differ. While, again, the computational gain is de facto obvious [since the costly production of the reference table is not repeated], and, to repeat myself, makes the comparison with methods that do require a massive number of simulations from scratch massively in favour of the prepaid option, I do not see a convenient way of recycling these prepaid simulations for another setting, that is, when some experimental factors, sample size or collection, or even just the priors, do differ. Again, I may be missing the point, especially in a specific context like repeated psychological experiments.

While this may have some applications in reproducibility (but maybe not, if the goal is in fact to detect cherry-picking), I see very little use in repeating the same statistical model on different datasets. Even repeating observations will require additional nuisance parameters and possibly perturb the likelihood and/or posterior to large extents.

a book and three chapters on ABC

Posted in Statistics on January 9, 2019 by xi'an

In connection with our handbook on mixtures being published, here are three chapters I contributed to from the Handbook of ABC, edited by Scott Sisson, Yanan Fan, and Mark Beaumont:

6. Likelihood-free Model Choice, by J.-M. Marin, P. Pudlo, A. Estoup and C.P. Robert

12. Approximating the Likelihood in ABC, by C. C. Drovandi, C. Grazian, K. Mengersen and C.P. Robert

17. Application of ABC to Infer about the Genetic History of Pygmy Hunter-Gatherers Populations from Western Central Africa, by A. Estoup, P. Verdu, J.-M. Marin, C. Robert, A. Dehne-Garcia, J.-M. Cornuet and P. Pudlo

a good start in Series B!

Posted in Books, pictures, Statistics, University life on January 5, 2019 by xi'an

Just received the great news for the turn of the year that our paper on ABC using Wasserstein distance was accepted in Series B! Inference in generative models using the Wasserstein distance, written by Espen Bernton, Pierre Jacob, Mathieu Gerber, and myself, bypasses the (nasty) selection of summary statistics in ABC by considering the Wasserstein distance between observed and simulated samples. It focuses in particular on non-iid cases like time series in what I find fairly innovative ways. I am thus very glad the paper is going to appear in JRSS B, as it has methodological consequences that should appeal to the community at large.
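For readers who want to see what replacing summaries by a distance between samples looks like in the simplest possible case, here is a sketch of rejection ABC with the Wasserstein distance for univariate iid data, where the distance between two equal-size samples reduces to matching sorted values. The toy Normal location model, the uniform prior, and the 2% acceptance rate are assumptions of mine; the multivariate and time series devices of the paper are not reproduced here.

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """p-Wasserstein distance between two equal-size univariate samples."""
    x, y = np.sort(x), np.sort(y)        # optimal coupling in 1-D matches order statistics
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

def wasserstein_abc(y_obs, n_draws=5000, keep=0.02, seed=3):
    """Rejection ABC keeping the prior draws with smallest Wasserstein distance."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-5.0, 5.0, size=n_draws)          # prior draws (assumed prior)
    dist = np.array([
        wasserstein_1d(y_obs, t + rng.standard_normal(y_obs.size))  # toy model N(theta, 1)
        for t in theta
    ])
    cutoff = np.quantile(dist, keep)
    return theta[dist <= cutoff]

y_obs = 2.0 + np.random.default_rng(4).standard_normal(100)
accepted = wasserstein_abc(y_obs)
print(accepted.mean(), accepted.std())
```

No summary statistic is chosen anywhere: the whole empirical distribution of the simulated sample is compared with the observed one.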