In connection with the official launch of the Alan Turing Institute (or ATI, of which Warwick is a partner), the Institute funded an ATI Scoping workshop
a week ago in Warwick around the notion(s) of intractable likelihood(s) and how this could/should fit within the themes of the Institute [hence the scoping]. This is one among many such scoping workshops taking place at all partners, as reported on the ATI website. A workshop that was quite relaxed and great fun, if only for getting together with most people (and friends) in the UK interested in the topic. But it also pointed out some new themes I had not previously thought of as related to ilike. For instance, questioning the relevance of likelihood for inference and putting forward decision theory under model misspecification, connecting with privacy and ethics [hence making intractable “good”!], introducing uncertain likelihoods, getting more into network models, RKHS as a natural summary statistic, swarms of solutions for consensus inference… (And thanks to Mark Girolami for this homage to the iconic LP of the Sex Pistols!, which I played maniacally all over 1978…) My own two cents in the discussion were mostly variations on other discussions, borrowing from ABC (and ABC slides) to call for a novel approach to approximate inference.
Keli Liu and Xiao-Li Meng completed a paper on the very nature of inference, to appear in The Annual Review of Statistics and Its Application. This paper or chapter addresses a fundamental (and foundational) question about drawing inference on a new observation based on a sample. That is, about making a prediction. To what extent should the characteristics of the sample used for that prediction resemble those of the future observation? In his 1921 book, A Treatise on Probability, Keynes thought this similarity (or individualisation) should be pushed to its extreme, which somewhat led him to conclude on the impossibility of statistics and never to return to the field again. Certainly missing the incoming possibility of comparing models and selecting variables. And not building so much on the “all models are wrong” tenet. On the contrary, classical statistics uses the entire data available and the associated model to run the prediction, including Bayesian statistics, although it is less clear how to distinguish between data and control there. Liu & Meng debate the possibility of creating controls from the data alone. Or rather “alone”, since the model behind always plays a capital role.
“Bayes and Frequentism are two ends of the same spectrum—a spectrum defined in terms of relevance and robustness. The nominal contrast between them (…) is a red herring.”
The paper makes for an exhilarating if definitely challenging read. With a highly witty writing style. If only because the perspective is unusual, to say the least!, and requires constant mental contortions to frame the assertions into more traditional terms. For instance, I first thought that Bayesian procedures were in agreement with the ultimate conditioning approach, since it conditions on the observables and nothing else (except for the model!). Upon reflection, I am not so convinced that there is such a difference with the frequentist approach in the (specific) sense that they both take advantage of the entire dataset. Either from the predictive or from the plug-in distribution. It all boils down to how one defines “control”.
“Probability and randomness, so tightly yoked in our minds, are in fact distinct concepts (…) at the end of the day, probability is essentially a tool for bookkeeping, just like the abacus.”
Some sentences from the paper made me think of ABC, even though I am not trying to bring everything back to ABC!, as drawing controls is the nature of the ABC game. ABC draws samples or controls from the prior predictive and only keeps those for which the relevant aspects (or the summary statistics) agree with those of the observed data. Which opens similar questions about the validity and precision of the resulting inference, as well as the loss of information due to the projection onto the summary statistics. While ABC is not mentioned in the paper, it can be used as a benchmark to walk through it.
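As a minimal sketch of the rejection game described above, here is what drawing and filtering controls looks like on an assumed toy normal-mean model, with the sample mean as summary statistic; all names and settings are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def abc_rejection(x_obs, n_draws=100_000, eps=0.05):
    """Toy ABC rejection sampler: normal model, sample mean as summary."""
    s_obs = x_obs.mean()                              # summary of the observed data
    theta = rng.normal(0.0, 10.0, n_draws)            # draws ("controls") from a vague prior
    pseudo = rng.normal(theta[:, None], 1.0,
                        (n_draws, x_obs.size))        # prior predictive pseudo-datasets
    s_sim = pseudo.mean(axis=1)                       # same summary on the controls
    keep = np.abs(s_sim - s_obs) < eps                # keep draws whose summary agrees
    return theta[keep]

x_obs = rng.normal(2.0, 1.0, 50)
abc_post = abc_rejection(x_obs)
print(abc_post.mean(), abc_post.std())                # crude ABC posterior mean and spread
```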
“In the words of Jack Kiefer, we need to distinguish those problems with `lucky data’ from those with `unlucky data’.”
I liked very much recalling discussions we had with George Casella and Costas Goutis in Cornell about frequentist conditional inference, with the memory of Jack Kiefer still lingering around. However, I am not so excited about the processing of models here since, from what I understand in the paper (!), the probabilistic model behind the statistical analysis must be used to some extent in producing the control case and thus cannot be truly assessed with a critical eye. For instance, of what use is the mean squared error when the model behind is unable to produce the observed data? In particular, the variability of this mean squared error is directly driven by this model. Similarly, the notion of ancillaries is completely model-dependent. In the classification diagrams opposing robustness to relevance, all methods included therein are parametric. While non-parametric types of inference could provide a reference or a calibration ruler, at the very least.
Also, by continuously and maybe a wee bit heavily referring to the doctor-and-patient analogy, the paper is somewhat confusing as to which parts are analogy and which parts are methodology, and as to which type of statistical problem is covered by the discussion (sometimes it feels like all problems and sometimes like medical trials).
“The need to deliver individualized assessments of uncertainty are more pressing than ever.”
A final question leads us to an infinite regress: if the statistician needs to turn to individualized inference, at which level of individuality should the statistician be assessed? And who is going to provide the controls then? In any case, this challenging paper is definitely worth reading by (only mature?) statisticians to ponder the nature of the game!
As mentioned in a previous post, I had not pursued actively the organisation of an “ABC in…” workshop this year. I am thus very grateful to Jukka Corander, Samuel Kaski, Ritabrata Dutta, and Michael Gutmann, all from Helsinki, for having organised the next “ABC in…” workshop in Helsinki in the most exotic way possible, namely aboard a cruise ship going between Helsinki and Stockholm, on the Baltic Sea. Hence the appropriate ABCruise nickname. It will take place from May 16 to May 18, allowing for flying [from European cities] to Helsinki on the 16th and back from Helsinki on the 18th! While this may sound like an inappropriate location for a meeting, we are constructing a complete scientific program with two days of talks [with a noon break in Stockholm], posters, and a total registration fee of 200€, including cabin and meals! (Which is clearly cheaper than having the same meeting on firm ground.) So, to all ‘Og’s readers interested in ABC topics, secure those dates in your agenda and keep posted for incoming updates on the program and the opening of registration.
By some piece of luck, I came upon the book Think Bayes: Bayesian Statistics Made Simple, written by Allen B. Downey and published by Green Tea Press [which I could relate to No Starch Press, focussing on coffee!, which published Statistics Done Wrong that I reviewed a while ago] which usually publishes programming books with fun covers. The book is available on-line for free in pdf and html formats, and I went through it during a particularly exciting administrative meeting…
“Most books on Bayesian statistics use mathematical notation and present ideas in terms of mathematical concepts like calculus. This book uses Python code instead of math, and discrete approximations instead of continuous mathematics. As a result, what would be an integral in a math book becomes a summation, and most operations on probability distributions are simple loops.”
The book is most appropriately published in this collection as most of it concentrates on Python programming, with hardly any maths formula. In some sense similar to Jim Albert’s R book. Obviously, coming from maths, and having never programmed in Python, I find the approach puzzling. But just as obviously, I am aware—both from the comments on my books and from my experience on X validated—that a large group (majority?) of newcomers to the Bayesian realm find the mathematical approach to the topic a major hindrance. Hence I am quite open to this editorial choice as it is bound to bring more people to think Bayes, or to think they can think Bayes.
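To give a flavour of the “loops instead of integrals” approach quoted above, here is a minimal sketch (mine, not code from the book) of a discrete posterior update for a coin bias over a grid of hypotheses.

```python
# Discrete Bayes update for a coin bias over a grid of hypotheses:
# sums and loops instead of integrals, in the spirit of the book.
hypotheses = [i / 100 for i in range(101)]          # grid of candidate biases p
prior = {p: 1 / len(hypotheses) for p in hypotheses}

def update(prior, heads, tails):
    posterior = {}
    for p, w in prior.items():                      # likelihood of the data under each p
        posterior[p] = w * (p ** heads) * ((1 - p) ** tails)
    total = sum(posterior.values())                 # normalising constant as a plain sum
    return {p: w / total for p, w in posterior.items()}

posterior = update(prior, heads=7, tails=3)
post_mean = sum(p * w for p, w in posterior.items())
print(round(post_mean, 3))                          # close to the Beta(8,4) mean of 2/3
```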
“…in fewer than 200 pages we have made it from the basics of probability to the research frontier. I’m very happy about that.”
The choice of operating almost exclusively through motivating examples is rather traditional in US textbooks. See e.g. Albert’s book. While it goes against my French inclination to start from theory and concepts and end up with illustrations, I can see how it operates in a programming book. But as always I fear it makes generalisations uncertain and understanding more shaky… The examples are perforce simple and far from realistic statistics issues. Hence they illustrate the use of Bayesian thinking for decision making more than for data analysis. To wit, those examples are about the Monty Hall problem and other TV games, some urn, dice, and coin models, blood testing, sport predictions, subway waiting times, height variability between men and women, SAT scores, cancer causality, a Geiger counter hierarchical model inspired by Jaynes, …, the exception being the Belly Button Biodiversity dataset in the final chapter, dealing with the (exciting) unseen species problem in an equally exciting way. This may explain why the book does not cover MCMC algorithms. And why ABC is covered through a rather artificial normal example. Which also hides some of the maths computations under the carpet.
“The underlying idea of ABC is that two datasets are alike if they yield the same summary statistics. But in some cases, like the example in this chapter, it is not obvious which summary statistics to choose.”
In conclusion, this is a very original introduction to Bayesian analysis, which I welcome for the reasons above. Of course, it is only an introduction, which should be followed by a deeper entry into the topic, and with [more] maths. In order to handle more realistic models and datasets.
“This revision represents a very nice response to the earlier round of reviews, including a significant extension in which the posterior probability of the selected model is now estimated (whereas previously this was not included). The extension is a very nice one, and I am happy to see it included.” Anonymous
Great news [at least for us], our paper on ABC model choice has been accepted by Bioinformatics! With the pleasant comment above from one anonymous referee. This occurs after quite a prolonged gestation, which actually contributed to a shift in our understanding and our implementation of the method. I am still a wee bit unhappy at the rejection by PNAS, but it paradoxically led to a more elaborate article. So all is well that ends well! Except the story is not finished, as we are still exploring the multiple usages of random forests in ABC.
While my train ride to the fabulous De Gaulle airport was so much delayed that I had less than ten minutes from jumping out of the carriage to sitting in my plane seat, I handled the run through security and the endless corridors of the airport in the allotted time, and reached Munich in time for my afternoon seminar and several discussions that prolonged into a pleasant dinner of Wiener Schnitzel and Eisbier. This was very exciting as I met physicists and astrophysicists involved in population Monte Carlo and parallel MCMC and manageable harmonic mean estimates and intractable ABC settings (because simulating the data takes eons!). I wish the afternoon could have been longer. And while this is the third time I have come to Munich, I still have not managed to see the centre of town! Or even the nearby mountains. Maybe an unsuspected consequence of the Heisenberg principle…
“The main task of this article is to construct low-dimensional and informative summary statistics for ABC methods.”
The idea in the paper “Learning Summary Statistic for ABC via Deep Neural Network”, arXived a few days ago, is to start from the raw data and build a “deep neural network” (meaning a multiple-layer neural network) to provide a non-linear regression of the parameters over the data. (There is a rather militant tone to the justification of the approach, not that unusual with proponents of deep learning approaches, I must add…) Whose calibration never seems to be an issue. The neural construct is used to produce an estimator (function) of θ, θ̂(x), which is then used as the summary statistic. Meaning, if Theorem 1 is to be taken as the proposal, that a different ABC needs to be run for every function of interest. Or, in other words, that the method is not reparameterisation invariant.
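As a rough illustration of the regression-as-summary idea (not the authors' code), here is a sketch on an assumed toy normal-mean model, using scikit-learn's MLPRegressor as a stand-in for the deep network; the fitted θ̂(x) then plays the role of a one-dimensional summary in a plain rejection step.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n_train, n_obs = 20_000, 100

# reference table: parameters from the prior, raw pseudo-datasets from the model
theta_train = rng.uniform(-5, 5, n_train)
x_train = rng.normal(theta_train[:, None], 1.0, (n_train, n_obs))

# multiple-layer network regressing theta on the raw data; its prediction
# theta_hat(x) is then treated as the summary statistic
net = MLPRegressor(hidden_layer_sizes=(64, 64, 64), max_iter=100)
net.fit(x_train, theta_train)

x_obs = rng.normal(2.0, 1.0, n_obs)
s_obs = net.predict(x_obs[None, :])

# ABC rejection on the learned summary, with the tolerance set as a quantile
theta_ref = rng.uniform(-5, 5, 100_000)
x_ref = rng.normal(theta_ref[:, None], 1.0, (100_000, n_obs))
dist = np.abs(net.predict(x_ref) - s_obs)
abc_sample = theta_ref[dist < np.quantile(dist, 0.001)]
```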
The paper claims to achieve the same optimality properties as in Fearnhead and Prangle (2012). These are however moderate optimalities, in that they are obtained for the tolerance ε equal to zero. And using the exact posterior expectation as a summary statistic, instead of a non-parametric estimate. And an infinite functional basis in Theorem 2. I thus see little added value in results like Theorem 2 and no real optimality: that the ABC distribution can be arbitrarily close to the exact posterior is not a helpful statement when implementing the method.
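For contrast, the semi-automatic construction of Fearnhead and Prangle amounts to regressing θ on (functions of) pilot simulations and using the fitted value, a cheap estimate of the posterior expectation E[θ|x], as the summary statistic; a short sketch under the same assumed toy model follows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n_pilot, n_obs = 20_000, 100

# pilot run: prior draws and simulated raw datasets (same toy normal model)
theta_pilot = rng.uniform(-5, 5, n_pilot)
x_pilot = rng.normal(theta_pilot[:, None], 1.0, (n_pilot, n_obs))

# the semi-automatic summary is the fitted linear predictor of theta given x,
# i.e. a rough approximation of the posterior expectation E[theta | x]
reg = LinearRegression().fit(x_pilot, theta_pilot)
semi_auto_summary = lambda x: reg.predict(np.atleast_2d(x))
```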
The first example in the paper is the posterior distribution associated with the Ising model, which enjoys a sufficient statistic of dimension one. The issue of generating pseudo-data from the Ising model is evacuated by a call to a Gibbs sampler, but remains an intrinsic problem as the convergence of the Gibbs sampler depends on the value of the parameter θ and especially its location wrt the critical point. Both ABC posteriors are shown to be quite close.
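For reference, here is a minimal single-site Gibbs sweep for a 2-d Ising model on a torus (illustrative only, with the one-dimensional sufficient statistic computed at the end); mixing indeed degrades as θ approaches the critical value, around 0.44 for the square lattice.

```python
import numpy as np

rng = np.random.default_rng(3)

def gibbs_sweep(spins, theta):
    """One single-site Gibbs sweep over a 2-d Ising lattice with periodic boundaries."""
    n = spins.shape[0]
    for i in range(n):
        for j in range(n):
            nb = (spins[(i - 1) % n, j] + spins[(i + 1) % n, j]
                  + spins[i, (j - 1) % n] + spins[i, (j + 1) % n])
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * theta * nb))   # P(s_ij = +1 | neighbours)
            spins[i, j] = 1 if rng.random() < p_plus else -1
    return spins

spins = rng.choice([-1, 1], size=(32, 32))
for _ in range(200):                      # burn-in sweeps; mixing slows near theta ~ 0.44
    spins = gibbs_sweep(spins, theta=0.3)
# one-dimensional sufficient statistic: sum of products over neighbouring pairs
suff_stat = np.sum(spins * np.roll(spins, 1, axis=0) + spins * np.roll(spins, 1, axis=1))
```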
The second example is the posterior distribution associated with an MA(2) model, apparently becoming a benchmark in the ABC literature. The comparison between an ABC based on the first two autocorrelations, an ABC based on the semi-automatic solution of Fearnhead and Prangle (2012) [for which collection of summaries?], and the neural network proposal, leads to the dismissal of the semi-automatic solution, with the neural net being closest to the exact posterior [with the same tolerance quantile ε for all approaches].
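To fix ideas about this benchmark, here is a sketch of simulating an MA(2) series and computing its first two empirical autocovariances as summaries, with a prior assumed uniform over the usual identifiability triangle (details illustrative, not lifted from the paper).

```python
import numpy as np

rng = np.random.default_rng(4)

def prior_draw():
    """Uniform draw of (theta1, theta2) over the MA(2) identifiability triangle."""
    while True:
        t1, t2 = rng.uniform(-2, 2), rng.uniform(-1, 1)
        if t1 + t2 > -1 and t2 - t1 > -1:
            return t1, t2

def simulate_ma2(theta1, theta2, n=200):
    """Simulate x_t = e_t + theta1 * e_{t-1} + theta2 * e_{t-2} with standard normal noise."""
    e = rng.normal(size=n + 2)
    return e[2:] + theta1 * e[1:-1] + theta2 * e[:-2]

def summaries(x):
    """First two empirical autocovariances, the usual summary statistics for MA(2)."""
    x = x - x.mean()
    return np.array([np.sum(x[1:] * x[:-1]), np.sum(x[2:] * x[:-2])]) / x.size
```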
A discussion crucially missing from the paper—from my perspective—is an accounting for size: First, what is the computing cost of fitting, calibrating, and storing a neural network for the sole purpose of constructing a summary statistic? Once the neural net is constructed, I would assume most users would see little need in pursuing the experiment any further. (This was also why we stopped at our random forest output rather than using it as a summary statistic.) Second, how do cost and performance evolve as the dimension of the parameter θ grows? I would deem it necessary to understand when the method fails, as for instance in latent variable models such as HMMs. Third, how does the size of the sample impact cost and performance? In many realistic cases where ABC applies, it is not possible to use the raw data, given its size, and summary statistics are a given. For such examples, neural networks should be compared with other ABC solutions, using the same reference table.