Archive for Statistics

frontier of simulation-based inference

Posted in Books, Statistics, University life on June 11, 2020 by xi'an

“This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, ‘The Science of Deep Learning,’ held March 13–14, 2019, at the National Academy of Sciences in Washington, DC.”

A paper by Kyle Cranmer, Johann Brehmer, and Gilles Louppe just appeared in PNAS on the frontier of simulation-based inference. It sounds more like an opinion piece than a research paper producing new input. Or at least like a review, providing a quick introduction to simulators, inference, and ABC. It states the shortcomings of simulation-based inference as threefold:

  1. costly, since it requires a large number of simulated samples;
  2. losing information through the use of insufficient summary statistics or poor non-parametric approximations of the sampling density;
  3. wasteful, as requiring new computational efforts for new datasets, primarily for ABC, whereas learning the likelihood function (as a function of both the parameter θ and the data x) is only done once.

And the difficulties increase with the dimension of the data. While the points made above are correct, I want to note that ideally ABC (and Bayesian inference as a whole) only depends on a one-dimensional observation, which is the likelihood value. Or, more practically, that it only depends on the distance from the observed data to the simulated data. (Possibly the Wasserstein distance between the cdfs.) And that, somewhat unrealistically, ABC could store the reference table once and for all. Point 3 can also be debated in that the effort of learning an approximation can only be amortized when exactly the same model is re-employed with new data, which is likely in industrial applications but less so in scientific investigations, I would think. About point 2, the paper misses part of the ABC literature on selecting summary statistics, e.g., the culling afforded by random forests ABC, or the earlier use of the score function in Martin et al. (2019).
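To make the remark on the reference table concrete, here is a minimal sketch of rejection ABC in which the table of simulated (θ, summary) pairs is built once and then reused for two different observed datasets. Everything specific in it is an assumption made for illustration and not taken from the paper: the toy Normal simulator, the N(0, 3²) prior, the sample mean as summary statistic, the absolute-difference distance, and the 1% acceptance rate.

```python
import numpy as np

def build_reference_table(n_sims, n_obs, rng):
    """Simulate (theta, summary) pairs once. Toy model (assumed): x_i ~ N(theta, 1)."""
    theta = rng.normal(0.0, 3.0, size=n_sims)                 # draws from the assumed prior
    x = rng.normal(theta[:, None], 1.0, size=(n_sims, n_obs)) # one pseudo-dataset per theta
    return theta, x.mean(axis=1)                              # summary statistic: sample mean

def abc_posterior_sample(theta, summaries, observed_summary, keep=0.01):
    """Keep the prior draws whose simulated summary is closest to the observed one."""
    dist = np.abs(summaries - observed_summary)  # ABC only sees the data through this distance
    n_keep = max(1, int(keep * len(theta)))
    return theta[np.argsort(dist)[:n_keep]]

rng = np.random.default_rng(0)
theta, summaries = build_reference_table(100_000, 20, rng)   # built once

# Two different observed datasets reuse the same stored table.
x_obs_a = rng.normal(1.5, 1.0, size=20)
x_obs_b = rng.normal(-0.5, 1.0, size=20)
post_a = abc_posterior_sample(theta, summaries, x_obs_a.mean())
post_b = abc_posterior_sample(theta, summaries, x_obs_b.mean())
print(post_a.mean(), post_b.mean())
```

The reuse is only legitimate because both datasets come from exactly the same model with the same sample size, which is the restriction on amortization discussed above.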

The paper then makes a case for using machine-, active-, and deep-learning advances to overcome those blocks. Overlapping with other recent publications and talks (like Dennis on One World ABC’minar!). Once again presenting machine-learning techniques such as normalizing flows as more efficient than traditional non-parametric estimators. Of which I remain unconvinced without deeper arguments [than the repeated mention of powerful machine-learning techniques] on the convergence rates of these estimators (rather than extolling the super-powers of neural nets).

“A classifier is trained using supervised learning to discriminate two sets of data, although in this case both sets come from the simulator and are generated for different parameter points θ⁰ and θ¹. The classifier output function can be converted into an approximation of the likelihood ratio between θ⁰ and θ¹ (…) learning the likelihood or posterior is an unsupervised learning problem, whereas estimating the likelihood ratio through a classifier is an example of supervised learning and often a simpler task.”

The above comment is closely connected to the approach set by Geyer in 1994 and expanded by Gutmann and Hyvärinen in 2012. Interestingly (at least from my narrow statistician's viewpoint!), the discussion about using these different types of approximation to the likelihood, and hence to the resulting Bayesian inference, never engages in a quantification of the approximation, or even broaches the potential for inconsistent inference unlocked by using fake likelihoods. While insisting on the information loss brought by using summary statistics.
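The conversion mentioned in the quote can be sketched on a toy case. With balanced samples from θ⁰ (label 0) and θ¹ (label 1), an optimal classifier output s(x) satisfies s(x)/(1−s(x)) = p(x|θ¹)/p(x|θ⁰), so the fitted logit is an estimate of the log-likelihood ratio. The code below is only an illustration under assumptions of mine: a Normal(θ, 1) simulator whose exact ratio is known, and a plain logistic regression fitted by Newton–Raphson standing in for the neural classifiers discussed in the paper.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Logistic regression with intercept, fitted by Newton-Raphson iterations."""
    X = np.column_stack([np.ones_like(x), x])
    w = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))            # current classifier output s(x)
        grad = X.T @ (y - p)                        # gradient of the log-likelihood
        hess = (X * (p * (1.0 - p))[:, None]).T @ X # negative Hessian
        w += np.linalg.solve(hess, grad)
    return w

rng = np.random.default_rng(1)
theta0, theta1, n = 0.0, 1.0, 50_000               # assumed parameter points
x0 = rng.normal(theta0, 1.0, size=n)               # simulated under theta0, label 0
x1 = rng.normal(theta1, 1.0, size=n)               # simulated under theta1, label 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(n), np.ones(n)])
w = fit_logistic(x, y)

def log_ratio_hat(x_new):
    """Estimated log p(x|theta1)/p(x|theta0): the logit of the classifier."""
    return w[0] + w[1] * np.asarray(x_new)

def log_ratio_true(x_new):
    """Exact log-likelihood ratio for the toy Normal model."""
    return (theta1 - theta0) * np.asarray(x_new) - 0.5 * (theta1**2 - theta0**2)

grid = np.linspace(-2.0, 3.0, 6)
print(np.column_stack([grid, log_ratio_hat(grid), log_ratio_true(grid)]))
```

Here the classifier family contains the true log-ratio (which is linear in x), so the approximation error is pure sampling noise; with a misspecified or under-trained classifier there is no such guarantee, which is the point about quantifying the approximation.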

“Can the outcome be trusted in the presence of imperfections such as limited sample size, insufficient network capacity, or inefficient optimization?”

Interestingly [all the more because the paper is classified as statistics], the above shows that the statistical question is set instead in terms of numerical error(s). With proposals to address it ranging from (unrealistic) parametric bootstrap to some forms of GANs.

one of the surprising aspects of the analyses and commentaries on the Covid-19 epidemic is the absence of statistics

Posted in Statistics, University life on May 6, 2020 by xi'an

From one French demographer (INED) in Le Monde [my translation], with a clustering of French departments into three classes [the figures on the above map are the lags after the first death in Haut-Rhin]:

One of the surprising aspects of the analyses and commentaries on the Covid-19 epidemic is the absence of statistics. Every evening, however, we are bombarded with figures, and many sites, from Public Health France (SpF) to Johns-Hopkins University (Maryland), abound in data.

But a number carries a meaning only in reference to other figures. This is where the real statistics start. However, apart from comparing the number of contagions and deaths by country and date, little has been learned from the data, which could provide useful information on the nature and progression of the epidemic (…)

We can see that the diversity of close contacts is one of the keys to the evolution of the epidemic. Instead of reasoning on abstract coefficients such as the famous average number R₀ of contagions per person, we should be able to delve into the details of these contagions. We see here that traffic axes, institutions and housing probably occupy a strategic position towards an explanation.

This analysis is inevitably limited by the nature of the data and their possible faults. It would be useful to collect more detailed information on the nature of the contacts of each new case of contagion and to analyze it, or even to carry out random surveys with Covid-19 tests; in a word, to do statistics.

assistant/associate professor position in statistics/machine-learning at ENSAE

Posted in pictures, Statistics, Travel, University life on March 10, 2020 by xi'an

ENSAE (my Alma Mater) is opening a new position for next semester in statistics and/or machine-learning. At the Assistant Professor level, the position is for an initial three-year term, renewable for another three years, before the tenure evaluation. The school is located on the Université Paris-Saclay campus and only teaches at the Master and PhD levels. The deadline for application is 31 March 2020. Details and contacts on the call page.

summer internships at Warwick

Posted in Kids, pictures, Statistics, Travel, University life on December 16, 2019 by xi'an

Xmas tree at UCL, with a special gift

Posted in Books, pictures, Statistics, Travel, University life on November 26, 2019 by xi'an

Ph.D. students at UCL Statistics have made this Xmas tree out of bound and unbound volumes of statistics journals, not too hard to spot (especially the Current Indexes which I abandoned when I left my INSEE office a few years ago). An invisible present under the tree is the opening of several positions, namely two permanent lectureships and two three-year research fellowships, all in Statistics or Applied Probability, with the fellowship deadline being the 1st of December 2019!