## Archive for Statistics

## a random day, in Paris

Posted in Statistics, University life with tags conference, IHP, Institut Henri Poincaré, journée aléatoire, probability theory, SFDS, SMAI, SMF, statistical inference, Statistics on September 28, 2022 by xi'an

## ERC descriptors

Posted in Statistics, Travel, University life with tags biostatistics, Brussels, designators, econometrics, ERC, EU, Europe, European Commission, European Research Council, grant application, machine learning, signal processing, Statistics on November 9, 2020 by xi'an

**H**ere are **the descriptors** (or keywords) validated by the European Research Council (ERC) for submitting grant proposals. The recent addition of PE1_15 in the Mathematics panel should help when submitting more methodological projects:

- *PE1_14 Mathematical statistics*
- *PE1_15 Generic statistical methodology and modelling*
- *PE1_19 Scientific computing and data processing*

even though other panels could prove equally suited for some, as in Computer Science and Informatics,

- *PE6_7 Artificial intelligence, intelligent systems, natural language processing*
- *PE6_10 Web and information systems, data management systems, information retrieval and digital libraries, data fusion*
- *PE6_11 Machine learning, statistical data processing and applications using signal processing (e.g. speech, image, video)*
- *PE6_12 Scientific computing, simulation and modelling tools*
- *PE6_13 Bioinformatics, bio-inspired computing, and natural computing*

in Systems and Communication Engineering,

*PE7_7 Signal processing*

in Integrative Biology,

- *LS2_11 Bioinformatics and computational biology*
- *LS2_12 Biostatistics*

in Prevention, Diagnosis and Treatment of Human Diseases,

- *LS7_1 Medical imaging for prevention, diagnosis and monitoring of diseases*
- *LS7_2 Medical technologies and tools (including genetic tools and biomarkers) for prevention, diagnosis, monitoring and treatment of diseases*

and in Social Sciences and Humanities,

- *SH1_6 Econometrics; operations research*
- *SH4_9 Theoretical linguistics; computational linguistics*

## frontier of simulation-based inference

Posted in Books, Statistics, University life with tags ABC, Bayesian deep learning, classification, deep learning, GANs, kernel density estimator, National Academy of Science, neural network, neural networks and learning machines, PNAS, simulation-based inference, Statistics, summary statistics, Wasserstein distance on June 11, 2020 by xi'an

“This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, ‘The Science of Deep Learning,’ held March 13–14, 2019, at the National Academy of Sciences in Washington, DC.”

**A** paper by Kyle Cranmer, Johann Brehmer, and Gilles Louppe just appeared in PNAS on the frontier of simulation-based inference. Sounding more like a position piece than a research paper producing new input. Or at least like a review. Providing a quick introduction to simulators, inference, ABC. Stating the shortcomings of simulation-based inference as threefold:

- costly, since requiring a large number of simulated samples,
- losing information through the use of insufficient summary statistics or poor non-parametric approximations of the sampling density,
- wasteful, as requiring new computational efforts for new datasets, primarily for ABC, since learning the likelihood function (as a function of both the parameter θ and the data x) is only done once.

And the difficulties increase with the dimension of the data. While the points made above are correct, I want to note that ideally ABC (and Bayesian inference as a whole) only depends on a one-dimensional observation, namely the likelihood value. Or, more practically, that it only depends on the distance from the observed data to the simulated data. (Possibly the Wasserstein distance between the cdfs.) And that, somewhat unrealistically, ABC could store the reference table once and for all. Point 3 can also be debated in that the effort of learning an approximation can only be amortized when exactly the same model is re-employed with new data, which is likely in industrial applications but less so in scientific investigations, I would think. About point 2, the paper misses part of the ABC literature on selecting summary statistics, e.g., the culling afforded by random forests ABC, or the earlier use of the score function in Martin et al. (2019).
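To make the remark on the reference table concrete, here is a minimal sketch of rejection ABC in which the table is simulated once and then reused for two different observed datasets. Everything specific in it is an assumption made for illustration and not taken from the paper: the Normal prior, the N(θ,1) toy simulator, the function names `build_reference_table` and `abc_reject`, and the choice of keeping the nearest 1% of simulations. Note also that reusing the table this way requires every new dataset to have the same sample size as the stored simulations.

```python
import numpy as np

def build_reference_table(n_sims, n_obs, rng):
    """Simulate (theta, sorted sample) pairs once, independently of any observed data."""
    # Assumed prior: theta ~ N(0, 10^2); assumed simulator: n_obs iid draws from N(theta, 1).
    thetas = rng.normal(0.0, 10.0, size=n_sims)
    sims = rng.normal(thetas[:, None], 1.0, size=(n_sims, n_obs))
    return thetas, np.sort(sims, axis=1)

def abc_reject(x_obs, thetas, sorted_sims, keep=0.01):
    """Keep the fraction `keep` of parameters whose simulated sample is closest to x_obs."""
    # For two samples of equal size, the 1-Wasserstein distance between the
    # empirical cdfs is the mean absolute difference of the order statistics.
    dist = np.mean(np.abs(sorted_sims - np.sort(x_obs)), axis=1)
    n_keep = max(1, int(keep * len(thetas)))
    nearest = np.argsort(dist)[:n_keep]
    return thetas[nearest]

rng = np.random.default_rng(0)
thetas, sorted_sims = build_reference_table(n_sims=20_000, n_obs=50, rng=rng)

# Two different "observed" datasets share the same stored table.
x_a = rng.normal(2.0, 1.0, size=50)
x_b = rng.normal(-1.0, 1.0, size=50)
post_a = abc_reject(x_a, thetas, sorted_sims)
post_b = abc_reject(x_b, thetas, sorted_sims)
print(post_a.mean(), post_b.mean())
```

The only data-dependent step is the distance computation in `abc_reject`, which is why the simulation cost is paid once in this toy setting. In realistic models the table may need to be very large for the nearest simulations to be close enough, hence the "somewhat unrealistically" above.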

The paper then makes a case for using machine-, active-, and deep-learning advances to overcome those blocks. Overlapping with other recent publications and talks (like Dennis on One World ABC’minar!). Once again presenting machine-learning techniques such as normalizing flows as more efficient than traditional non-parametric estimators. Of which I remain unconvinced without deeper arguments [than the repeated mention of powerful machine-learning techniques] on the convergence rates of these estimators (rather than extolling the super-powers of neural nets).

“A classifier is trained using supervised learning to discriminate two sets of data, although in this case both sets come from the simulator and are generated for different parameter points θ⁰ and θ¹. The classifier output function can be converted into an approximation of the likelihood ratio between θ⁰ and θ¹ (…) learning the likelihood or posterior is an unsupervised learning problem, whereas estimating the likelihood ratio through a classifier is an example of supervised learning and often a simpler task.”

The above comment is highly connected to the approach set by Geyer in 1994 and expanded in Gutmann and Hyvärinen in 2012. Interestingly, at least from my narrow statistician viewpoint!, the discussion about using these different types of approximation to the likelihood and hence to the resulting Bayesian inference never engages in a quantification of the approximation or even broaches the potential for inconsistent inference unlocked by using fake likelihoods. While insisting on the information loss brought by using summary statistics.
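As an illustration of the classifier route to the likelihood ratio described in the quote, here is a toy sketch, not the paper's implementation. It assumes a N(θ,1) simulator and swaps the neural classifier for a plain logistic regression fitted by Newton's method, which is only adequate here because the exact log-ratio is linear in x. With equal numbers of simulations from θ⁰ and θ¹, the logit of the classifier output estimates log p(x|θ¹) − log p(x|θ⁰). The function names `fit_logistic` and `exact_log_ratio` are made up for the example.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Logistic regression of labels y on scalar x (with intercept), by Newton's method."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def exact_log_ratio(x, theta0, theta1):
    """log p(x | theta1) - log p(x | theta0) for the assumed N(theta, 1) toy simulator."""
    return (theta1 - theta0) * x - 0.5 * (theta1**2 - theta0**2)

rng = np.random.default_rng(1)
theta0, theta1, n = 0.0, 1.0, 20_000

# Both training sets come from the simulator; label 0 for theta0, label 1 for theta1.
x = np.concatenate([rng.normal(theta0, 1.0, n), rng.normal(theta1, 1.0, n)])
y = np.concatenate([np.zeros(n), np.ones(n)])
beta = fit_logistic(x, y)

# With balanced classes, the logit of the classifier output estimates the log-ratio.
grid = np.linspace(-2.0, 3.0, 6)
est_log_ratio = beta[0] + beta[1] * grid
max_err = np.max(np.abs(est_log_ratio - exact_log_ratio(grid, theta0, theta1)))
print(max_err)
```

In this toy case the approximation error can be checked against the exact ratio, which is precisely the kind of quantification that is unavailable when the likelihood is intractable.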

“Can the outcome be trusted in the presence of imperfections such as limited sample size, insufficient network capacity, or inefficient optimization?”

Interestingly [all the more because the paper is classified as statistics] the above shows that the statistical question is set instead in terms of numerical error(s). With proposals to address it ranging from (unrealistic) parametric bootstrap to some forms of GANs.

## one of the surprising aspects of the analyses and commentaries on the Covid-19 epidemic is the absence of statistics

Posted in Statistics, University life with tags coronavirus epidemics, COVID-19, demographics, dynamical system, France, INED, Le Monde, Statistics on May 6, 2020 by xi'an

**F**rom one French demographer (INED) in Le Monde *[my translation]*, with a clustering of French departments into three classes [the figures on the above map are the lags after the first death in Haut-Rhin]:

One of the surprising aspects of the analyses and commentaries on the Covid-19 epidemic is the absence of statistics. Every evening, however, we are bombarded with figures, and many sites, from Public Health France (SpF) to Johns-Hopkins University (Maryland), abound in data. But a number carries a meaning only in reference to other figures. This is where the real statistics start. However, apart from comparing the number of contagions and deaths by country and date, little has been learned from the data, which could provide useful information on the nature and progression of the epidemic (…)

We can see that the diversity of close contacts is one of the keys to the evolution of the epidemic. Instead of reasoning on abstract coefficients such as the famous average number R₀ of contagions per person, we should be able to delve into the details of these contagions. We see here that traffic axes, institutions and housing probably occupy a strategic position towards an explanation. This analysis is inevitably limited by the nature of the data and their possible faults. It would be useful to collect more detailed information on the nature of the contacts of each new case of contagion and to analyze it, or even to carry out random surveys with Covid-19 tests, in a word, to do statistics.

## assistant/associate professor position in statistics/machine-learning at ENSAE

Posted in pictures, Statistics, Travel, University life with tags École Polytechnique, ENSAE, France, job opening, machine learning, Palaiseau, Paris-Saclay campus, position, Statistics, Université Paris-Saclay on March 10, 2020 by xi'an

**ENSAE** (my Alma Mater) is opening a new position for next semester in statistics and/or machine-learning. At the Assistant Professor level, the position is for an initial three-year term, renewable for another three years, before the tenure evaluation. The school is located on the Université Paris-Saclay campus, only teaches at the Master and PhD levels, and the deadline for application is 31 March 2020. Details and contacts on the call page.