**C**ongrats to Francis Bach, freshly nominated to the French Academy of Sciences, joining Stéphane Mallat²⁰¹⁴ and Éric Moulines²⁰¹⁷ as data science academicians!

## Archive for INRIA

## Francis Bach at the Académie des Sciences

Posted in Statistics with tags Académie des Sciences, ENS, France, Francis Bach, INRIA, PSL on April 8, 2020 by xi'an

## double descent

Posted in Books, Statistics, University life with tags double descent, France, Gare de Lyon, INRIA, machine learning, neural network, Paris, randomisation, Seine, SMILE seminar, stochastic gradient descent, training versus testing on November 7, 2019 by xi'an

**L**ast Friday, I [and a few hundred others!] went to the SMILE (Statistical Machine Learning in Paris) seminar where Francis Bach was giving a talk. (With a pleasant ride from Dauphine along the Seine river.) Francis was talking about the double descent phenomenon observed in recent papers by Belkin et al. (2018, 2019), and Mei & Montanari (2019). (As the seminar room at INRIA was quite crowded and as I was sitting cross-legged on the floor close to the screen, I took pictures of a few slides from below!) The phenomenon is that the usual U curve warning about over-fitting, reproduced in most statistics and machine-learning courses, can under the right circumstances be followed by a second decrease in the testing error when the number of features goes beyond the number of observations. This is rather puzzling and counter-intuitive, so I briefly checked the 2019 [8 pages] article by Belkin et al., who study two examples, including a standard “large p small n” Gaussian regression, where the authors state that

“However, as p grows beyond n, the test risk again decreases, provided that the model is fit using a suitable inductive bias (e.g., least norm solution).”

One explanation [I found after checking the paper] is that the variates (features) in the regression are selected at random rather than in an optimal sequential order. Double descent is missing with interpolating and deterministic estimators. Hence, in principle, all candidate variates need to be included to achieve the minimal averaged error. The infinite spike occurs when the number p of variates is near the number n of observations. (The expectation also accounts for the randomisation in T, the random subset of selected variates. A randomisation that remains an unclear feature in this framework…)
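To make the spike and the second descent more concrete, here is a minimal simulation sketch in the spirit of the Gaussian regression example, not a reproduction of the paper's experiments. The function name `simulated_test_risk`, the sizes n=20 and D=60, the noise level, and the equal-weight coefficient vector are all assumptions of mine. A random subset T of p among D variates is drawn, the regression is fit by least squares on these variates (which returns the least norm solution once p exceeds n), and the test risk is averaged over replications.

```python
import numpy as np


def simulated_test_risk(p, n=20, D=60, sigma=0.5, n_rep=200, seed=0):
    """Monte Carlo average of the test risk when regressing on p random variates.

    Assumed toy setting: x ~ N(0, I_D), y = x'beta + sigma * noise,
    with all D coefficients equal and ||beta|| = 1.
    """
    rng = np.random.default_rng(seed)
    beta = np.ones(D) / np.sqrt(D)
    risks = np.empty(n_rep)
    for r in range(n_rep):
        # random subset T of p variates, drawn without replacement
        T = rng.choice(D, size=p, replace=False)
        X = rng.standard_normal((n, D))
        y = X @ beta + sigma * rng.standard_normal(n)
        # pseudo-inverse: ordinary least squares when p <= n,
        # least norm interpolating solution when p > n
        coef = np.linalg.pinv(X[:, T]) @ y
        beta_hat = np.zeros(D)
        beta_hat[T] = coef
        # for a new x ~ N(0, I_D), E[(y - x'beta_hat)^2] is available in closed form
        risks[r] = np.sum((beta - beta_hat) ** 2) + sigma ** 2
    return float(risks.mean())


if __name__ == "__main__":
    for p in (5, 10, 15, 20, 25, 30, 40, 50, 60):
        print(p, round(simulated_test_risk(p), 2))
```

Since the expected risk is infinite when p is near n, the Monte Carlo average at p=20 is unstable from one seed to the next, but it should sit well above the values on either side, with the risk decreasing again as p grows towards D.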