**A**s I was in Paris and free for the occasion (!), I attended the Paris Statistics seminar this afternoon, in the Latin Quarter. With a first talk by Kweku Abraham on Bayesian inverse problems set a prior on the quantity of interest, γ, rather than its transform G(γ), observed with noise. Always perturbed by the juggling of different distances, like L² versus Kullback-Leibler, in non-parametric frameworks. Reminding me of probabilistic numerics, at least in the framework, since the crux of the talk was 100% about convergence. And a second talk by Leanaïc Chizat on convex neural networks corresponding to an infinite number of neurons, with surprising properties, including implicit bias. And a third talk by Anne Sabourin on PCA for extremes. Which assumed very little on the model but more on the geometry of the distribution, like extremes being concentrated on a subspace. As I was rather tired from an intense week at Warwick, and after a weekend of reading grant applications and Biometrika submissions (!), my foggy brain kept switching to these proposals, trying to make connections with the talks, not completely inappropriately in two cases out of three. (I am afraid the same may happen tomorrow at our probability seminar on computer-based proofs!)

## Archive for stochastic gradient descent

## séminaire P de S

Posted in Books, pictures, Statistics, University life with tags Biometrika, computer-based proof, extreme value theory, Institut Henri Poincaré, neural network, Paris, probabilistic numerics, séminaire, seminar, stochastic gradient descent on February 18, 2020 by xi'an## double descent

Posted in Books, Statistics, University life with tags double descent, France, Gare de Lyon, INRIA, machine learning, neural network, Paris, randomisation, Seine, SMILE seminar, stochastic gradient descent, training versus testing on November 7, 2019 by xi'an**L**ast Friday, I [and a few hundred others!] went to the SMILE (Statistical Machine Learning in Paris) seminar where Francis Bach was giving a talk. (With a pleasant ride from Dauphine along the Seine river.) Fancis was talking about the double descent phenomenon observed in recent papers by Belkin & al. (2018, 2019), and Mei & Montanari (2019). (As the seminar room at INRIA was quite crowded and as I was sitting X-legged on the floor close to the screen, I took a few slides from below!) The phenomenon is that the usual U curve warning about over-fitting and reproduced in most statistics and machine-learning courses can under the right circumstances be followed by a second decrease in the testing error when the number of features goes beyond the number of observations. This is rather puzzling and counter-intuitive, so I briefkly checked the 2019 [8 pages] article by Belkin & al., who are studying two examples, including a standard “large p small n” Gaussian regression. where the authors state that

“However, as p grows beyond n, the test risk again decreases, provided that the model is fit using a suitable inductive bias (e.g., least norm solution). “

One explanation [I found after checking the paper] is that the variates (features) in the regression are selected at random rather than in an optimal sequential order. Double descent is missing with interpolating and deterministic estimators. Hence requiring on principle all candidate variates to be included to achieve minimal averaged error. The infinite spike is when the number p of variate is near the number n of observations. (The expectation accounts as well for the randomisation in T. Randomisation that remains an unclear feature in this framework…)

## Bayesian webinar: Bayesian conjugate gradient

Posted in Statistics with tags Bayesian Analysis, Greenwich Mean Time, ISBA, stochastic gradient descent, University of Warwick, webinar, youtube on September 25, 2019 by xi'an**B**ayesian Analysis is launching its webinar series on discussion papers! Meaning the first 90 registrants will be able to participate interactively via the Zoom Conference platform while additional registrants will be able to view the Webinar on a dedicated YouTube Channel. This fantastic initiative is starting with the Bayesian conjugate gradient method of Jon Cockayne (University of Warwick) et al., on October 2 at 4pm Greenwich time. (With available equivalences for other time zones!) I strongly support this initiative and wish it the widest possible success, as it could bring a new standard for conferences, having distant participants gathering in a nearby location to present talks and attend other talks from another part of the World, while effectively participating. An dense enough network could even see the non-stop conference emerging!

## JSM 2018 [#3]

Posted in Mountains, Statistics, Travel, University life with tags ABC, Approximate Bayesian computation, Bayesian network, Bayesian p-values, British Columbia, Canada, curse of dimensionality, JSM 2018, prior predictive, pseudo-marginal MCMC, spectral analysis, spike-and-slab prior, stochastic gradient descent, Vancouver, variational Bayes methods on August 1, 2018 by xi'an**A**s I skipped day #2 for climbing, here I am on day #3, attending JSM 2018, with a [fully Canadian!] session on (conditional) copula (where Bruno Rémillard talked of copulas for mixed data, with unknown atoms, which sounded like an impossible target!), and another on four highlights from Bayesian Analysis, (the journal), with Maria Terres defending the (often ill-considered!) spectral approach within Bayesian analysis, modelling spectral densities (Fourier transforms of correlations functions, not probability densities), an advantage compared with MCAR modelling being the automated derivation of dependence graphs. While the spectral ghost did not completely dissipate for me, the use of DIC that she mentioned at the very end seems to call for investigation as I do not know of well-studied cases of complex dependent data with clearly specified DICs. Then Chris Drobandi was speaking of ABC being used for prior choice, an idea I vaguely remember seeing quite a while ago as a referee (or another paper!), paper in BA that I missed (and obviously did not referee). Using the same reference table works (for simple ABC) with different datasets but also different priors. I did not get first the notion that the reference table also produces an evaluation of the marginal distribution but indeed the entire simulation from prior x generative model gives a Monte Carlo representation of the marginal, hence the evidence at the observed data. Borrowing from Evans’ fringe Bayesian approach to model choice by prior predictive check for prior-model conflict. I remain sceptic or at least agnostic on the notion of using data to compare priors. And here on using ABC in tractable settings.

The afternoon session was [a mostly Australian] Advanced Bayesian computational methods, with Robert Kohn on variational Bayes, with an interesting comparison of (exact) MCMC and (approximative) variational Bayes results for some species intensity and the remark that forecasting may be much more tolerant to the approximation than estimation. Making me wonder at a possibility of assessing VB on the marginals manageable by MCMC. Unless I miss a complexity such that the decomposition is impossible. And Antonietta Mira on estimating time-evolving networks estimated by ABC (which Anto first showed me in Orly airport, waiting for her plane!). With a possibility of a zero distance. Next talk by Nadja Klein on impicit copulas, linked with shrinkage properties I was unaware of, including the case of spike & slab copulas. Michael Smith also spoke of copulas with discrete margins, mentioning a version with continuous latent variables (as I thought could be done during the first session of the day), then moving to variational Bayes which sounds quite popular at JSM 2018. And David Gunawan made a presentation of a paper mixing pseudo-marginal Metropolis with particle Gibbs sampling, written with Chris Carter and Robert Kohn, making me wonder at their feature of using the white noise as an auxiliary variable in the estimation of the likelihood, which is quite clever but seems to get against the validation of the pseudo-marginal principle. *(Warning: I have been known to be wrong!)*

## JSM 2018 [#1]

Posted in Mountains, Statistics, Travel, University life with tags British Columbia, Canada, curse of dimensionality, deep learning, GANs, JSM 2018, overfitting, regularisation, sparsity, stochastic gradient descent, Vancouver on July 30, 2018 by xi'an**A**s our direct flight from Paris landed in the morning in Vancouver, we found ourselves in the unusual situation of a few hours to kill before accessing our rental and where else better than a general introduction to deep learning in the first round of sessions at JSM2018?! In my humble opinion, or maybe just because it was past midnight in Paris time!, the talk was pretty uninspiring in missing the natural question of the possible connections between the construction of a prediction function and statistics. Watching improving performances at classifying human faces does not tell much more than creating a massively non-linear function in high dimensions with nicely designed error penalties. Most of the talk droned about neural networks and their fitting by back-propagation and the variations on stochastic gradient descent. Not addressing much rather natural (?) questions about choice of functions at each level, of the number of levels, of the penalty term, or regulariser, and even less the reason why no sparsity is imposed on the structure, despite the humongous number of parameters involved. What came close [but not that close] to sparsity is the notion of dropout, which is a sort of purely automated culling of the nodes, and which was new to me. More like a sort of randomisation that turns the optimisation criterion in an average. Only at the end of the presentation more relevant questions emerged, presenting unsupervised learning as density estimation, the pivot being the generative features of (most) statistical models. And GANs of course. But nonetheless missing an explanation as to why models with massive numbers of parameters can be considered in this setting and not in standard statistics. (One slide about deterministic auto-encoders was somewhat puzzling in that it seemed to repeat the “fiducial mistake”.)

## Langevin on a wrong bend

Posted in Books, Statistics with tags CREST, Hastings-Metropolis sampler, Langevin diffusion, MALA, pseudo-marginal MCMC, scalable MCMC, stochastic gradient descent, Wasserstein distance on October 19, 2017 by xi'an**A**rnak Dalayan and Avetik Karagulyan (CREST) arXived a paper the other week on a focussed study of the Langevin algorithm [not MALA] when the gradient of the target is incorrect. With the following improvements *[quoting non-verbatim from the paper]*:

- a varying-step Langevin that reduces the number of iterations for a given Wasserstein precision, compared with recent results by e.g. Alan Durmus and Éric Moulines;
- an extension of convergence results for error-prone evaluations of the gradient of the target (i.e., the gradient is replaced with a noisy version, under some moment assumptions that do not include unbiasedness);
- a new second-order sampling algorithm termed LMCO’, with improved convergence properties.

What is particularly interesting to me in this setting is the use in all these papers of a discretised Langevin diffusion (a.k.a., random walk with a drift induced by the gradient of the log-target) without the original Metropolis correction. The results rely on an assumption of [strong?] log-concavity of the target, with “user-friendly” bounds on the Wasserstein distance depending on the constants appearing in this log-concavity constraint. And so does the adaptive step. (In the case of the noisy version, the bias and variance of the noise also matter. As pointed out by the authors, there is still applicability to scaling MCMC for large samples. Beyond pseudo-marginal situations.)

“…this, at first sight very disappointing behavior of the LMC algorithm is, in fact, continuously connected to the exponential convergence of the gradient descent.”

The paper concludes with an interesting mise en parallèle of Langevin algorithms and of gradient descent algorithms, since the convergence rates are the same.

## the invasion of the stochastic gradients

Posted in Statistics with tags approximate inference, arXiv, Euler discretisation, population Monte Carlo, RKHS, scalable MCMC, stochastic gradient, stochastic gradient descent, variational Bayes methods, Wales on May 10, 2017 by xi'an**W**ithin the same day, I spotted three submissions to arXiv involving stochastic gradient descent, that I briefly browsed on my trip back from Wales:

- Stochastic Gradient Descent as Approximate Bayesian inference, by Mandt, Hoffman, and Blei, where this technique is used as a type of variational Bayes method, where the minimum Kullback-Leibler distance to the true posterior can be achieved. Rephrasing the [scalable] MCMC algorithm of Welling and Teh (2011) as such an approximation.
- Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent, by Arnak Dalalyan, which establishes a convergence of the uncorrected Langevin algorithm to the right target distribution in the sense of the Wasserstein distance. (Uncorrected in the sense that there is no Metropolis step, meaning this is a Euler approximation.) With an extension to the noisy version, when the gradient is approximated eg by subsampling. The connection with stochastic gradient descent is thus tenuous, but Arnak explains the somewhat disappointing rate of convergence as being in agreement with optimisation rates.
- Stein variational adaptive importance sampling, by Jun Han and Qiang Liu, which relates to our population Monte Carlo algorithm, but as a non-parametric version, using RKHS to represent the transforms of the particles at each iteration. The sampling method follows two threads of particles, one that is used to estimate the transform by a stochastic gradient update, and another one that is used for estimation purposes as in a regular population Monte Carlo approach. Deconstructing into those threads allows for conditional independence that makes convergence easier to establish. (A problem we also hit when working on the AMIS algorithm.)