Archive for Bayesian deep learning

the Bayesian learning rule [One World ABC’minar, 27 April]

Posted in Books, Statistics, University life on April 24, 2023 by xi'an

The next One World ABC seminar is taking place (on-line, requiring pre-registration) on 27 April, 9:30am UK time, with Mohammad Emtiyaz Khan (RIKEN-AIP, Tokyo) speaking about the Bayesian learning rule:

We show that many machine-learning algorithms are specific instances of a single algorithm called the Bayesian learning rule. The rule, derived from Bayesian principles, yields a wide range of algorithms from fields such as optimization, deep learning, and graphical models. This includes classical algorithms such as ridge regression, Newton’s method, and the Kalman filter, as well as modern deep-learning algorithms such as stochastic-gradient descent, RMSprop, and Dropout. The key idea in deriving such algorithms is to approximate the posterior using candidate distributions estimated by using natural gradients. Different candidate distributions result in different algorithms, and further approximations to natural gradients give rise to variants of those algorithms. Our work not only unifies, generalizes, and improves existing algorithms, but also helps us design new ones.
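
As a concrete illustration of the reduction described in this abstract, here is a minimal R sketch (mine, not the speaker's) of the simplest instance: a Gaussian candidate with fixed precision, combined with a delta approximation of the expected-loss gradient, under which the natural-gradient update of the Bayesian learning rule collapses into preconditioned gradient descent on the mean. The ridge-type quadratic loss, the fixed precision matrix, and all tuning values are assumptions of the sketch.

# toy ridge-type loss, with its gradient available in closed form
loss_grad <- function(theta, X, y, lam = 1) t(X) %*% (X %*% theta - y) + lam * theta

# Bayesian learning rule with Gaussian candidate q = N(m, S^{-1}), S held fixed,
# and the delta approximation E_q[grad] ≈ grad(m): the update on the natural
# parameters then reduces to preconditioned gradient descent on the mean m
blr_fixed_precision <- function(X, y, steps = 200, rho = 0.1) {
  d <- ncol(X)
  m <- rep(0, d)       # mean of the Gaussian candidate
  S <- diag(10, d)     # fixed precision (assumption of this sketch)
  for (i in 1:steps)
    m <- m - rho * solve(S, loss_grad(m, X, y))
  m
}

set.seed(1)
X <- matrix(rnorm(150), 50, 3)
y <- X %*% c(1, -2, 0.5) + 0.1 * rnorm(50)
blr_fixed_precision(X, y)   # converges to the ridge solution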

more air for MCMC

Posted in Books, R, Statistics on May 30, 2021 by xi'an

Aki Vehtari, Andrew Gelman, Dan Simpson, Bob Carpenter, and Paul-Christian Bürkner have just published a Bayesian Analysis paper about using an improved R factor for MCMC convergence assessment. From the early days of MCMC, convergence assessment has been a recurring (and recurrent!) question in the community, first leading to a flurry of proposals [which Kerrie, Chantal, and I reviewed in the Valencia 1998 proceedings], and then slowly disintegrating under the onslaughts of reality, namely that none could be 100% foolproof in full generality… This included the (possibly now forgotten) single-versus-multiple-chains debate between Charlie Geyer [for single] and Andrew Gelman and Don Rubin [for multiple]. The latter introduced an analysis-of-variance R factor, which remains quite popular to this day, in part because it is included in most MCMC software, like BUGS. That this R may fail to identify convergence issues, even in the more recent split version, does not come as a major surprise, since in any situation with a long-term influence of the starting distribution it may well miss (significant) parts of the posterior support. (It is thus somewhat disconcerting to me to see that the main recommendation is to move the bound on R from 1.1 to 1.01, reminding me to some extent of a recent proposal to move the null rejection boundary from 0.05 to 0.005…) Similarly, the ESS may prove a poor signal for convergence or lack thereof, especially because the approximation of the asymptotic variance relies on stationarity assumptions. While multiplying the monitoring tools (as in CODA) helps with identifying convergence issues, looking at a single convergence indicator is somewhat like looking only at a frequentist estimator! (And with greater automation comes greater responsibility, in keeping a critical perspective.)
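
For readers who have not met it, the split version mentioned above amounts to the following analysis-of-variance computation. The snippet below is my own illustrative re-implementation in base R, not the exact (rank-normalised) estimator of the Vehtari et al. paper, and the toy chains are of course made up.

split_rhat <- function(chains) {        # chains: one row per chain
  n <- ncol(chains) %/% 2
  halves <- rbind(chains[, 1:n], chains[, (n + 1):(2 * n)])  # split each chain in two
  W <- mean(apply(halves, 1, var))      # within-chain variance
  B <- n * var(rowMeans(halves))        # between-chain variance
  var_plus <- (n - 1) / n * W + B / n   # pooled variance estimate
  sqrt(var_plus / W)
}

set.seed(2)
good <- matrix(rnorm(4000), nrow = 4)   # four well-mixed chains
bad  <- good + c(0, 0, 3, 3)            # two chains stuck in another region
c(split_rhat(good), split_rhat(bad))    # about 1 versus well above 1.01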

Looking for a broader perspective, I thus wonder what we would instead need to assess the lack of convergence of an MCMC chain without much massaging of the said chain. An evaluation of the (Kullback-Leibler, Wasserstein, or other) distance between the distribution of the chain at iteration n, or across iterations, and the true target? A percentage of the mass of the posterior visited so far, which relates to estimating the normalising constant, with a relatively vast array of solutions made available in recent years? I remain perplexed and frustrated by the fact that, 30 years later, the computed values of the visited likelihoods are not better exploited, for instance through machine-learning approximations of the target, which could themselves be used to approximate the normalising constant and potential divergences from other approximations.
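
To make the first of these musings slightly more tangible, here is a toy R illustration of monitoring the 1-d Wasserstein distance between the chain so far and the target. It only works because the toy target (a standard Normal) can be sampled directly, which is exactly what is unavailable in any realistic setting, so this is a thought experiment rather than a usable diagnostic.

set.seed(3)
rw_metropolis <- function(n, scale = 0.5, start = 10) {  # N(0,1) target, poor start
  x <- numeric(n)
  x[1] <- start
  for (t in 2:n) {
    prop <- x[t - 1] + scale * rnorm(1)
    x[t] <- if (log(runif(1)) < (x[t - 1]^2 - prop^2) / 2) prop else x[t - 1]
  }
  x
}
wass1 <- function(a, b) mean(abs(sort(a) - sort(b)))   # 1-d Wasserstein, equal sample sizes
chain <- rw_metropolis(2e4)
for (n in c(100, 1000, 5000, 20000))
  print(c(n, wass1(chain[1:n], rnorm(n))))             # distance shrinks as the chain forgets its start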

frontier of simulation-based inference

Posted in Books, Statistics, University life on June 11, 2020 by xi'an

“This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, ‘The Science of Deep Learning,’ held March 13–14, 2019, at the National Academy of Sciences in Washington, DC.”

A paper by Kyle Cranmer, Johann Brehmer, and Gilles Louppe just appeared in PNAS on the frontier of simulation-based inference. Sounding more like an opinion piece than a research paper producing new input. Or at least like a review. Providing a quick introduction to simulators, inference, and ABC. Stating the shortcomings of simulation-based inference as three-fold:

  1. costly, since it requires a large number of simulated samples;
  2. losing information through the use of insufficient summary statistics or poor non-parametric approximations of the sampling density;
  3. wasteful, as it requires new computational effort for each new dataset, primarily for ABC, whereas learning the likelihood function (as a function of both the parameter θ and the data x) is only done once.

And the difficulties increase with the dimension of the data. While the points made above are correct, I want to note that ideally ABC (and Bayesian inference as a whole) only depends on a one-dimensional summary of the observation, namely the likelihood value. Or, more practically, that it only depends on the distance from the observed data to the simulated data. (Possibly the Wasserstein distance between the cdfs.) And, somewhat unrealistically, that ABC could store the reference table once and for all. Point 3 can also be debated in that the effort of learning an approximation can only be amortized when exactly the same model is re-employed with new data, which is likely in industrial applications but less so in scientific investigations, I would think. About point 2, the paper misses part of the ABC literature on selecting summary statistics, e.g., the culling afforded by random forests ABC, or the earlier use of the score function in Martin et al. (2019).
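
For concreteness, a bare-bones ABC rejection sketch in R, illustrating both remarks: the inference only sees a distance between observed and simulated summaries, and the simulated (θ, summary) pairs form a reference table that could in principle be stored once and for all. The Normal mean model, the Uniform(-5,5) prior, and the choice of summaries are all assumptions of the sketch.

set.seed(4)
sim_model <- function(theta, n = 50) rnorm(n, theta, 1)
summ <- function(x) c(mean(x), sd(x))

x_obs <- sim_model(1.5)
s_obs <- summ(x_obs)

N <- 2e4
thetas <- runif(N, -5, 5)                                   # prior draws
ref <- t(sapply(thetas, function(th) summ(sim_model(th))))  # reference table of summaries

dists <- sqrt(rowSums(sweep(ref, 2, s_obs)^2))              # distance to the observed summary
accepted <- thetas[dists < quantile(dists, 0.01)]           # keep the closest 1%
c(mean(accepted), sd(accepted))                             # crude ABC posterior location and spread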

The paper then makes a case for using machine-, active-, and deep-learning advances to overcome those blocks. Overlapping with other recent publications and talks (like Dennis's on the One World ABC'minar!). Once again presenting machine-learning techniques such as normalizing flows as more efficient than traditional non-parametric estimators, of which I remain unconvinced without deeper arguments [than the repeated mention of powerful machine-learning techniques] on the convergence rates of these estimators (rather than extolling the super-powers of neural nets).

“A classifier is trained using supervised learning to discriminate two sets of data, although in this case both sets come from the simulator and are generated for different parameter points θ⁰ and θ¹. The classifier output function can be converted into an approximation of the likelihood ratio between θ⁰ and θ¹ (…) learning the likelihood or posterior is an unsupervised learning problem, whereas estimating the likelihood ratio through a classifier is an example of supervised learning and often a simpler task.”

The above comment is highly connected to the approach set by Geyer in 1994 and expanded by Gutmann and Hyvärinen in 2012. Interestingly, at least from my narrow statistician viewpoint(!), the discussion about using these different types of approximation to the likelihood, and hence to the resulting Bayesian inference, never engages in a quantification of the approximation error, or even broaches the potential for inconsistent inference unlocked by using fake likelihoods, while insisting on the information loss brought by using summary statistics.
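
A minimal R sketch of that classifier trick, in the spirit of Geyer (1994) and Gutmann & Hyvärinen (2012): simulate under the two parameter values, fit a logistic classifier, and read the likelihood ratio off its output s(x) as s/(1-s). The unit-variance Normal simulator with mean θ is an assumption of the sketch, chosen so that the exact log-ratio is available for comparison (and so that the logistic model is well specified).

set.seed(5)
theta0 <- 0; theta1 <- 1
x0 <- rnorm(5000, theta0)                      # simulations under θ⁰
x1 <- rnorm(5000, theta1)                      # simulations under θ¹
dat <- data.frame(x = c(x0, x1), lab = rep(c(1, 0), each = 5000))  # 1 ⇔ generated under θ⁰

fit <- glm(lab ~ x, family = binomial, data = dat)   # probabilistic classifier

est_log_ratio <- function(x) {
  s <- predict(fit, newdata = data.frame(x = x), type = "response")  # P(θ⁰ | x)
  log(s / (1 - s))
}
exact_log_ratio <- function(x) ((x - theta1)^2 - (x - theta0)^2) / 2

xs <- c(-1, 0, 0.5, 2)
rbind(estimated = est_log_ratio(xs), exact = exact_log_ratio(xs))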

“Can the outcome be trusted in the presence of imperfections such as limited sample size, insufficient network capacity, or inefficient optimization?”

Interestingly [all the more because the paper is classified as statistics], the above shows that the statistical question is instead set in terms of numerical error(s), with proposals to address it ranging from the (unrealistic) parametric bootstrap to some forms of GANs.

Julyan’s talk on priors in Bayesian neural networks [cancelled!]

Posted in pictures, Statistics, Travel, University life on March 5, 2020 by xi'an

Next Friday, 13 March, at 1:30 p.m., Julyan Arbel, researcher at Inria Grenoble, will give an All about that Bayes talk at CMLA, ENS Paris-Saclay (building D’Alembert, room Condorcet, Cachan, RER stop Bagneux) on

Understanding Priors in Bayesian Neural Networks at the Unit Level

We investigate deep Bayesian neural networks with Gaussian weight priors and a class of ReLU-like nonlinearities. Bayesian neural networks with Gaussian priors are well known to induce an L², “weight decay”, regularization. Our results characterize a more intricate regularization effect at the level of the unit activations. Our main result establishes that the induced prior distribution on the units before and after activation becomes increasingly heavy-tailed with the depth of the layer. We show that first layer units are Gaussian, second layer units are sub-exponential, and units in deeper layers are characterized by sub-Weibull distributions. Our results provide new theoretical insight on deep Bayesian neural networks, which we corroborate with simulation experiments.
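
A quick simulation sketch of the result (mine, not the authors'): draw the weights from their Gaussian prior, push a fixed input through ReLU layers, and track the prior distribution of one unit per layer; its excess kurtosis grows with depth, in line with the Gaussian / sub-exponential / sub-Weibull hierarchy of the abstract. Width, depth, number of prior draws, and the kurtosis summary are choices of this sketch.

set.seed(6)
width <- 50; depth <- 5; n_draws <- 3000
draws <- matrix(0, n_draws, depth)        # prior draws of one tracked unit per layer
for (i in 1:n_draws) {
  h <- rep(1, width)                      # fixed network input
  for (l in 1:depth) {
    W <- matrix(rnorm(width^2, 0, 1 / sqrt(width)), width, width)  # Gaussian weight prior
    pre <- as.vector(W %*% h)             # pre-activations of layer l
    draws[i, l] <- pre[1]                 # keep one unit's draw
    h <- pmax(pre, 0)                     # ReLU
  }
}
apply(draws, 2, function(z) mean((z - mean(z))^4) / var(z)^2 - 3)  # excess kurtosis by layer, increasing with depth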

 

deep learning in Toulouse [post-doc position]

Posted in pictures, Travel, University life on April 25, 2019 by xi'an

An opening for an ERC post-doc position on Bayesian deep learning with Cédric Févotte in Toulouse.
