“…many well-known learning-algorithms, such as those used in optimization, deep learning, and machine learning in general, can now be derived directly following the above scheme using a single algorithm”
The One World ABC webinar today was delivered by Emtiyaz Khan (RIKEN) on the Bayesian Learning Rule, following the Khan and Rue 2021 arXival on Bayesian learning. (It had a great intro featuring a video of the speaker’s daughter learning about the purpose of a ukulele in her first year!) The paper argues for a Bayesian interpretation/version of gradient descent algorithms, starting with Zellner’s (1988, the year I first met him!) identity that the posterior is the solution to

\hat{q} = \arg\min_{q}\; \mathbb{E}_q\big[-\log \ell(\theta)\big] + \mathrm{KL}\big(q\,\|\,\pi\big)
where ℓ is the likelihood and π the prior. This identity can be generalised by replacing the likelihood with an arbitrary loss function (also dependent on the data) and by restricting the candidate posterior to an exponential family, just as in variational Bayes, ending up with a posterior adapted to this target (in the KL sense). The optimal hyperparameter or pseudo-hyperparameter of this approximation can then be recovered by a (natural) gradient algorithm, which also recovers stochastic gradient descent and Newton’s method as special cases. While constructing a prior out of a loss function would have pleased the late Herman Rubin, this is not the case, but rather an approach to deriving a generalised Bayes distribution within a parametric family, including mixtures of Gaussians. At some point in the talk, the uncertainty endemic to the Bayesian approach seeped back into the picture, but since most of the intuition came from machine learning, I was somewhat lost as to the nature of this uncertainty.
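To make the connection with standard optimisers a bit more concrete, here is a minimal numpy sketch (mine, not the authors’ implementation) of the rule for a Gaussian candidate q = N(m, S⁻¹): expectations under q are crudely replaced by evaluations at the mean, so that keeping the precision S fixed gives back plain gradient descent on the loss, while adapting S yields a Newton-like update. The toy logistic loss, the ridge term, the step size ρ, and the delta-style approximation are all my own simplifications.

import numpy as np

# Toy data for a regularised logistic-regression loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=200) > 0).astype(float)

def loss_grad_hess(theta):
    """Gradient and Hessian of the negative log-likelihood plus a small ridge term."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    grad = X.T @ (p - y) + 0.1 * theta
    hess = X.T @ (X * (p * (1 - p))[:, None]) + 0.1 * np.eye(len(theta))
    return grad, hess

def bayesian_learning_rule(steps=100, rho=0.1, adapt_precision=True):
    """Crude sketch of the rule for a Gaussian candidate q = N(m, S^{-1}).

    Expectations under q are replaced by evaluations at the mean m, so with a
    fixed precision the mean update is plain (preconditioned) gradient descent,
    and with an adapted precision it becomes a Newton-like step.
    """
    d = X.shape[1]
    m, S = np.zeros(d), np.eye(d)               # mean and precision of q
    for _ in range(steps):
        grad, hess = loss_grad_hess(m)          # stand-ins for E_q[∇ℓ] and E_q[∇²ℓ]
        if adapt_precision:
            S = (1 - rho) * S + rho * hess      # moving-average update of the precision
        m = m - rho * np.linalg.solve(S, grad)  # precision-preconditioned mean update
    return m, S

m_hat, S_hat = bayesian_learning_rule()
print("approximate posterior mean:", m_hat)

The point of the sketch is only to show the mechanics: the same two-line update, read with different choices of candidate family and approximation of the expectations, collapses onto familiar optimisation algorithms.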