Archive for stochastic gradient

AABI9 tidbits [& misbits]

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , on December 10, 2019 by xi'an

Today’s Advances in Approximate Bayesian Inference symposium, organised by Thang Bui, Adji Bousso Dieng, Dawen Liang, Francisco Ruiz, and Cheng Zhang, took place in front of Vancouver Harbour (and the tentalising ski slope at the back) and saw more than 400 participants, drifting away from the earlier versions which had a stronger dose of ABC and much fewer participants. There were students’ talks in a fair proportion, as well (and a massive number of posters). As of below, I took some notes during some of the talks with no pretense at exhaustivity, objectivity or accuracy. (This is a blog post, remember?!) Overall I found the day exciting (to the point I did not suffer at all from the usal naps consecutive to very short nights!) and engaging, with a lot of notions and methods I had never heard about. (Which shows how much I know nothing!)

The first talk was by Michalis Titsias, Gradient-based Adaptive Markov Chain Monte Carlo (jointly with Petros Dellaportas) involving as its objective function the multiplication of the variance of the move and of the acceptance probability, with a proposed adaptive version merging gradients, variational Bayes, neurons, and two levels of calibration parameters. The method advocates using this construction in a burnin phase rather than continuously, hence does not require advanced Markov tools for convergence assessment. (I found myself less excited by adaptation than earlier, maybe because it seems like switching one convergence problem for another, with additional design choices to be made.)The second talk was by Jakub Swiatkowsk, The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks, involving mean field approximation in variational inference (loads of VI at this symposium!), meaning de facto searching for a MAP estimator, and reminding me of older factor analysis and other analyse de données projection methods, except it also involved neural networks (what else at NeurIPS?!)The third talk was by Michael Gutmann, Robust Optimisation Monte Carlo, (OMC) for implicit data generated models (Diggle & Graton, 1982), an ABC talk at last!, using a formalisation through the functional representation of the generative process and involving derivatives of the summary statistic against parameter, in that sense, with the (Bayesian) random nature of the parameter sample only induced by the (frequentist) randomness in the generative transform since a new parameter “realisation” is obtained there as the one providing minimal distance between data and pseudo-data, with no uncertainty or impact of the prior. The Jacobian of this summary transform (and once again a neural network is used to construct the summary) appears in the importance weight, leading to OMC being unstable, beyond failing to reproduce the variability expressed by the regular posterior or even the ABC posterior. It took me a while to wonder `where is Wally?!’ (the prior) as it only appears in the importance weight.

The fourth talk was by Sergey Levine, Reinforcement Learning, Optimal , Control, and Probabilistic Inference, back to Kullback-Leibler as the objective function, with linkage to optimal control (with distributions as actions?), plus again variational inference, producing an approximation in sequential settings. This sounded like a type of return of the MaxEnt prior, but the talk pace was so intense that I could not follow where the innovations stood.

The fifth talk was by Iuliia Molchanova, on Structured Semi-Implicit Variational Inference, from BAyesgroup.ru (I did not know of a Bayesian group in Russia!, as I was under the impression that Bayesian statistics were under-represented there, but apparently the situation is quite different in machine learning.) The talk brought an interesting concept of semi-implicit variational inference, exploiting some form of latent variables as far as I can understand, using mixtures of Gaussians.

The sixth talk was by Rianne van den Berg, Normalizing Flows for Discrete Data, and amounted to covering three papers also discussed in NeurIPS 2019 proper, which I found somewhat of a suboptimal approach to an invited talk, as it turned into a teaser for following talks or posters. But the teasers it contained were quite interesting as they covered normalising flows as integer valued controlled changes of variables using neural networks about which I had just became aware during the poster session, in connection with papers of Papamakarios et al., which I need to soon read.

The seventh talk was by Matthew Hoffman: Langevin Dynamics as Nonparametric Variational Inference, and sounded most interesting, both from title and later reports, as it was bridging Langevin with VI, but I alas missed it for being “stuck” in a tea-house ceremony that lasted much longer than expected. (More later on that side issue!)

After the second poster session (with a highly original proposal by Radford Neal towards creating  non-reversibility at the level of the uniform generator rather than later on), I thus only attended Emily Fox’s Stochastic Gradient MCMC for Sequential Data Sources, which superbly reviewed (in connection with a sequence of papers, including a recent one by Aicher et al.) error rate and convergence properties of stochastic gradient estimator methods there. Another paper I need to soon read!

The one before last speaker, Roman Novak, exposed a Python library about infinite neural networks, for which I had no direct connection (and talks I have always difficulties about libraries, even without a four hour sleep night) and the symposium concluded with a mild round-table. Mild because Frank Wood’s best efforts (and healthy skepticism about round tables!) to initiate controversies, we could not see much to bite from each other’s viewpoint.

MCM17 snapshots

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , on July 5, 2017 by xi'an

At MCM2017 today, Radu Craiu presented a talk on adaptive Metropolis-within-Gibbs, using a family of proposals for each component of the target and weighting them by jumping distance. And managing the adaptation from the selection rate rather than from the acceptance rate as we did in population Monte Carlo. I find the approach quite interesting in that adaptation and calibration of Metropolis-within-Gibbs is quite challenging due to the conditioning, i.e., the optimality of one scale is dependent on the other components. Some of the graphs produced by Radu during the talk showed a form of local adaptivity that seemed promising. This raised a question I could not ask for lack of time, namely that with a large enough collection of proposals, it is unclear why this approach provides a gain compared with particle, sequential or population Monte Carlo algorithms. Indeed, when there are many parallel proposals, clouds of particles can be generated from all proposals in proportion to their appeal and merged together in an importance manner, leading to an easier adaptation. As it went, the notion of local scaling also reflected in Mylène Bédard’s talk on another Metropolis-within-Gibbs study of optimal rates. The other interesting sessions I attended were the ones on importance sampling with stochastic gradient optimisation, organised by Ingmar Schuster, and on sequential Monte Carlo, with a divide-and-conquer resolution through trees by Lindsten et al. I had missed.

the invasion of the stochastic gradients

Posted in Statistics with tags , , , , , , , , , on May 10, 2017 by xi'an

Within the same day, I spotted three submissions to arXiv involving stochastic gradient descent, that I briefly browsed on my trip back from Wales:

  1. Stochastic Gradient Descent as Approximate Bayesian inference, by Mandt, Hoffman, and Blei, where this technique is used as a type of variational Bayes method, where the minimum Kullback-Leibler distance to the true posterior can be achieved. Rephrasing the [scalable] MCMC algorithm of Welling and Teh (2011) as such an approximation.
  2. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent, by Arnak Dalalyan, which establishes a convergence of the uncorrected Langevin algorithm to the right target distribution in the sense of the Wasserstein distance. (Uncorrected in the sense that there is no Metropolis step, meaning this is a Euler approximation.) With an extension to the noisy version, when the gradient is approximated eg by subsampling. The connection with stochastic gradient descent is thus tenuous, but Arnak explains the somewhat disappointing rate of convergence as being in agreement with optimisation rates.
  3. Stein variational adaptive importance sampling, by Jun Han and Qiang Liu, which relates to our population Monte Carlo algorithm, but as a non-parametric version, using RKHS to represent the transforms of the particles at each iteration. The sampling method follows two threads of particles, one that is used to estimate the transform by a stochastic gradient update, and another one that is used for estimation purposes as in a regular population Monte Carlo approach. Deconstructing into those threads allows for conditional independence that makes convergence easier to establish. (A problem we also hit when working on the AMIS algorithm.)

MCM 2017

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , on February 10, 2017 by xi'an

Je reviendrai à Montréal, as the song by Robert Charlebois goes, for the MCM 2017 meeting there, on July 3-7. I was invited to give a plenary talk by the organisers of the conference . Along with

Steffen Dereich, WWU Münster, Germany
Paul Dupuis, Brown University, Providence, USA
Mark Girolami, Imperial College London, UK
Emmanuel Gobet, École Polytechnique, Palaiseau, France
Aicke Hinrichs, Johannes Kepler University, Linz, Austria
Alexander Keller, NVIDIA Research, Germany
Gunther Leobacher, Johannes Kepler University, Linz, Austria
Art B. Owen, Stanford University, USA

Note that, while special sessions are already selected, including oneon Stochastic Gradient methods for Monte Carlo and Variational Inference, organised by Victor Elvira and Ingmar Schuster (my only contribution to this session being the suggestion they organise it!), proposals for contributed talks will be selected based on one-page abstracts, to be submitted by March 1.

Michael Jordan’s seminar in Paris next week

Posted in Statistics, University life with tags , , , , , on June 3, 2016 by xi'an

Next week, on June 7, at 4pm, Michael will give a seminar at INRIA, rue du Charolais, Paris 12 (map). Here is the abstract:

A Variational Perspective on Accelerated Methods in Optimization

Accelerated gradient methods play a central role in optimization,achieving optimal rates in many settings. While many generalizations and extensions of Nesterov’s original acceleration method have been proposed,it is not yet clear what is the natural scope of the acceleration concept.In this paper, we study accelerated methods from a continuous-time perspective. We show that there is a Lagrangian functional that we call the Bregman Lagrangian which generates a large class of accelerated methods in continuous time, including (but not limited to) accelerated gradient descent, its non-Euclidean extension, and accelerated higher-order gradient methods. We show that the continuous-time limit of all of these methods correspond to travelling the same curve in space time at different speeds, and in this sense the continuous-time setting is the natural one for understanding acceleration.  Moreover, from this perspective, Nesterov’s technique and many of its generalizations can be viewed as a systematic way to go from the continuous-time curves generated by the Bregman Lagrangian to a family of discrete-time accelerated algorithms. [Joint work with Andre Wibisono and Ashia Wilson.]

(Interested readers need to register to attend the lecture.)

Approximate Maximum Likelihood Estimation

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , on September 21, 2015 by xi'an

linz3Bertl et al. arXived last July a paper on a maximum likelihood estimator based on an alternative to ABC techniques. And to indirect inference. (One of the authors in et al. is Andreas Futschik whom I visited last year in Linz.) Paper that I only spotted when gathering references for a reading list on ABC… The method is related to the “original ABC paper” of Diggle and Gratton (1984) which, parallel to Rubin (1984), contains in retrospect the idea of ABC methods. The starting point is stochastic approximation, namely the optimisation of a function of a parameter θ when written as an expectation of a random variable Y, E[Y|θ], as in the Kiefer-Wolfowitz algorithm. However, in the case of the likelihood function, there is rarely an unbiased estimator and the authors propose instead to use a kernel density estimator of the density of the summary statistic. This means that, at each iteration of the Kiefer-Wolfowitz algorithm, two sets of observations and hence of summary statistics are simulated and two kernel density estimates derived, both to be applied to the observed summary. The sequences underlying the Kiefer-Wolfowitz algorithm are taken from (the excellent optimisation book of) Spall (2003). Along with on-the-go adaptation and convergence test.

The theoretical difficulty in this extension is however that the kernel density estimator is not unbiased and thus that, rigorously speaking, the validation of the Kiefer-Wolfowitz algorithm does not apply here. On the practical side, the need for multiple starting points and multiple simulations of pseudo-samples may induce considerable time overload. Especially if  bootstrap is used to evaluate the precision of the MLE approximation. Besides normal and M/G/1 queue examples, the authors illustrate the approach on a population genetic dataset of Borneo and Sumatra orang-utans. With 5 parameters and 28 summary statistics. Which thus means using a kernel density estimator in dimension 28, a rather perilous adventure..!

Bangalore workshop [ಬೆಂಗಳೂರು ಕಾರ್ಯಾಗಾರ]

Posted in pictures, Running, Statistics, Travel, University life, Wines with tags , , , , , on July 30, 2014 by xi'an

iiscFirst day at the Indo-French Centre for Applied Mathematics and the get-together (or speed-dating!) workshop. The campus of the Indian Institute of Science of Bangalore where we all stay is very pleasant with plenty of greenery in the middle of a very busy city. Plus, being at about 1000m means the temperature remains tolerable for me, to the point of letting me run in the morning.Plus, staying in a guest house in the campus also means genuine and enjoyable south Indian food.

The workshop is a mix of statisticians and of mathematicians of neurosciences, from both India and France, and we are few enough to have a lot of opportunities for discussion and potential joint projects. I gave the first talk this morning (hence a fairly short run!) on ABC model choice with random forests and, given the mixed audience, may have launched too quickly into the technicalities of the forests. Even though I think I kept the statisticians on-board for most of the talk. While the mathematical biology talks mostly went over my head (esp. when I could not resist dozing!), I enjoyed the presentation of Francis Bach of a fast stochastic gradient algorithm, where the stochastic average is only updated one term at a time, for apparently much faster convergence results. This is related with a joint work with Éric Moulines that both Éric and Francis presented in the past month. And makes me wonder at the intuition behind the major speed-up. Shrinkage to the mean maybe?