Archive for variational Bayes methods

AABI9 tidbits [& misbits]

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , on December 10, 2019 by xi'an

Today’s Advances in Approximate Bayesian Inference symposium, organised by Thang Bui, Adji Bousso Dieng, Dawen Liang, Francisco Ruiz, and Cheng Zhang, took place in front of Vancouver Harbour (and the tentalising ski slope at the back) and saw more than 400 participants, drifting away from the earlier versions which had a stronger dose of ABC and much fewer participants. There were students’ talks in a fair proportion, as well (and a massive number of posters). As of below, I took some notes during some of the talks with no pretense at exhaustivity, objectivity or accuracy. (This is a blog post, remember?!) Overall I found the day exciting (to the point I did not suffer at all from the usal naps consecutive to very short nights!) and engaging, with a lot of notions and methods I had never heard about. (Which shows how much I know nothing!)

The first talk was by Michalis Titsias, Gradient-based Adaptive Markov Chain Monte Carlo (jointly with Petros Dellaportas) involving as its objective function the multiplication of the variance of the move and of the acceptance probability, with a proposed adaptive version merging gradients, variational Bayes, neurons, and two levels of calibration parameters. The method advocates using this construction in a burnin phase rather than continuously, hence does not require advanced Markov tools for convergence assessment. (I found myself less excited by adaptation than earlier, maybe because it seems like switching one convergence problem for another, with additional design choices to be made.)The second talk was by Jakub Swiatkowsk, The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks, involving mean field approximation in variational inference (loads of VI at this symposium!), meaning de facto searching for a MAP estimator, and reminding me of older factor analysis and other analyse de données projection methods, except it also involved neural networks (what else at NeurIPS?!)The third talk was by Michael Gutmann, Robust Optimisation Monte Carlo, (OMC) for implicit data generated models (Diggle & Graton, 1982), an ABC talk at last!, using a formalisation through the functional representation of the generative process and involving derivatives of the summary statistic against parameter, in that sense, with the (Bayesian) random nature of the parameter sample only induced by the (frequentist) randomness in the generative transform since a new parameter “realisation” is obtained there as the one providing minimal distance between data and pseudo-data, with no uncertainty or impact of the prior. The Jacobian of this summary transform (and once again a neural network is used to construct the summary) appears in the importance weight, leading to OMC being unstable, beyond failing to reproduce the variability expressed by the regular posterior or even the ABC posterior. It took me a while to wonder `where is Wally?!’ (the prior) as it only appears in the importance weight.

The fourth talk was by Sergey Levine, Reinforcement Learning, Optimal , Control, and Probabilistic Inference, back to Kullback-Leibler as the objective function, with linkage to optimal control (with distributions as actions?), plus again variational inference, producing an approximation in sequential settings. This sounded like a type of return of the MaxEnt prior, but the talk pace was so intense that I could not follow where the innovations stood.

The fifth talk was by Iuliia Molchanova, on Structured Semi-Implicit Variational Inference, from BAyesgroup.ru (I did not know of a Bayesian group in Russia!, as I was under the impression that Bayesian statistics were under-represented there, but apparently the situation is quite different in machine learning.) The talk brought an interesting concept of semi-implicit variational inference, exploiting some form of latent variables as far as I can understand, using mixtures of Gaussians.

The sixth talk was by Rianne van den Berg, Normalizing Flows for Discrete Data, and amounted to covering three papers also discussed in NeurIPS 2019 proper, which I found somewhat of a suboptimal approach to an invited talk, as it turned into a teaser for following talks or posters. But the teasers it contained were quite interesting as they covered normalising flows as integer valued controlled changes of variables using neural networks about which I had just became aware during the poster session, in connection with papers of Papamakarios et al., which I need to soon read.

The seventh talk was by Matthew Hoffman: Langevin Dynamics as Nonparametric Variational Inference, and sounded most interesting, both from title and later reports, as it was bridging Langevin with VI, but I alas missed it for being “stuck” in a tea-house ceremony that lasted much longer than expected. (More later on that side issue!)

After the second poster session (with a highly original proposal by Radford Neal towards creating  non-reversibility at the level of the uniform generator rather than later on), I thus only attended Emily Fox’s Stochastic Gradient MCMC for Sequential Data Sources, which superbly reviewed (in connection with a sequence of papers, including a recent one by Aicher et al.) error rate and convergence properties of stochastic gradient estimator methods there. Another paper I need to soon read!

The one before last speaker, Roman Novak, exposed a Python library about infinite neural networks, for which I had no direct connection (and talks I have always difficulties about libraries, even without a four hour sleep night) and the symposium concluded with a mild round-table. Mild because Frank Wood’s best efforts (and healthy skepticism about round tables!) to initiate controversies, we could not see much to bite from each other’s viewpoint.

did variational Bayes work?

Posted in Books, Statistics with tags , , , , , , , , , on May 2, 2019 by xi'an

An interesting ICML 2018 paper by Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman I missed last summer on [the fairly important issue of] assessing the quality or lack thereof of a variational Bayes approximation. In the sense of being near enough from the true posterior. The criterion that they propose in this paper relates to the Pareto smoothed importance sampling technique discussed in an earlier post and which I remember discussing with Andrew when he visited CREST a few years ago. The truncation of the importance weights of prior x likelihood / VB approximation avoids infinite variance issues but induces an unknown amount of bias. The resulting diagnostic is based on the estimation of the Pareto order k. If the true value of k is less than ½, the variance of the associated Pareto distribution is finite. The paper suggests to conclude at the worth of the variational approximation when the estimate of k is less than 0.7, based on the empirical assessment of the earlier paper. The paper also contains a remark on the poor performances of the generalisation of this method to marginal settings, that is, when the importance weight is the ratio of the true and variational marginals for a sub-vector of interest. I find the counter-performances somewhat worrying in that Rao-Blackwellisation arguments make me prefer marginal ratios to joint ratios. It may however be due to a poor approximation of the marginal ratio that reflects on the approximation and not on the ratio itself. A second proposal in the paper focus on solely the point estimate returned by the variational Bayes approximation. Testing that the posterior predictive is well-calibrated. This is less appealing, especially when the authors point out the “dissadvantage is that this diagnostic does not cover the case where the observed data is not well represented by the model.” In other words, misspecified situations. This potential misspecification could presumably be tested by comparing the Pareto fit based on the actual data with a Pareto fit based on simulated data. Among other deficiencies, they point that this is “a local diagnostic that will not detect unseen modes”. In other words, what you get is what you see.

19 dubious ways to compute the marginal likelihood

Posted in Books, Statistics with tags , , , , , , , , , , on December 11, 2018 by xi'an

A recent arXival on nineteen different [and not necessarily dubious!] ways to approximate the marginal likelihood of a given topology of a philogeny tree that reminded me of our San Antonio survey with Jean-Michel Marin. This includes a version of the Laplace approximation called Laplus (!), accounting for the fact that branch lengths on the tree are positive but may have a MAP at zero. Using a Beta, Gamma, or log-Normal distribution instead of a Normal. For importance sampling, the proposals are derived from either the Laplus (!) approximate distributions or from the variational Bayes solution (based on an Normal product). Harmonic means are still used here despite the obvious danger, along with a defensive version that mixes prior and posterior. Naïve Monte Carlo means simulating from the prior, while bridge sampling seems to use samples from prior and posterior distributions. Path and modified path sampling versions are those proposed in 2008 by Nial Friel and Tony Pettitt (QUT). Stepping stone sampling appears like another version of path sampling, also based on a telescopic product of ratios of normalising constants, the generalised version relying on a normalising reference distribution that need be calibrated. CPO and PPD in the above table are two versions based on posterior predictive density estimates.

When running the comparison between so many contenders, the ground truth is selected as the values returned by MrBayes in a massive MCMC experiment amounting to 7.5 billions generations. For five different datasets. The above picture describes mean square errors for the probabilities of split, over ten replicates [when meaningful], the worst case being naïve Monte Carlo, with nested sampling and harmonic mean solutions close by. Similar assessments proceed from a comparison of Kullback-Leibler divergences. With the (predicatble?) note that “the methods do a better job approximating the marginal likelihood of more probable trees than less probable trees”. And massive variability for the poorest methods:

The comparison above does not account for time and since some methods are deterministic (and fast) there is little to do about this. The stepping steps solutions are very costly, while on the middle range bridge sampling outdoes path sampling. The assessment of nested sampling found in the conclusion is that it “would appear to be an unwise choice for estimating the marginal likelihoods of topologies, as it produces poor approximate posteriors” (p.12). Concluding at the Gamma Laplus approximation being the winner across all categories! (There is no ABC solution studied in this paper as the model likelihood can be computed in this setup, contrary to our own setting.)

graphe, graphons, graphez !

Posted in Books, pictures, Statistics, University life with tags , , , , , , on December 3, 2018 by xi'an

Bayes for good

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , on November 27, 2018 by xi'an

A very special weekend workshop on Bayesian techniques used for social good in many different sense (and talks) that we organised with Kerrie Mengersen and Pierre Pudlo at CiRM, Luminy, Marseilles. It started with Rebecca (Beka) Steorts (Duke) explaining [by video from Duke] how the Syrian war deaths were processed to eliminate duplicates, to be continued on Monday at the “Big” conference, Alex Volfonsky (Duke) on a Twitter experiment on the impact of being exposed to adverse opinions as depolarising (not!) or further polarising (yes), turning into network causal analysis. And then Kerrie Mengersen (QUT) on the use of Bayesian networks in ecology, through observational studies she conducted. And the role of neutral statisticians in case of adversarial experts!

Next day, the first talk of David Corlis (Peace-Work), who writes the Stats for Good column in CHANCE and here gave a recruiting spiel for volunteering in good initiatives. Quoting Florence Nightingale as the “first” volunteer. And presenting a broad collection of projects as supports to his recommendations for “doing good”. We then heard [by video] Julien Cornebise from Element AI in London telling of his move out of DeepMind towards investing in social impacting projects through this new startup. Including working with Amnesty International on Darfour village destructions, building evidence from satellite imaging. And crowdsourcing. With an incoming report on the year activities (still under embargo). A most exciting and enthusiastic talk!

Continue reading

JSM 2018 [#3]

Posted in Mountains, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on August 1, 2018 by xi'an

As I skipped day #2 for climbing, here I am on day #3, attending JSM 2018, with a [fully Canadian!] session on (conditional) copula (where Bruno Rémillard talked of copulas for mixed data, with unknown atoms, which sounded like an impossible target!), and another on four highlights from Bayesian Analysis, (the journal), with Maria Terres defending the (often ill-considered!) spectral approach within Bayesian analysis, modelling spectral densities (Fourier transforms of correlations functions, not probability densities), an advantage compared with MCAR modelling being the automated derivation of dependence graphs. While the spectral ghost did not completely dissipate for me, the use of DIC that she mentioned at the very end seems to call for investigation as I do not know of well-studied cases of complex dependent data with clearly specified DICs. Then Chris Drobandi was speaking of ABC being used for prior choice, an idea I vaguely remember seeing quite a while ago as a referee (or another paper!), paper in BA that I missed (and obviously did not referee). Using the same reference table works (for simple ABC) with different datasets but also different priors. I did not get first the notion that the reference table also produces an evaluation of the marginal distribution but indeed the entire simulation from prior x generative model gives a Monte Carlo representation of the marginal, hence the evidence at the observed data. Borrowing from Evans’ fringe Bayesian approach to model choice by prior predictive check for prior-model conflict. I remain sceptic or at least agnostic on the notion of using data to compare priors. And here on using ABC in tractable settings.

The afternoon session was [a mostly Australian] Advanced Bayesian computational methods,  with Robert Kohn on variational Bayes, with an interesting comparison of (exact) MCMC and (approximative) variational Bayes results for some species intensity and the remark that forecasting may be much more tolerant to the approximation than estimation. Making me wonder at a possibility of assessing VB on the marginals manageable by MCMC. Unless I miss a complexity such that the decomposition is impossible. And Antonietta Mira on estimating time-evolving networks estimated by ABC (which Anto first showed me in Orly airport, waiting for her plane!). With a possibility of a zero distance. Next talk by Nadja Klein on impicit copulas, linked with shrinkage properties I was unaware of, including the case of spike & slab copulas. Michael Smith also spoke of copulas with discrete margins, mentioning a version with continuous latent variables (as I thought could be done during the first session of the day), then moving to variational Bayes which sounds quite popular at JSM 2018. And David Gunawan made a presentation of a paper mixing pseudo-marginal Metropolis with particle Gibbs sampling, written with Chris Carter and Robert Kohn, making me wonder at their feature of using the white noise as an auxiliary variable in the estimation of the likelihood, which is quite clever but seems to get against the validation of the pseudo-marginal principle. (Warning: I have been known to be wrong!)

clustering dynamical networks

Posted in pictures, Statistics, University life with tags , , , , , , , , , , on June 5, 2018 by xi'an


Yesterday I attended a presentation by Catherine Matias on dynamic graph structures, as she was giving a plenary talk at the 50th French statistical meeting, conveniently located a few blocks away from my office at ENSAE-CREST. In the nicely futuristic buildings of the EDF campus, which are supposed to represent cogs according to the architect, but which remind me more of these gas holders so common in the UK, at least in the past! (The E of EDF stands for electricity, but the original public company handled both gas and electricity.) This was primarily a survey of the field, which is much more diverse and multifaceted than I realised, even though I saw some recent developments by Antonietta Mira and her co-authors, as well as refereed a thesis on temporal networks at Ca’Foscari by Matteo Iacopini, which defence I will attend in early July. The difficulty in the approaches covered by Catherine stands with the amount and complexity of the latent variables induced by the models superimposed on the data. In her paper with Christophe Ambroise, she followed a variational EM approach. From the spectator perspective that is mine, I wondered at using ABC instead, which is presumably costly when the data size grows in space or in time. And at using tensor structures as in Mateo’s thesis. This reminded me as well of Luke Bornn’s modelling of basketball games following each player in real time throughout the game. (Which does not prevent the existence of latent variables.) But more vaguely and speculatively I also wonder at the meaning of the chosen models, which try to represent “everything” in the observed process, which seems doomed from the start given the heterogeneity of the data. While reaching my Keynesian pessimistic low- point, which happens rather quickly!, one could hope for projection techniques, towards reducing the dimension of the data of interest and of the parameter required by the model.