Archive for variational Bayes methods

clustering dynamical networks

Posted in Statistics, University life, pictures with tags , , , , , , , , , , on June 5, 2018 by xi'an

Yesterday I attended a presentation by Catherine Matias on dynamic graph structures, as she was giving a plenary talk at the 50th French statistical meeting, conveniently located a few blocks away from my office at ENSAE-CREST. In the nicely futuristic buildings of the EDF campus, which are supposed to represent cogs according to the architect, but which remind me more of these gas holders so common in the UK, at least in the past! (The E of EDF stands for electricity, but the original public company handled both gas and electricity.) This was primarily a survey of the field, which is much more diverse and multifaceted than I realised, even though I saw some recent developments by Antonietta Mira and her co-authors, as well as refereed a thesis on temporal networks at Ca’Foscari by Matteo Iacopini, which defence I will attend in early July. The difficulty in the approaches covered by Catherine stands with the amount and complexity of the latent variables induced by the models superimposed on the data. In her paper with Christophe Ambroise, she followed a variational EM approach. From the spectator perspective that is mine, I wondered at using ABC instead, which is presumably costly when the data size grows in space or in time. And at using tensor structures as in Mateo’s thesis. This reminded me as well of Luke Bornn’s modelling of basketball games following each player in real time throughout the game. (Which does not prevent the existence of latent variables.) But more vaguely and speculatively I also wonder at the meaning of the chosen models, which try to represent “everything” in the observed process, which seems doomed from the start given the heterogeneity of the data. While reaching my Keynesian pessimistic low- point, which happens rather quickly!, one could hope for projection techniques, towards reducing the dimension of the data of interest and of the parameter required by the model.

Bayesian synthetic likelihood [a reply from the authors]

Posted in Books, pictures, Statistics, University life with tags , , , on December 26, 2017 by xi'an

[Following my comments on the Bayesian synthetic likelihood paper in JGCS, the authors sent me the following reply by Leah South (previously Leah Price).]

Thanks Christian for your comments!

ucgsThe pseudo-marginal idea is useful here because it tells us that in the ideal case in which the model statistic is normal and if we use the unbiased density estimator of the normal then we have an MCMC algorithm that converges to the same target regardless of the value of n (number of model simulations per MCMC iteration). It is true that the bias reappears in the case of misspecification. We found that the target based on the simple plug-in Gaussian density was also remarkably insensitive to n. Given this insensitivity, we consider calling again on the pseudo-marginal literature to offer guidance in choosing n to minimise computational effort and we recommend the use of the plug-in Gaussian density in BSL because it is simpler to implement.

“I am also lost to the argument that the synthetic version is more efficient than ABC, in general”

Given the parametric approximation to the summary statistic likelihood, we expect BSL to be computationally more efficient than ABC. We show this is the case theoretically in a toy example in the paper and find empirically on a number of examples that BSL is more computationally efficient, but we agree that further analysis would be of interest.

The concept of using random forests to handle additional summary statistics is interesting and useful. BSL was able to utilise all the information in the high dimensional summary statistics that we considered rather than resorting to dimension reduction (implying a loss of information), and we believe that is a benefit of BSL over standard ABC. Further, in high-dimensional parameter applications the summary statistic dimension will necessarily be large even if there is one statistic per parameter. BSL can be very useful in such problems. In fact we have done some work on exactly this, combining variational Bayes with synthetic likelihood.

Another benefit of BSL is that it is easier to tune (there are fewer tuning parameters and the BSL target is highly insensitive to n). Surprisingly, BSL performs reasonably well when the summary statistics are not normally distributed — as long as they aren’t highly irregular!

the invasion of the stochastic gradients

Posted in Statistics with tags , , , , , , , , , on May 10, 2017 by xi'an

Within the same day, I spotted three submissions to arXiv involving stochastic gradient descent, that I briefly browsed on my trip back from Wales:

  1. Stochastic Gradient Descent as Approximate Bayesian inference, by Mandt, Hoffman, and Blei, where this technique is used as a type of variational Bayes method, where the minimum Kullback-Leibler distance to the true posterior can be achieved. Rephrasing the [scalable] MCMC algorithm of Welling and Teh (2011) as such an approximation.
  2. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent, by Arnak Dalalyan, which establishes a convergence of the uncorrected Langevin algorithm to the right target distribution in the sense of the Wasserstein distance. (Uncorrected in the sense that there is no Metropolis step, meaning this is a Euler approximation.) With an extension to the noisy version, when the gradient is approximated eg by subsampling. The connection with stochastic gradient descent is thus tenuous, but Arnak explains the somewhat disappointing rate of convergence as being in agreement with optimisation rates.
  3. Stein variational adaptive importance sampling, by Jun Han and Qiang Liu, which relates to our population Monte Carlo algorithm, but as a non-parametric version, using RKHS to represent the transforms of the particles at each iteration. The sampling method follows two threads of particles, one that is used to estimate the transform by a stochastic gradient update, and another one that is used for estimation purposes as in a regular population Monte Carlo approach. Deconstructing into those threads allows for conditional independence that makes convergence easier to establish. (A problem we also hit when working on the AMIS algorithm.)

X divergence for approximate inference

Posted in Statistics with tags , , , , , , , on March 14, 2017 by xi'an

Dieng et al. arXived this morning a new version of their paper on using the Χ divergence for variational inference. The Χ divergence essentially is the expectation of the squared ratio of the target distribution over the approximation, under the approximation. It is somewhat related to Expectation Propagation (EP), which aims at the Kullback-Leibler divergence between the target distribution and the approximation, under the target. And to variational Bayes, which is the same thing just the opposite way! The authors also point a link to our [adaptive] population Monte Carlo paper of 2008. (I wonder at a possible version through Wasserstein distance.)

Some of the arguments in favour of this new version of variational Bayes approximations is that (a) the support of the approximation over-estimates the posterior support; (b) it produces over-dispersed versions; (c) it relates to a well-defined and global objective function; (d) it allows for a sandwich inequality on the model evidence; (e) the function of the [approximation] parameter to be minimised is under the approximation, rather than under the target. The latest allows for a gradient-based optimisation. While one of the applications is on a Bayesian probit model applied to the Pima Indian women dataset [and will thus make James and Nicolas cringe!], the experimental assessment shows lower error rates for this and other benchmarks. Which in my opinion does not tell so much about the original Bayesian approach.

automatic variational ABC

Posted in pictures, Statistics with tags , , , , , , , , , , on July 8, 2016 by xi'an

Amster11“Stochastic Variational inference is an appealing alternative to the inefficient sampling approaches commonly used in ABC.”

Moreno et al. [including Ted Meeds and Max Welling] recently arXived a paper merging variational inference and ABC. The argument for turning variational is computational speedup. The traditional (in variational inference) divergence decomposition of the log-marginal likelihood is replaced by an ABC version, parameterised in terms of intrinsic generators (i.e., generators that do not depend on cyber-parameters, like the U(0,1) or the N(0,1) generators). Or simulation code in the authors’ terms. Which leads to the automatic aspect of the approach. In the paper the derivation of the gradient is indeed automated.

“One issue is that even assuming that the ABC likelihood is an unbiased estimator of the true likelihood (which it is not), taking the log introduces a bias, so that we now have a biased estimate of the lower bound and thus biased gradients.”

I wonder how much of an issue this is, since we consider the variational lower bound. To be optimised in terms of the parameters of the variational posterior. Indeed, the endpoint of the analysis is to provide an optimal variational approximation, which remains an approximation whether or not the likelihood estimator is unbiased. A more “severe” limitation may be in the inversion constraint, since it seems to eliminate Beta or Gamma distributions. (Even though calling qbeta(runif(1),a,b) definitely is achievable… And not rejected by a Kolmogorov-Smirnov test.)

Incidentally, I discovered through the paper the existence of the Kumaraswamy distribution, which main appeal seems to be the ability to produce a closed-form quantile function, while bearing some resemblance with the Beta distribution. (Another arXival by Baltasar Trancón y Widemann studies some connections between those, but does not tell how to select the parameters to optimise the similarity.)

Michael Jordan’s seminar in Paris next week

Posted in Statistics, University life with tags , , , , , on June 3, 2016 by xi'an

Next week, on June 7, at 4pm, Michael will give a seminar at INRIA, rue du Charolais, Paris 12 (map). Here is the abstract:

A Variational Perspective on Accelerated Methods in Optimization

Accelerated gradient methods play a central role in optimization,achieving optimal rates in many settings. While many generalizations and extensions of Nesterov’s original acceleration method have been proposed,it is not yet clear what is the natural scope of the acceleration concept.In this paper, we study accelerated methods from a continuous-time perspective. We show that there is a Lagrangian functional that we call the Bregman Lagrangian which generates a large class of accelerated methods in continuous time, including (but not limited to) accelerated gradient descent, its non-Euclidean extension, and accelerated higher-order gradient methods. We show that the continuous-time limit of all of these methods correspond to travelling the same curve in space time at different speeds, and in this sense the continuous-time setting is the natural one for understanding acceleration.  Moreover, from this perspective, Nesterov’s technique and many of its generalizations can be viewed as a systematic way to go from the continuous-time curves generated by the Bregman Lagrangian to a family of discrete-time accelerated algorithms. [Joint work with Andre Wibisono and Ashia Wilson.]

(Interested readers need to register to attend the lecture.)

variational Bayes for variable selection

Posted in Books, Statistics, University life with tags , , , , , , , on March 30, 2016 by xi'an

Lake Agnes, Canadian Rockies, July 2007Xichen Huang, Jin Wang and Feng Liang have recently arXived a paper where they rely on variational Bayes in conjunction with a spike-and-slab prior modelling. This actually stems from an earlier paper by Carbonetto and Stephens (2012), the difference being in the implementation of the method, which is less Gibbs-like for the current paper. The approach is not fully Bayesian in that, not only an approximate (variational) representation is used for the parameters of interest (regression coefficient and presence-absence indicators) but also the nuisance parameters are replaced with MAPs. The variational approximation on the regression parameters is an independent product of spike-and-slab distributions. The authors show the approximate approach is consistent in both frequentist and Bayesian terms (under identifiability assumptions). The method is undoubtedly faster than MCMC since it shares many features with EM but I still wonder at the Bayesian interpretability of the outcome, which writes out as a product of estimated spike-and-slab mixtures. First, the weights in the mixtures are estimated by EM, hence fixed. Second, the fact that the variational approximation is a product is confusing in that the posterior distribution on the regression coefficients is unlikely to produce posterior independence.