**A**n interesting ICML 2018 paper by Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman that I missed last summer, on [the fairly important issue of] assessing the quality, or lack thereof, of a variational Bayes approximation, in the sense of it being close enough to the true posterior. The criterion they propose in this paper relates to the Pareto smoothed importance sampling technique discussed in an earlier post, which I remember discussing with Andrew when he visited CREST a few years ago. Truncating the importance weights of prior x likelihood / VB approximation avoids infinite variance issues but induces an unknown amount of bias. The resulting diagnostic is based on estimating the order k of a Pareto distribution fitted to the largest weights: if the true value of k is less than ½, the variance of the associated Pareto distribution is finite. The paper suggests deeming the variational approximation adequate when the estimate of k is less than 0.7, based on the empirical assessment in the earlier paper. The paper also contains a remark on the poor performances of the generalisation of this method to marginal settings, that is, when the importance weight is the ratio of the true and variational marginals for a sub-vector of interest. I find these counter-performances somewhat worrying in that Rao-Blackwellisation arguments make me prefer marginal ratios to joint ratios. It may however be due to a poor approximation of the marginal ratio, which would reflect on the approximation rather than on the ratio itself. A second proposal in the paper focuses solely on the point estimate returned by the variational Bayes approximation, testing whether the posterior predictive is well-calibrated. This is less appealing, especially when the authors point out that the “disadvantage is that this diagnostic does not cover the case where the observed data is not well represented by the model”, in other words, misspecified situations.
This potential misspecification could presumably be tested by comparing the Pareto fit based on the actual data with a Pareto fit based on simulated data. Among other deficiencies, they point out that this is “a local diagnostic that will not detect unseen modes”. In other words, *what you get is what you see*.
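As a rough illustration of the diagnostic (and not the actual PSIS algorithm, which also smooths the tail weights and uses a data-driven threshold), a generalised Pareto fit to the upper tail of the importance ratios already separates a too-narrow variational proposal from a defensive wider one. Everything below (the `khat` helper, the fixed tail fraction, the Normal toy target) is a made-up sketch:

```python
import numpy as np
from scipy.stats import genpareto, norm

def khat(log_ratios, tail_frac=0.2):
    """Estimate the Pareto shape k from the upper tail of importance
    ratios -- a simplified stand-in for the PSIS diagnostic (no weight
    smoothing, arbitrary fixed tail fraction)."""
    r = np.sort(np.exp(log_ratios - log_ratios.max()))  # stabilised ratios
    m = int(tail_frac * len(r))
    exceed = r[-m:] - r[-m - 1]        # exceedances over the threshold
    c, _, _ = genpareto.fit(exceed, floc=0)
    return c                           # scipy's shape c plays the role of k

rng = np.random.default_rng(0)
n = 5_000
# target N(0,1); a too-narrow proposal N(0,0.5^2) gives unbounded
# weights, while a defensive wider proposal N(0,2^2) gives bounded ones
x_bad = rng.normal(0.0, 0.5, n)
k_bad = khat(norm.logpdf(x_bad) - norm.logpdf(x_bad, scale=0.5))
x_good = rng.normal(0.0, 2.0, n)
k_good = khat(norm.logpdf(x_good) - norm.logpdf(x_good, scale=2.0))
```

The narrow proposal should typically land above the 0.7 danger threshold, and the wide one well below it, matching the finite-variance reading of k < ½.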

## Archive for variational Bayes methods

## did variational Bayes work?

Posted in Books, Statistics with tags approximate Bayesian inference, asymptotic Bayesian methods, ICML 2018, importance sampling, misspecified model, Pareto distribution, Pareto smoothed importance sampling, posterior predictive, variational Bayes methods, what you get is what you see on May 2, 2019 by xi'an

## 19 dubious ways to compute the marginal likelihood

Posted in Books, Statistics with tags ABC, bridge sampling, harmonic mean estimator, Laplace approximation, MCMC-free, Monte Carlo Statistical Methods, nested sampling, path sampling, power likelihood, stepping stone sampling, variational Bayes methods on December 11, 2018 by xi'an

**A** recent arXival on nineteen different [and not necessarily dubious!] ways to approximate the marginal likelihood of a given topology of a phylogeny tree, which reminded me of our San Antonio survey with Jean-Michel Marin. This includes a version of the Laplace approximation called Laplus (!), accounting for the fact that branch lengths on the tree are positive but may have a MAP at zero, by using a Beta, Gamma, or log-Normal distribution instead of a Normal. For importance sampling, the proposals are derived either from the Laplus (!) approximate distributions or from the variational Bayes solution (based on a product of Normals). Harmonic means are still used here despite the obvious danger, along with a defensive version that mixes prior and posterior. Naïve Monte Carlo means simulating from the prior, while bridge sampling seems to use samples from both prior and posterior distributions. The path and modified path sampling versions are those proposed in 2008 by Nial Friel and Tony Pettitt (QUT). Stepping stone sampling appears as another version of path sampling, also based on a telescopic product of ratios of normalising constants, the generalised version relying on a normalising reference distribution that needs to be calibrated. CPO and PPD in the above table are two versions based on posterior predictive density estimates.
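To make two of the contenders concrete, naïve Monte Carlo and the harmonic mean estimator can be sketched on a conjugate Normal toy model where the marginal likelihood is available in closed form; the model, names, and sample sizes below are illustrative choices of mine, not taken from the paper:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(1)
y, n = 1.0, 100_000

# conjugate toy model: theta ~ N(0,1), y | theta ~ N(theta,1),
# hence the marginal likelihood is y ~ N(0,2) in closed form
log_m_true = norm.logpdf(y, scale=np.sqrt(2.0))

# naive Monte Carlo: average the likelihood over prior draws
theta_prior = rng.normal(0.0, 1.0, n)
log_m_naive = logsumexp(norm.logpdf(y, loc=theta_prior)) - np.log(n)

# harmonic mean: inverse of the averaged inverse likelihood over
# posterior draws (posterior is N(y/2, 1/2) here); the "obvious danger"
# is that 1/L(theta) has infinite variance in this very example
theta_post = rng.normal(y / 2.0, np.sqrt(0.5), n)
log_m_hm = -(logsumexp(-norm.logpdf(y, loc=theta_post)) - np.log(n))
```

On this target the naïve estimate is reliable because the prior is close to the posterior; the harmonic mean returns a finite number but with no variance guarantee whatsoever.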

When running the comparison between so many contenders, the ground truth is selected as the values returned by MrBayes in a massive MCMC experiment amounting to 7.5 billion generations, for five different datasets. The above picture describes mean square errors for the probabilities of split, over ten replicates [when meaningful], the worst case being naïve Monte Carlo, with nested sampling and harmonic mean solutions close by. Similar assessments proceed from a comparison of Kullback-Leibler divergences, with the (predictable?) note that “the methods do a better job approximating the marginal likelihood of more probable trees than less probable trees”. And massive variability for the poorest methods:

The comparison above does not account for computing time and, since some methods are deterministic (and fast), there is little to do about this. The stepping stone solutions are very costly, while in the middle range bridge sampling outdoes path sampling. The assessment of nested sampling found in the conclusion is that it “would appear to be an unwise choice for estimating the marginal likelihoods of topologies, as it produces poor approximate posteriors” (p.12). The conclusion crowns the Gamma Laplus approximation as the winner across all categories! (There is no ABC solution studied in this paper as the model likelihood can be computed in this setup, contrary to our own setting.)
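Since stepping stone sampling comes up as both accurate and costly, here is a minimal version of its telescopic-product idea on the same kind of conjugate Normal toy model, where each power posterior happens to be available exactly (in practice each temperature would require its own MCMC run); the ladder and sample sizes are arbitrary:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(2)
y, n = 1.0, 50_000
betas = np.linspace(0.0, 1.0, 11)          # temperature ladder
log_m_true = norm.logpdf(y, scale=np.sqrt(2.0))  # closed-form target

# telescopic product: Z_1/Z_0 = prod_k E_{beta_k}[ L^(beta_{k+1}-beta_k) ]
log_Z = 0.0
for b0, b1 in zip(betas[:-1], betas[1:]):
    # power posterior at b0 is exactly N(b0*y/(1+b0), 1/(1+b0)) in this
    # conjugate toy -- the only reason no MCMC is needed here
    theta = rng.normal(b0 * y / (1 + b0), np.sqrt(1 / (1 + b0)), n)
    ll = norm.logpdf(y, loc=theta)
    log_Z += logsumexp((b1 - b0) * ll) - np.log(n)
```

Each ratio in the product only bridges a small change of temperature, which is exactly why the method is stable, and exactly why it is expensive: one simulation run per rung of the ladder.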

## graphe, graphons, graphez !

Posted in Books, pictures, Statistics, University life with tags graphs, Institut Henri Poincaré, mathematical statistics, Paris, phase transition, SFDS, variational Bayes methods on December 3, 2018 by xi'an

## JSM 2018 [#3]

Posted in Mountains, Statistics, Travel, University life with tags ABC, Approximate Bayesian computation, Bayesian network, Bayesian p-values, British Columbia, Canada, curse of dimensionality, JSM 2018, prior predictive, pseudo-marginal MCMC, spectral analysis, spike-and-slab prior, stochastic gradient descent, Vancouver, variational Bayes methods on August 1, 2018 by xi'an

**A**s I skipped day #2 for climbing, here I am on day #3 of JSM 2018, attending a [fully Canadian!] session on (conditional) copulas (where Bruno Rémillard talked of copulas for mixed data, with unknown atoms, which sounded like an impossible target!), and another on four highlights from Bayesian Analysis (the journal), with Maria Terres defending the (often ill-considered!) spectral approach within Bayesian analysis, modelling spectral densities (Fourier transforms of correlation functions, not probability densities), an advantage over MCAR modelling being the automated derivation of dependence graphs. While the spectral ghost did not completely dissipate for me, the use of DIC that she mentioned at the very end seems to call for investigation, as I do not know of well-studied cases of complex dependent data with clearly specified DICs. Then Chris Drovandi spoke of ABC being used for prior choice, an idea I vaguely remember encountering quite a while ago (as a referee of another paper?!), in a BA paper that I missed (and obviously did not referee). Using the same reference table works (for simple ABC) with different datasets but also with different priors. I did not at first get the notion that the reference table also produces an evaluation of the marginal distribution, but indeed the entire simulation from prior x generative model gives a Monte Carlo representation of the marginal, hence the evidence at the observed data. This borrows from Evans' fringe Bayesian approach to model choice by prior predictive checks for prior-model conflict.
I remain sceptical, or at least agnostic, on the notion of using data to compare priors, and here on using ABC in tractable settings.
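The recycling of a single reference table for evidence approximation under several priors can be sketched as follows, on a Normal toy model with a uniform ABC kernel; all settings (priors, tolerance, table size) are illustrative choices of mine, not Drovandi's:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
y_obs, eps, N = 0.5, 0.1, 200_000

# one reference table simulated from prior1 x generative model:
# theta ~ N(0,1) (prior1), y | theta ~ N(theta,1)
theta = rng.normal(0.0, 1.0, N)
y = rng.normal(theta, 1.0)
accept = np.abs(y - y_obs) < eps

# ABC evidence under prior1: acceptance rate over the kernel volume,
# a Monte Carlo version of the marginal at the observed data
m1_hat = accept.mean() / (2 * eps)
m1_true = norm.pdf(y_obs, scale=np.sqrt(2.0))   # closed-form check

# the *same* table recycled for prior2 = N(1,1) by importance weighting
w = norm.pdf(theta, loc=1.0) / norm.pdf(theta)
m2_hat = (w * accept).mean() / (2 * eps)
m2_true = norm.pdf(y_obs, loc=1.0, scale=np.sqrt(2.0))
```

The second estimate never touches the simulator again, which is the whole appeal of the reference-table reuse, provided prior2 does not stray too far from prior1 in the tails.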

The afternoon session was [a mostly Australian] Advanced Bayesian computational methods, with Robert Kohn on variational Bayes, with an interesting comparison of (exact) MCMC and (approximate) variational Bayes results for some species intensity model and the remark that forecasting may be much more tolerant of the approximation than estimation, making me wonder about the possibility of assessing VB on those marginals manageable by MCMC. Unless I miss a complexity such that the decomposition is impossible. And Antonietta Mira on time-evolving networks estimated by ABC (which Anto first showed me in Orly airport, waiting for her plane!), with a possibility of a zero distance. Next came Nadja Klein on implicit copulas, linked with shrinkage properties I was unaware of, including the case of spike & slab copulas. Michael Smith also spoke of copulas with discrete margins, mentioning a version with continuous latent variables (as I thought could be done during the first session of the day), then moving to variational Bayes, which sounds quite popular at JSM 2018. And David Gunawan presented a paper mixing pseudo-marginal Metropolis with particle Gibbs sampling, written with Chris Carter and Robert Kohn, making me wonder at their feature of using the white noise as an auxiliary variable in the estimation of the likelihood, which is quite clever but seems to go against the validation of the pseudo-marginal principle. *(Warning: I have been known to be wrong!)*
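The pseudo-marginal principle at stake can be illustrated on a toy latent-variable model where the intractable likelihood is replaced by an unbiased Monte Carlo estimate inside Metropolis; the key feature is that the current estimate is recycled, never recomputed, which is what keeps the exact posterior as the target. This is a generic sketch with arbitrary settings, not the Gunawan, Carter and Kohn algorithm:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
y, M, iters = 1.5, 5, 50_000

def log_lhat(theta):
    # unbiased estimator of the intractable likelihood N(y; theta, 2),
    # obtained by averaging over M draws of the latent z ~ N(0,1)
    z = rng.normal(0.0, 1.0, M)
    return np.log(np.mean(norm.pdf(y, loc=theta + z, scale=1.0)))

theta, cur = 0.0, log_lhat(0.0)       # prior on theta is N(0,1)
chain = np.empty(iters)
for t in range(iters):
    prop = theta + rng.normal(0.0, 1.0)
    cand = log_lhat(prop)
    # the current estimate `cur` is recycled, not refreshed -- refreshing
    # it would break the exactness of the pseudo-marginal construction
    log_a = cand + norm.logpdf(prop) - cur - norm.logpdf(theta)
    if np.log(rng.uniform()) < log_a:
        theta, cur = prop, cand
    chain[t] = theta
```

Despite the noisy likelihood estimates, the chain targets the exact posterior, here N(y/3, 2/3) by conjugacy, at the cost of occasional sticky stretches when an estimate is unluckily large.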

## Bayesian synthetic likelihood [a reply from the authors]

Posted in Books, pictures, Statistics, University life with tags Bayesian synthetic likelihood, misspecification, pseudo-marginal, variational Bayes methods on December 26, 2017 by xi'an

*[Following my comments on the Bayesian synthetic likelihood paper in JCGS, the authors sent me the following reply by Leah South (previously Leah Price).]*

Thanks Christian for your comments!

The pseudo-marginal idea is useful here because it tells us that in the ideal case in which the model statistic is normal and if we use the unbiased density estimator of the normal then we have an MCMC algorithm that converges to the same target regardless of the value of n (number of model simulations per MCMC iteration). It is true that the bias reappears in the case of misspecification. We found that the target based on the simple plug-in Gaussian density was also remarkably insensitive to n. Given this insensitivity, we consider calling again on the pseudo-marginal literature to offer guidance in choosing n to minimise computational effort and we recommend the use of the plug-in Gaussian density in BSL because it is simpler to implement.
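[For the reader, a minimal sketch of the plug-in Gaussian synthetic likelihood being discussed: simulate n datasets from the model at θ, fit the mean and covariance of the summary statistics, and evaluate the resulting Gaussian density at the observed summaries. The toy Normal model, summary choice, and settings below are illustrative, not from the paper.]

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

def summaries(data):
    # toy summary statistic: sample mean and sample variance
    return np.array([data.mean(), data.var()])

def bsl_loglik(theta, s_obs, n=200, m=30):
    # plug-in Gaussian synthetic likelihood: moments of the summary
    # statistic estimated from n model simulations of size m each
    sims = np.array([summaries(rng.normal(theta, 1.0, m)) for _ in range(n)])
    mu, Sigma = sims.mean(axis=0), np.cov(sims, rowvar=False)
    return multivariate_normal.logpdf(s_obs, mean=mu, cov=Sigma)

# "observed" data generated at theta = 1 for the sake of the example
y_obs = rng.normal(1.0, 1.0, 30)
s_obs = summaries(y_obs)
```

The synthetic log-likelihood is then higher near the data-generating value than away from it, and the only tuning parameter in sight is n, consistent with the insensitivity claim above.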

“I am also lost to the argument that the synthetic version is more efficient than ABC, in general”

Given the parametric approximation to the summary statistic likelihood, we expect BSL to be computationally more efficient than ABC. We show this is the case theoretically in a toy example in the paper and find empirically on a number of examples that BSL is more computationally efficient, but we agree that further analysis would be of interest.

The concept of using random forests to handle additional summary statistics is interesting and useful. BSL was able to utilise all the information in the high dimensional summary statistics that we considered rather than resorting to dimension reduction (implying a loss of information), and we believe that is a benefit of BSL over standard ABC. Further, in high-dimensional parameter applications the summary statistic dimension will necessarily be large even if there is one statistic per parameter. BSL can be very useful in such problems. In fact we have done some work on exactly this, combining variational Bayes with synthetic likelihood.

Another benefit of BSL is that it is easier to tune (there are fewer tuning parameters and the BSL target is highly insensitive to n). Surprisingly, BSL performs reasonably well when the summary statistics are not normally distributed — as long as they aren’t highly irregular!