Archive for ABC

JSM 2018 [#3]

Posted in Mountains, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on August 1, 2018 by xi'an

As I skipped day #2 for climbing, here I am on day #3, attending JSM 2018, with a [fully Canadian!] session on (conditional) copula (where Bruno Rémillard talked of copulas for mixed data, with unknown atoms, which sounded like an impossible target!), and another on four highlights from Bayesian Analysis, (the journal), with Maria Terres defending the (often ill-considered!) spectral approach within Bayesian analysis, modelling spectral densities (Fourier transforms of correlations functions, not probability densities), an advantage compared with MCAR modelling being the automated derivation of dependence graphs. While the spectral ghost did not completely dissipate for me, the use of DIC that she mentioned at the very end seems to call for investigation as I do not know of well-studied cases of complex dependent data with clearly specified DICs. Then Chris Drobandi was speaking of ABC being used for prior choice, an idea I vaguely remember seeing quite a while ago as a referee (or another paper!), paper in BA that I missed (and obviously did not referee). Using the same reference table works (for simple ABC) with different datasets but also different priors. I did not get first the notion that the reference table also produces an evaluation of the marginal distribution but indeed the entire simulation from prior x generative model gives a Monte Carlo representation of the marginal, hence the evidence at the observed data. Borrowing from Evans’ fringe Bayesian approach to model choice by prior predictive check for prior-model conflict. I remain sceptic or at least agnostic on the notion of using data to compare priors. And here on using ABC in tractable settings.

The afternoon session was [a mostly Australian] Advanced Bayesian computational methods,  with Robert Kohn on variational Bayes, with an interesting comparison of (exact) MCMC and (approximative) variational Bayes results for some species intensity and the remark that forecasting may be much more tolerant to the approximation than estimation. Making me wonder at a possibility of assessing VB on the marginals manageable by MCMC. Unless I miss a complexity such that the decomposition is impossible. And Antonietta Mira on estimating time-evolving networks estimated by ABC (which Anto first showed me in Orly airport, waiting for her plane!). With a possibility of a zero distance. Next talk by Nadja Klein on impicit copulas, linked with shrinkage properties I was unaware of, including the case of spike & slab copulas. Michael Smith also spoke of copulas with discrete margins, mentioning a version with continuous latent variables (as I thought could be done during the first session of the day), then moving to variational Bayes which sounds quite popular at JSM 2018. And David Gunawan made a presentation of a paper mixing pseudo-marginal Metropolis with particle Gibbs sampling, written with Chris Carter and Robert Kohn, making me wonder at their feature of using the white noise as an auxiliary variable in the estimation of the likelihood, which is quite clever but seems to get against the validation of the pseudo-marginal principle. (Warning: I have been known to be wrong!)

indecent exposure

Posted in Statistics with tags , , , , , , , , , , , , on July 27, 2018 by xi'an

While attending my last session at MCqMC 2018, in Rennes, before taking a train back to Paris, I was confronted by this radical opinion upon our previous work with Matt Moores (Warwick) and other coauthors from QUT, where the speaker, Maksym Byshkin from Lugano, defended a new approach for maximum likelihood estimation using novel MCMC methods. Based on the point fixe equation characterising maximum likelihood estimators for exponential families, when theoretical and empirical moments of the natural statistic are equal. Using a Markov chain with stationary distribution the said exponential family, the fixed point equation can be turned into a zero divergence equation, requiring simulation of pseudo-data from the model, which depends on the unknown parameter. Breaking this circular argument, the authors note that simulating pseudo-data that reproduce the observed value of the sufficient statistic is enough. Which is related with Geyer and Thomson (1992) famous paper about Monte Carlo maximum likelihood estimation. From there I was and remain lost as I cannot see why a derivative of the expected divergence with respect to the parameter θ can be computed when this divergence is found by Monte Carlo rather than exhaustive enumeration. And later used in a stochastic gradient move on the parameter θ… Especially when the null divergence is imposed on the parameter. In any case, the final slide shows an application to a large image and an Ising model, solving the problem (?) in 140 seconds and suggesting indecency, when our much slower approach is intended to produce a complete posterior simulation in this context.

ISBA 18 tidbits

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , on July 2, 2018 by xi'an

Among a continuous sequence of appealing sessions at this ISBA 2018 meeting [says a member of the scientific committee!], I happened to attend two talks [with a wee bit of overlap] by Sid Chib in two consecutive sessions, because his co-author Ana Simoni (CREST) was unfortunately sick. Their work was about models defined by a collection of moment conditions, as often happens in econometrics, developed in a recent JASA paper by Chib, Shin, and Simoni (2017). With an extension about moving to defining conditional expectations by use of a functional basis. The main approach relies on exponentially tilted empirical likelihoods, which reminded me of the empirical likelihood [BCel] implementation we ran with Kerrie Mengersen and Pierre Pudlo a few years ago. As a substitute to ABC. This problematic made me wonder on how much Bayesian the estimating equation concept is, as it should somewhat involve a nonparametric prior under the moment constraints.

Note that Sid’s [talks and] papers are disconnected from ABC, as everything comes in closed form, apart from the empirical likelihood derivation, as we actually found in our own work!, but this could become a substitute model for ABC uses. For instance, identifying the parameter θ of the model by identifying equations. Would that impose too much input from the modeller? I figure I came with this notion mostly because of the emphasis on proxy models the previous day at ABC in ‘burgh! Another connected item of interest in the work is the possibility of accounting for misspecification of these moment conditions by introducing a vector of errors with a spike & slab distribution, although I am not sure this is 100% necessary without getting further into the paper(s) [blame conference pressure on my time].

Another highlight was attending a fantastic poster session Monday night on computational methods except I would have needed four more hours to get through every and all posters. This new version of ISBA has split the posters between two sites (great) and themes (not so great!), while I would have preferred more sites covering all themes over all nights, to lower the noise (still bearable this year) and to increase the possibility to check all posters of interest in a particular theme…

Mentioning as well a great talk by Dan Roy about assessing deep learning performances by what he calls non-vacuous error bounds. Namely, through PAC-Bayesian bounds. One major comment of his was about deep learning models being much more non-parametric (number of parameters rising with number of observations) than parametric models, meaning that generative adversarial constructs as the one I discussed a few days ago may face a fundamental difficulty as models are taken at face value there.

On closed-form solutions, a closed-form Bayes factor for component selection in mixture models by Fũqene, Steel and Rossell that resemble the Savage-Dickey version, without the measure theoretic difficulties. But with non-local priors. And closed-form conjugate priors for the probit regression model, using unified skew-normal priors, as exhibited by Daniele Durante. Which are product of Normal cdfs and pdfs, and which allow for closed form marginal likelihoods and marginal posteriors as well. (The approach is not exactly conjugate as the prior and the posterior are not in the same family.)

And on the final session I attended, there were two talks on scalable MCMC, one on coresets, which will require some time and effort to assimilate, by Trevor Campbell and Tamara Broderick, and another one using Poisson subsampling. By Matias Quiroz and co-authors. Which did not completely convinced me (but this was the end of a long day…)

All in all, this has been a great edition of the ISBA meetings, if quite intense due to a non-stop schedule, with a very efficient organisation that made parallel sessions manageable and poster sessions back to a reasonable scale [although I did not once manage to cross the street to the other session]. Being in unreasonably sunny Edinburgh helped a lot obviously! I am a wee bit disappointed that no one else follows my call to wear a kilt, but I had low expectations to start with… And too bad I missed the Ironman 70.3 Edinburgh by one day!

ABC in Ed’burgh

Posted in Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , on June 28, 2018 by xi'an

A glorious day for this new edition of the “ABC in…” workshops, in the capital City of Edinburgh! I enjoyed very much this ABC day for demonstrating ABC is still alive and kicking!, i.e., enjoying plenty of new developments and reinterpretations. With more talks and posters on the way during the main ISBA 2018 meeting. (All nine talks are available on the webpage of the conference.)

After Michael Gutmann’s tutorial on ABC, Gael Martin (Monash) presented her recent work with David Frazier, Ole Maneesoonthorn, and Brendan McCabe on ABC  for prediction. Maybe unsurprisingly, Bayesian consistency for the given summary statistics is a sufficient condition for concentration of the ABC predictor, but ABC seems to do better for the prediction problem than for parameter estimation, not losing to exact Bayesian inference, possibly because in essence the summary statistics there need not be of a large dimension to being consistent. The following talk by Guillaume Kon Kam King was also about prediction, for the specific problem of gas offer, with a latent Wright-Fisher point process in the model. He used a population ABC solution to handle this model.

Alexander Buchholz (CREST) introduced an ABC approach with quasi-Monte Carlo steps that helps in reducing the variability and hence improves the approximation in ABC. He also looked at a Negative Geometric variant of regular ABC by running a random number of proposals until reaching a given number of acceptances, which while being more costly produces more stability.

Other talks by Trevelyan McKinley, Marko Järvenpää, Matt Moores (Warwick), and Chris Drovandi (QUT) illustrated the urge of substitute models as a first step, and not solely via Gaussian processes. With for instance the new notion of a loss function to evaluate this approximation. Chris made a case in favour of synthetic vs ABC approaches, due to degradation of the performances of nonparametric density estimation with the dimension. But I remain a doubting Thomas [Bayes] on that point as high dimensions in the data or the summary statistics are not necessarily the issue, as also processed in the paper on ABC-CDE discussed on a recent post. While synthetic likelihood requires estimating a mean function and a covariance function of the parameter of the dimension of the summary statistic. Even though estimated by simulation.

Another neat feature of the day was a special session on cosmostatistics with talks by Emille Ishida and Jessica Cisewski, from explaining how ABC was starting to make an impact on cosmo- and astro-statistics, to the special example of the stellar initial mass distribution in clusters.

Call is now open for the next “ABC in”! Note that, while these workshops have been often formally sponsored by ISBA and its BayesComp section, they are not managed by a society or a board of administrators, and hence are not much contrived by a specific format. It would just be nice to keep the low fees as part of the tradition.

Bayesian gan [gan style]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , , on June 26, 2018 by xi'an

In their paper Bayesian GANS, arXived a year ago, Saatchi and Wilson consider a Bayesian version of generative adversarial networks, putting priors on both the model and the discriminator parameters. While the prospect seems somewhat remote from genuine statistical inference, if the following statement is representative

“GANs transform white noise through a deep neural network to generate candidate samples from a data distribution. A discriminator learns, in a supervised manner, how to tune its parameters so as to correctly classify whether a given sample has come from the generator or the true data distribution. Meanwhile, the generator updates its parameters so as to fool the discriminator. As long as the generator has sufficient capacity, it can approximate the cdf inverse-cdf composition required to sample from a data distribution of interest.”

I figure the concept can also apply to a standard statistical model, where x=G(z,θ) rephrases the distributional assumption x~F(x;θ) via a white noise z. This makes resorting to a prior distribution on θ more relevant in the sense of using potential prior information on θ (although the successes of probabilistic numerics show formal priors can be used on purely numerical ground).

The “posterior distribution” that is central to the notion of Bayesian GANs is however unorthodox in that the distribution is associated with the following conditional posteriors

where D(x,θ) is the “discriminator”, that is, in GAN lingo, the probability to be allocated to the “true” data generating mechanism rather than to the one associated with G(·,θ). The generative conditional posterior (1) then aims at fooling the discriminator, i.e. favours generative parameter values that raise the probability of wrong allocation of the pseudo-data. The discriminative conditional posterior (2) is a standard Bayesian posterior based on the original sample and the generated sample. The authors then iteratively sample from these posteriors, effectively implementing a two-stage Gibbs sampler.

“By iteratively sampling from (1) and (2) at every step of an epoch one can, in the limit, obtain samples from the approximate posteriors over [both sets of parameters].”

What worries me about this approach is that  just cannot work, in the sense that (1) and (2) cannot be compatible conditional (posterior) distributions. There is no joint distribution for which (1) and (2) would be the conditionals, since the pseudo-data appears in D for (1) and (1-D) in (2). This means that the convergence of a Gibbs sampler is at best to a stationary σ-finite measure. And hence that the meaning of the chain is delicate to ascertain… Am I missing any fundamental point?! [I checked the reviews on NIPS webpage and could not spot this issue being raised.]

ABC²DE

Posted in Books, Statistics with tags , , , , , , , , , , , , , on June 25, 2018 by xi'an

A recent arXival on a new version of ABC based on kernel estimators (but one could argue that all ABC versions are based on kernel estimators, one way or another.) In this ABC-CDE version, Izbicki,  Lee and Pospisilz [from CMU, hence the picture!] argue that past attempts failed to exploit the full advantages of kernel methods, including the 2016 ABCDE method (from Edinburgh) briefly covered on this blog. (As an aside, CDE stands for conditional density estimation.) They also criticise these attempts at selecting summary statistics and hence failing in sufficiency, which seems a non-issue to me, as already discussed numerous times on the ‘Og. One point of particular interest in the long list of drawbacks found in the paper is the inability to compare several estimates of the posterior density, since this is not directly ingrained in the Bayesian construct. Unless one moves to higher ground by calling for Bayesian non-parametrics within the ABC algorithm, a perspective which I am not aware has been pursued so far…

The selling points of ABC-CDE are that (a) the true focus is on estimating a conditional density at the observable x⁰ rather than everywhere. Hence, rejecting simulations from the reference table if the pseudo-observations are too far from x⁰ (which implies using a relevant distance and/or choosing adequate summary statistics). And then creating a conditional density estimator from this subsample (which makes me wonder at a double use of the data).

The specific density estimation approach adopted for this is called FlexCode and relates to an earlier if recent paper from Izbicki and Lee I did not read. As in many other density estimation approaches, they use an orthonormal basis (including wavelets) in low dimension to estimate the marginal of the posterior for one or a few components of the parameter θ. And noticing that the posterior marginal is a weighted average of the terms in the basis, where the weights are the posterior expectations of the functions themselves. All fine! The next step is to compare [posterior] estimators through an integrated squared error loss that does not integrate the prior or posterior and does not tell much about the quality of the approximation for Bayesian inference in my opinion. It is furthermore approximated by  a doubly integrated [over parameter and pseudo-observation] squared error loss, using the ABC(ε) sample from the prior predictive. And the approximation error only depends on the regularity of the error, that is the difference between posterior and approximated posterior. Which strikes me as odd, since the Monte Carlo error should take over but does not appear at all. I am thus unclear as to whether or not the convergence results are that relevant. (A difficulty with this paper is the strong dependence on the earlier one as it keeps referencing one version or another of FlexCode. Without reading the original one, I spotted a mention made of the use of random forests for selecting summary statistics of interest, without detailing the difference with our own ABC random forest papers (for both model selection and estimation). For instance, the remark that “nuisance statistics do not affect the performance of FlexCode-RF much” reproduces what we observed with ABC-RF.

The long experiment section always relates to the most standard rejection ABC algorithm, without accounting for the many alternatives produced in the literature (like Li and Fearnhead, 2018. that uses Beaumont et al’s 2002 scheme, along with importance sampling improvements, or ours). In the case of real cosmological data, used twice, I am uncertain of the comparison as I presume the truth is unknown. Furthermore, from having worked on similar data a dozen years ago, it is unclear why ABC is necessary in such context (although I remember us running a test about ABC in the Paris astrophysics institute once).

ABC’ptotics on-line

Posted in Statistics with tags , , , , , , , on June 14, 2018 by xi'an

Our paper on Asymptotic properties of ABC with David Frazier, Gael Martin, and Judith Rousseau, is now on-line on the Biometrika webpage. Coincidentally both papers by Wentao Li and Paul Fearnhead on ABC’ptotics are published in the June issue of the journal.

Approximate Bayesian computation allows for statistical analysis using models with intractable likelihoods. In this paper we consider the asymptotic behaviour of the posterior distribution obtained by this method. We give general results on the rate at which the posterior distribution concentrates on sets containing the true parameter, the limiting shape of the posterior distribution, and the asymptotic distribution of the posterior mean. These results hold under given rates for the tolerance used within the method, mild regularity conditions on the summary statistics, and a condition linked to identification of the true parameters. Implications for practitioners are discussed.