Archive for Bayesian inference

ISBA 18 tidbits

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , on July 2, 2018 by xi'an

Among a continuous sequence of appealing sessions at this ISBA 2018 meeting [says a member of the scientific committee!], I happened to attend two talks [with a wee bit of overlap] by Sid Chib in two consecutive sessions, because his co-author Ana Simoni (CREST) was unfortunately sick. Their work was about models defined by a collection of moment conditions, as often happens in econometrics, developed in a recent JASA paper by Chib, Shin, and Simoni (2017). With an extension about moving to defining conditional expectations by use of a functional basis. The main approach relies on exponentially tilted empirical likelihoods, which reminded me of the empirical likelihood [BCel] implementation we ran with Kerrie Mengersen and Pierre Pudlo a few years ago. As a substitute to ABC. This problematic made me wonder on how much Bayesian the estimating equation concept is, as it should somewhat involve a nonparametric prior under the moment constraints.

Note that Sid’s [talks and] papers are disconnected from ABC, as everything comes in closed form, apart from the empirical likelihood derivation, as we actually found in our own work!, but this could become a substitute model for ABC uses. For instance, identifying the parameter θ of the model by identifying equations. Would that impose too much input from the modeller? I figure I came with this notion mostly because of the emphasis on proxy models the previous day at ABC in ‘burgh! Another connected item of interest in the work is the possibility of accounting for misspecification of these moment conditions by introducing a vector of errors with a spike & slab distribution, although I am not sure this is 100% necessary without getting further into the paper(s) [blame conference pressure on my time].

Another highlight was attending a fantastic poster session Monday night on computational methods except I would have needed four more hours to get through every and all posters. This new version of ISBA has split the posters between two sites (great) and themes (not so great!), while I would have preferred more sites covering all themes over all nights, to lower the noise (still bearable this year) and to increase the possibility to check all posters of interest in a particular theme…

Mentioning as well a great talk by Dan Roy about assessing deep learning performances by what he calls non-vacuous error bounds. Namely, through PAC-Bayesian bounds. One major comment of his was about deep learning models being much more non-parametric (number of parameters rising with number of observations) than parametric models, meaning that generative adversarial constructs as the one I discussed a few days ago may face a fundamental difficulty as models are taken at face value there.

On closed-form solutions, a closed-form Bayes factor for component selection in mixture models by Fũqene, Steel and Rossell that resemble the Savage-Dickey version, without the measure theoretic difficulties. But with non-local priors. And closed-form conjugate priors for the probit regression model, using unified skew-normal priors, as exhibited by Daniele Durante. Which are product of Normal cdfs and pdfs, and which allow for closed form marginal likelihoods and marginal posteriors as well. (The approach is not exactly conjugate as the prior and the posterior are not in the same family.)

And on the final session I attended, there were two talks on scalable MCMC, one on coresets, which will require some time and effort to assimilate, by Trevor Campbell and Tamara Broderick, and another one using Poisson subsampling. By Matias Quiroz and co-authors. Which did not completely convinced me (but this was the end of a long day…)

All in all, this has been a great edition of the ISBA meetings, if quite intense due to a non-stop schedule, with a very efficient organisation that made parallel sessions manageable and poster sessions back to a reasonable scale [although I did not once manage to cross the street to the other session]. Being in unreasonably sunny Edinburgh helped a lot obviously! I am a wee bit disappointed that no one else follows my call to wear a kilt, but I had low expectations to start with… And too bad I missed the Ironman 70.3 Edinburgh by one day!

Bayesian gan [gan style]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , , on June 26, 2018 by xi'an

In their paper Bayesian GANS, arXived a year ago, Saatchi and Wilson consider a Bayesian version of generative adversarial networks, putting priors on both the model and the discriminator parameters. While the prospect seems somewhat remote from genuine statistical inference, if the following statement is representative

“GANs transform white noise through a deep neural network to generate candidate samples from a data distribution. A discriminator learns, in a supervised manner, how to tune its parameters so as to correctly classify whether a given sample has come from the generator or the true data distribution. Meanwhile, the generator updates its parameters so as to fool the discriminator. As long as the generator has sufficient capacity, it can approximate the cdf inverse-cdf composition required to sample from a data distribution of interest.”

I figure the concept can also apply to a standard statistical model, where x=G(z,θ) rephrases the distributional assumption x~F(x;θ) via a white noise z. This makes resorting to a prior distribution on θ more relevant in the sense of using potential prior information on θ (although the successes of probabilistic numerics show formal priors can be used on purely numerical ground).

The “posterior distribution” that is central to the notion of Bayesian GANs is however unorthodox in that the distribution is associated with the following conditional posteriors

where D(x,θ) is the “discriminator”, that is, in GAN lingo, the probability to be allocated to the “true” data generating mechanism rather than to the one associated with G(·,θ). The generative conditional posterior (1) then aims at fooling the discriminator, i.e. favours generative parameter values that raise the probability of wrong allocation of the pseudo-data. The discriminative conditional posterior (2) is a standard Bayesian posterior based on the original sample and the generated sample. The authors then iteratively sample from these posteriors, effectively implementing a two-stage Gibbs sampler.

“By iteratively sampling from (1) and (2) at every step of an epoch one can, in the limit, obtain samples from the approximate posteriors over [both sets of parameters].”

What worries me about this approach is that  just cannot work, in the sense that (1) and (2) cannot be compatible conditional (posterior) distributions. There is no joint distribution for which (1) and (2) would be the conditionals, since the pseudo-data appears in D for (1) and (1-D) in (2). This means that the convergence of a Gibbs sampler is at best to a stationary σ-finite measure. And hence that the meaning of the chain is delicate to ascertain… Am I missing any fundamental point?! [I checked the reviews on NIPS webpage and could not spot this issue being raised.]

practical Bayesian inference [book review]

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , , , on April 26, 2018 by xi'an

[Disclaimer: I received this book of Coryn Bailer-Jones for a review in the International Statistical Review and intend to submit a revised version of this post as my review. As usual, book reviews on the ‘Og are reflecting my own definitely personal and highly subjective views on the topic!]

It is always a bit of a challenge to review introductory textbooks as, on the one hand, they are rarely written at the level and with the focus one would personally choose to write them. And, on the other hand, it is all too easy to find issues with the material presented and the way it is presented… So be warned and proceed cautiously! In the current case, Practical Bayesian Inference tries to embrace too much, methinks, by starting from basic probability notions (that should not be unknown to physical scientists, I believe, and which would avoid introducing a flat measure as a uniform distribution over the real line!, p.20). All the way to running MCMC for parameter estimation, to compare models by Bayesian evidence, and to cover non-parametric regression and bootstrap resampling. For instance, priors only make their apparition on page 71. With a puzzling choice of an improper prior (?) leading to an improper posterior (??), which is certainly not the smoothest entry on the topic. “Improper posteriors are a bad thing“, indeed! And using truncation to turn them into proper distributions is not a clear improvement as the truncation point will significantly impact the inference. Discussing about the choice of priors from the beginning has some appeal, but it may also create confusion in the novice reader (although one never knows!). Even asking about “what is a good prior?” (p.73) is not necessarily the best (and my recommended) approach to a proper understanding of the Bayesian paradigm. And arguing about the unicity of the prior (p.119) clashes with my own view of the prior being primarily a reference measure rather than an ideal summary of the available information. (The book argues at some point that there is no fixed model parameter, another and connected source of disagreement.) There is a section on assigning priors (p.113), but it only covers the case of a possibly biased coin without much realism. A feature common to many Bayesian textbooks though. To return to the issue of improper priors (and posteriors), the book includes several warnings about the danger of hitting an undefined posterior (still called a distribution), without providing real guidance on checking for its definition. (A tough question, to be sure.)

“One big drawback of the Metropolis algorithm is that it uses a fixed step size, the magnitude of which can hardly be determined in advance…”(p.165)

When introducing computational techniques, quadratic (or Laplace) approximation of the likelihood is mingled with kernel estimators, which does not seem appropriate. Proposing to check convergence and calibrate MCMC via ACF graphs is helpful in low dimensions, but not in larger dimensions. And while warning about the dangers of forgetting the Jacobians in the Metropolis-Hastings acceptance probability when using a transform like η=ln θ is well-taken, the loose handling of changes of variables may be more confusing than helpful (p.167). Discussing and providing two R codes for the (standard) Metropolis algorithm may prove too much. Or not. But using a four page R code for fitting a simple linear regression with a flat prior (pp.182-186) may definitely put the reader off! Even though I deem the example a proper experiment in setting a Metropolis algorithm and appreciate the detailed description around the R code itself. (I just take exception at the paragraph on running the code with two or even one observation, as the fact that “the Bayesian solution always exists” (p.188) [under a proper prior] is not necessarily convincing…)

“In the real world we cannot falsify a hypothesis or model any more than we “truthify” it (…) All we can do is ask which of the available models explains the data best.” (p.224)

In a similar format, the discussion on testing of hypotheses starts with a lengthy presentation of classical tests and p-values, the chapter ending up with a list of issues. Most of them reasonable in my own referential. I also concur with the conclusive remarks quoted above that what matters is a comparison of (all relatively false) models. What I less agree [as predictable from earlier posts and papers] with is the (standard) notion that comparing two models with a Bayes factor follows from the no information (in order to avoid the heavily loaded non-informative) prior weights of ½ and ½. Or similarly that the evidence is uniquely calibrated. Or, again, using a truncated improper prior under one of the assumptions (with the ghost of the Jeffreys-Lindley paradox lurking nearby…).  While the Savage-Dickey approximation is mentioned, the first numerical resolution of the approximation to the Bayes factor is via simulations from the priors. Which may be very poor in the situation of vague and uninformative priors. And then the deadly harmonic mean makes an entry (p.242), along with nested sampling… There is also a list of issues about Bayesian model comparison, including (strong) dependence on the prior, dependence on irrelevant alternatives, lack of goodness of fit tests, computational costs, including calls to possibly intractable likelihood function, ABC being then mentioned as a solution (which it is not, mostly).

Continue reading

Bayesian postdoc in Confœderatio Helvetica

Posted in Mountains, Statistics, Travel, University life with tags , , , , , , , , , on April 13, 2018 by xi'an

Antonietta Mira (Università della Svizzera italiana, Lugano) sent me this call for a postdoctoral position between Villigen, near Zürich, hence this picture) and Lugano:

Postdoctoral Fellow: Data Science/Bayesian Inference on Neutron Spectroscopy Data

Your tasks

The increasing availability of empirical large-scale neutron time-of-flight spectroscopy data and steady improvements in computational capacity have resulted in challenges as well as opportunities. This interdisciplinary SDSC-funded project “Bayesian parameter inference from stochastic models (BISTOM)” aims at developing statistical methods and software for analyzing 4D neutron spectroscopy data of quantum magnets. The project is co-directed by Prof. Dr A. Mira (Data Science Center at USI), Dr C. Albert (Eawag), and Prof. Dr Ch. Rüegg (PSI and Univ. Geneva).

Your profile

  • PhD in physics, statistics, applied mathematics or computer science
  • Solid background in neutron spectroscopy or computational statistics/Bayesian inference
  • Strong computational skills
  • Strong scientific writing and communication skills in English

Your working place will be PSI, Villigen and USI, Lugano.

We offer

Our institution is based on an interdisciplinary, innovative and dynamic collaboration. If you wish to optimally combine work and family life or other personal interests, we are able to support you with our modern employment conditions and the on-site infrastructure. Your employment contract is initially limited to 2 years, but may be extended up to 4 years in combination with other grants/fellowships e.g. Marie-Curie.

For further information please contact Prof. Dr Christian Rüegg, phone +41 56 310 47 78. Please submit your application online (including CV, list of publications and addresses of referees) for the position as a Postdoctoral Fellow (index no. 3004-00).

1500 nuances of gan [gan gan style]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on February 16, 2018 by xi'an

I recently realised that there is a currently very popular trend in machine learning called GAN [for generative adversarial networks] that strongly connects with ABC, at least in that it relies mostly on the availability of a generative model, i.e., a probability model that can be generated as in x=G(ϵ;θ), to draw inference about θ [or predictions]. For instance, there was a GANs tutorial at NIPS 2016 by Ian Goodfellow and many talks on the topic at recent NIPS, the 1500 in the title referring to the citations of the GAN paper by Goodfellow et al. (2014). (The name adversarial comes from opposing true model to generative model in the inference. )

If you remember Jeffreys‘s famous pique about classical tests as being based on improbable events that did not happen, GAN, like ABC,  is sort of the opposite in that it generates events until the one that was observed happens. More precisely, by generating pseudo-samples and switching parameters θ until these samples get as confused as possible between the data generating (“true”) distribution and the generative one. (In its original incarnation, GAN is indeed an optimisation scheme in θ.) A basic presentation of GAN is that it constructs a function D(x,ϕ) that represents the probability that x came from the true model p versus the generative model, ϕ being the parameter of a neural network trained to this effect, aimed at minimising in ϕ a two-term objective function

E[log D(x,ϕ)]+E[log(1D(G(ϵ;θ),ϕ))]

where the first expectation is taken under the true model and the second one under the generative model.

“The discriminator tries to best distinguish samples away from the generator. The generator tries to produce samples that are indistinguishable by the discriminator.” Edward

One ABC perception of this technique is that the confusion rate

E[log(1D(G(ϵ;θ),ϕ))]

is a form of distance between the data and the generative model. Which expectation can be approximated by repeated simulations from this generative model. Which suggests an extension from the optimisation approach to a ABCyesian version by selecting the smallest distances across a range of θ‘s simulated from the prior.

This notion relates to solution using classification tools as density ratio estimation, connecting for instance to Gutmann and Hyvärinen (2012). And ultimately with Geyer’s 1992 normalising constant estimator.

Another link between ABC and networks also came out during that trip. Proposed by Bishop (1994), mixture density networks (MDN) are mixture representations of the posterior [with component parameters functions of the data] trained on the prior predictive through a neural network. These MDNs can be trained on the ABC learning table [based on a specific if redundant choice of summary statistics] and used as substitutes to the posterior distribution, which brings an interesting alternative to Simon Wood’s synthetic likelihood. In a paper I missed Papamakarios and Murray suggest replacing regular ABC with this version…

JASP, a really really fresh way to do stats

Posted in Statistics with tags , , , , , , on February 1, 2018 by xi'an

Bayesian regression trees [seminar]

Posted in pictures, Statistics, University life with tags , , , , , , , , , , on January 26, 2018 by xi'an
During her visit to Paris, Veronika Rockovà (Chicago Booth) will give a talk in ENSAE-CREST on the Saclay Plateau at 2pm. Here is the abstract
Posterior Concentration for Bayesian Regression Trees and Ensembles
(joint with Stephanie van der Pas)Since their inception in the 1980’s, regression trees have been one of the more widely used non-parametric prediction methods. Tree-structured methods yield a histogram reconstruction of the regression surface, where the bins correspond to terminal nodes of recursive partitioning. Trees are powerful, yet  susceptible to over-fitting.  Strategies against overfitting have traditionally relied on  pruning  greedily grown trees. The Bayesian framework offers an alternative remedy against overfitting through priors. Roughly speaking, a good prior  charges smaller trees where overfitting does not occur. While the consistency of random histograms, trees and their ensembles  has been studied quite extensively, the theoretical understanding of the Bayesian counterparts has  been  missing. In this paper, we take a step towards understanding why/when do Bayesian trees and their ensembles not overfit. To address this question, we study the speed at which the posterior concentrates around the true smooth regression function. We propose a spike-and-tree variant of the popular Bayesian CART prior and establish new theoretical results showing that  regression trees (and their ensembles) (a) are capable of recovering smooth regression surfaces, achieving optimal rates up to a log factor, (b) can adapt to the unknown level of smoothness and (c) can perform effective dimension reduction when p>n. These results  provide a piece of missing theoretical evidence explaining why Bayesian trees (and additive variants thereof) have worked so well in practice.