Archive for probabilistic programming

scalable Metropolis-Hastings, nested Monte Carlo, and normalising flows

Posted in Books, pictures, Statistics, University life on June 16, 2020 by xi'an

Over a sunny if quarantined Sunday, I started reading the PhD dissertation of Rob Cornish, Oxford University, as I am the external member of his viva committee. Ending up in a highly pleasant afternoon discussing this thesis over a (remote) viva yesterday. (If bemoaning a lost opportunity to visit Oxford!) The introduction to the viva was most helpful and set the results within the different time and geographical zones of the PhD, since Rob had to switch from one group of advisors in Engineering to another group in Statistics. Plus an encompassing prospective discussion, expressing pessimism at exact MCMC for complex models and looking forward to further advances in probabilistic programming.

Made of three papers, the thesis includes this ICML 2019 [remember the era when there were conferences?!] paper on scalable Metropolis-Hastings, by Rob Cornish, Paul Vanetti, Alexandre Bouchard-Côté, George Deligiannidis, and Arnaud Doucet, which I commented on last year. Which achieves a remarkable and paradoxical O(1/√n) cost per iteration, provided (global) lower bounds are found on the (local) Metropolis-Hastings acceptance probabilities since they allow for Poisson thinning à la Devroye (1986) and second order Taylor expansions constructed for all components of the target, with the third order derivatives providing bounds. However, the variability of the acceptance probability gets higher, which induces a longer but still manageable if the concentration of the posterior is in tune with the Bernstein-von Mises asymptotics. I had not paid enough attention on my first read to the strong theoretical justification for the method, relying on the convergence of MAP estimates in well- and (some) mis-specified settings. Now, I would have liked to see the paper dealing with a more complex problem than logistic regression.

The second paper in the thesis is an ICML 2018 proceeding by Tom Rainforth, Robert Cornish, Hongseok Yang, Andrew Warrington, and Frank Wood, which considers Monte Carlo problems involving several nested expectations in a non-linear manner, meaning that (a) several levels of Monte Carlo approximations are required, with associated asymptotics, and (b) the resulting overall estimator is biased. This includes common doubly intractable posteriors, obviously, as well as (Bayesian) design and control problems. [And it has nothing to do with nested sampling.] The resolution chosen by the authors is strictly plug-in, in that they replace each level in the nesting with a Monte Carlo substitute and do not attempt to reduce the bias. Which means a wide range of solutions (other than the plug-in one) could have been investigated, including bootstrap maybe. For instance, Bayesian design is presented as an application of the approach, but since it relies on the log-evidence, there exist several versions for estimating (unbiasedly) this log-evidence. Similarly, the Forsythe-von Neumann technique applies to arbitrary transforms of a primary integral. The central discussion dwells on the optimal choice of the volume of simulations at each level, optimal in terms of asymptotic MSE. Or rather asymptotic bound on the MSE. The interesting result being that the outer expectation requires the square of the number of simulations for the other expectations. Which all need to converge to infinity. A trick in finding an estimator for a polynomial transform reminded me of the SAME algorithm in that it duplicated the simulations as many times as the highest power of the polynomial. (The ‘Og briefly reported on this paper… four years ago.)
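To make the plug-in bias concrete, here is a minimal Python sketch on a toy problem chosen purely for illustration (it is not taken from the paper). The target is E_y[(E_z[y+z])²] with y and z independent standard Normals, so the true value is 1, while the plug-in estimator based on M inner draws has expectation 1 + 1/M. The name nested_mc and the N = M² allocation in the loop are assumptions made for this example.

```python
import random

def nested_mc(n_outer, n_inner, rng):
    """Plug-in estimate of E_y[f(E_z[g(y, z)])] with f(u) = u**2 and g(y, z) = y + z."""
    total = 0.0
    for _ in range(n_outer):
        y = rng.gauss(0.0, 1.0)
        # Inner Monte Carlo average standing in for the exact value E_z[y + z] = y
        inner = sum(y + rng.gauss(0.0, 1.0) for _ in range(n_inner)) / n_inner
        # Non-linear transform of a noisy inner estimate: this is where the bias enters
        total += inner ** 2
    return total / n_outer

if __name__ == "__main__":
    rng = random.Random(2020)
    for m in (4, 16, 64):
        n = m * m  # outer budget taken as the square of the inner budget
        print(m, n, nested_mc(n, m, rng), 1.0 + 1.0 / m)
```

The last column printed is the expectation of the plug-in estimator, which only reaches the true value 1 when the inner budget M goes to infinity as well.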

The third and last part of the thesis is a proposal [to appear in ICML 20] on relaxing bijectivity constraints in normalising flows with continuously indexed flows. (Or CIF. As Rob made a joke about this cleaning brand, let me add (?) to that joke by mentioning that looking at CIF and bijections is less dangerous in a Trump cum COVID era than looking at CIF and injections!) With Anthony Caterini, George Deligiannidis and Arnaud Doucet as co-authors. I am much less familiar with this area and hence a wee bit puzzled at the purpose of removing what I understand to be an appealing side of normalising flows, namely to produce a manageable representation of density functions as a combination of bijective and differentiable functions of a baseline random vector, like a standard Normal vector. The argument made in the paper is that imposing this representation of the density imposes a constraint on the topology of its support since said support is homeomorphic to the support of the baseline random vector. While the supporting theoretical argument is a mathematical theorem that shows the Lipschitz bound on the transform should be infinity in the case the supports are topologically different, these arguments may be overly theoretical when faced with the practical implications of the replacement strategy. I somewhat miss its overall strength given that the whole point seems to be in approximating a density function, based on a finite sample.
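As a reminder of what the bijectivity buys, here is a minimal Python sketch of the change-of-variables formula behind normalising flows, for a one-dimensional toy flow x = exp(z) with a standard Normal baseline. The choice of transform and the function names are assumptions made for illustration, not the construction used in the paper. Note that the support of x, the positive half-line, is homeomorphic to the real line, in line with the topological constraint discussed above.

```python
import math

def base_logpdf(z):
    """Log-density of the standard Normal baseline."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow_logpdf(x):
    """Log-density of x = exp(z), z ~ N(0, 1), by the change-of-variables formula."""
    if x <= 0.0:
        return -math.inf  # outside the image of the bijection
    z = math.log(x)             # inverse transform
    log_abs_jac = -math.log(x)  # log |dz/dx| = -log x
    return base_logpdf(z) + log_abs_jac

if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0):
        print(x, math.exp(flow_logpdf(x)))
```

The resulting density is the log-Normal one, which gives a closed form to check against.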

Bayesian conjugate gradients [open for discussion]

Posted in Books, pictures, Statistics, University life on June 25, 2019 by xi'an

When fishing for an illustration for this post on Google, I came upon this Bayesian methods for hackers cover, a book about which I have no clue whatsoever (!) but that mentions probabilistic programming. Which serves as a perfect (?!) introduction to the call for discussion in Bayesian Analysis of the incoming Bayesian conjugate gradient method by Jon Cockayne, Chris Oates (formerly Warwick), Ilse Ipsen and Mark Girolami (still partially Warwick!). Since indeed the paper is about probabilistic numerics à la Mark and co-authors. Surprisingly dealing with solving the deterministic equation Ax=b by Bayesian methods. The method produces a posterior distribution on the solution x⁰, given a fixed computing effort, which makes it pertain to the anytime algorithms. It also relates to an earlier 2015 paper by Christian Hennig where the posterior is on A⁻¹ rather than x⁰ (which is quite a surprising if valid approach to the problem!). The computing effort is translated here in computations of projections of random projections of Ax, which can be made compatible with conjugate gradient steps. Interestingly, the choice of the prior on x is quite important, including setting a low or high convergence rate… Deadline is August 04!
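To fix ideas on what a posterior on the solution of Ax=b looks like, here is a minimal Python sketch of the Gaussian conditioning step, assuming a Normal prior on x and noise-free observations of the form sᵀAx = sᵀb for a search direction s. This is plain Gaussian conditioning on one projection at a time, written for illustration; it is not the authors' implementation, the directions used below are the canonical basis vectors rather than conjugate gradient directions, and all function names are hypothetical.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def condition_on_projection(mean, cov, A, b, s):
    """Condition x ~ N(mean, cov) on the noise-free observation s^T A x = s^T b."""
    n = len(mean)
    # a = A^T s, so that the observation reads a^T x = s^T b
    a = [sum(A[i][j] * s[i] for i in range(len(s))) for j in range(n)]
    cov_a = matvec(cov, a)            # cov @ a (cov is symmetric)
    denom = dot(a, cov_a)             # prior variance of the observed projection
    resid = dot(s, b) - dot(a, mean)  # discrepancy between observation and prior mean
    new_mean = [m + c * resid / denom for m, c in zip(mean, cov_a)]
    new_cov = [[cov[i][j] - cov_a[i] * cov_a[j] / denom for j in range(n)]
               for i in range(n)]
    return new_mean, new_cov

if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    mean, cov = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]  # N(0, I) prior on x
    for s in ([1.0, 0.0], [0.0, 1.0]):
        mean, cov = condition_on_projection(mean, cov, A, b, s)
        print(mean, cov)
```

Stopping after the first direction returns a genuine (degenerate) Normal posterior on x, which is the anytime aspect, while two linearly independent directions in dimension two recover the exact solution with zero posterior covariance.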

Elves to the ABC rescue!

Posted in Books, Kids, Statistics on November 7, 2018 by xi'an

Marko Järvenpää, Michael Gutmann, Arijus Pleska, Aki Vehtari, and Pekka Marttinen have written a paper on Efficient Acquisition Rules for Model-Based Approximate Bayesian Computation soon to appear in Bayesian Analysis that gives me the right nudge to mention the ELFI software they have been contributing to for a while. Where the acronym stands for engine for likelihood-free inference. Written in Python, DAG based, and covering methods like the

  • ABC rejection sampler
  • Sequential Monte Carlo ABC sampler
  • Bayesian Optimization for Likelihood-Free Inference (BOLFI) framework
  • Bayesian Optimization (not likelihood-free)
  • No-U-Turn-Sampler (not likelihood-free)

[Warning: I did not experiment with the software! Feel free to share.]
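For readers unfamiliar with the first item in the list, here is a minimal sketch of an ABC rejection sampler in plain Python. It deliberately does not use the ELFI API; the toy model (a Normal mean with a uniform prior on (-5, 5), the sample mean as summary statistic), the tolerance, and the function name are all assumptions made for illustration.

```python
import random
import statistics

def abc_rejection(observed, n_proposals, eps, rng):
    """Keep the prior draws whose simulated summary falls within eps of the observed one."""
    s_obs = statistics.fmean(observed)
    accepted = []
    for _ in range(n_proposals):
        theta = rng.uniform(-5.0, 5.0)                    # draw from the prior
        fake = [rng.gauss(theta, 1.0) for _ in observed]  # simulate pseudo-data
        if abs(statistics.fmean(fake) - s_obs) < eps:     # compare summary statistics
            accepted.append(theta)
    return accepted

if __name__ == "__main__":
    data = [0.5, 1.0, 1.5, 2.0, 2.5, 0.7, 2.3, 1.3, 1.7, 1.5]
    sample = abc_rejection(data, 10000, 0.1, random.Random(2018))
    print(len(sample), statistics.fmean(sample))
```

The accepted values form a sample from the ABC posterior attached to this tolerance and this summary statistic, at the cost of one simulation from the model per proposal, which is what the more elaborate methods in the list try to reduce.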

“…little work has focused on trying to quantify the amount of uncertainty in the estimator of the ABC posterior density under the chosen modelling assumptions. This uncertainty is due to a finite computational budget to perform the inference and could be thus also called as computational uncertainty.”

The paper is about looking at the “real” ABC distribution, that is, the one resulting from a realistic perspective of a finite number of simulations and acceptances. By acquisition, the authors mean an efficient way to propose the next value of the parameter θ, towards minimising the uncertainty in the ABC density estimate. Note that this involves a loss function that must be chosen by the analyst and then available for the minimisation program. If this sounds complicated…

“…our interest is to design the evaluations to minimise the uncertainty in a quantity that itself describes the uncertainty of the parameters of a costly simulation model.”

it indeed is and it requires modelling choices. As in Gutmann and Corander (2016), which was also concerned with designing the location of the learning parameters, the modelling is based here on a Gaussian process for the discrepancy between the observed and the simulated data. Which provides an estimate of the likelihood, later used for selecting the next sampling value of θ. The final ABC sample is however produced by a GP estimation of the ABC distribution. As noted by the authors, the method may prove quite time consuming: for instance, one involved model required one minute of computation time for selecting the next evaluation location. (I had a bit of difficulty when reading the paper as I kept hitting notions that are local to the paper but not immediately or precisely defined. As “adequation function” [p.11] or “discrepancy”. Maybe correlated with short nights while staying at CIRM for the Masterclass, always waking up around 4am for unknown reasons!)

IMS workshop [day 3]

Posted in pictures, R, Statistics, Travel, University life on August 30, 2018 by xi'an

I made the “capital” mistake of walking across the entire NUS campus this morning, which is quite green and pretty, but which almost enjoys an additional dimension brought by such an intense humidity that one feels like having to get around this humidity!, a feature I have managed to completely erase from my memory of my previous visit there. Anyway, nothing of any relevance. One talk in the morning was by Markus Eisenbach on tools used by physicists to speed up Monte Carlo methods, like the Wang-Landau flat histogram, towards computing the partition function, or the distribution of the energy levels, definitely addressing issues close to my interest, but somewhat beyond my reach for using a different language and stress, as often in physics. (I mean, as often in physics talks I attend.) An idea that came out clear to me was to bypass a (flat) histogram target and aim directly at a constant slope cdf for the energy levels. (But got scared away by the Fourier transforms!)

Lawrence Murray then discussed some features of the Birch probabilistic programming language he is currently developing, especially a fairly fascinating concept of delayed sampling, which connects with locally-optimal proposals and Rao Blackwellisation. Which I plan to get back to later [and hopefully sooner than later!].

In the afternoon, Maria de Iorio gave a talk about the construction of nonparametric priors that create dependence between a sequence of functions, a notion I had not thought of before, with an array of possibilities when using the stick breaking construction of Dirichlet processes.
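For reference, the basic (non-dependent) stick-breaking construction of the Dirichlet process weights can be sketched in a few lines of Python. This is only the standard truncated construction, with weights w_k = V_k ∏_{j<k}(1 - V_j) and V_k ~ Beta(1, α), not the dependent priors presented in the talk; the truncation level and the function name are assumptions made for illustration.

```python
import random

def stick_breaking(alpha, n_sticks, rng):
    """First n_sticks weights of a Dirichlet process with concentration alpha."""
    weights, remaining = [], 1.0
    for _ in range(n_sticks):
        v = rng.betavariate(1.0, alpha)  # V_k ~ Beta(1, alpha)
        weights.append(remaining * v)    # break off a fraction V_k of what is left
        remaining *= 1.0 - v             # length of the stick still to be broken
    return weights

if __name__ == "__main__":
    w = stick_breaking(2.0, 50, random.Random(2018))
    print(w[:5], sum(w))
```

Dependence between several random distributions can then be introduced by linking the V_k's (or the atoms) across these distributions, which is where the array of possibilities comes from.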

And Christophe Andrieu gave a very smooth and helpful entry to piecewise deterministic Markov processes (PDMP) in preparation for talks he is giving next week for the continuation of the workshop at IMS. Starting with the guided random walk of Gustafson (1998), which extended a bit later into the non-reversible paper of Diaconis, Holmes, and Neal (2000). Although I had a vague idea of the contents of these papers, the role of the velocity ν became much clearer. And premonitory of the advances made by the more recent PDMP proposals. There is obviously a continuation with the equally pedagogical talk Christophe gave at MCqMC in Rennes two months [and half the globe] ago, but the focus being somewhat different, it really felt like a new talk [my short term memory may also play some role in this feeling!, as I now remember the discussion of Hildebrand (2002) for non-reversible processes]. An introduction to the topic I would recommend to anyone interested in this new branch of Monte Carlo simulation! To be followed by the most recently arXived hypocoercivity paper by Christophe and co-authors.
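The role of the velocity is easiest to see in code. Below is a minimal Python sketch of a guided random walk in the spirit of Gustafson (1998), on a one-dimensional target: the chain keeps moving in the direction ν ∈ {-1, +1} as long as proposals are accepted and flips ν upon rejection. The Normal step size, the standard Normal target in the usage example, and the function name are assumptions made for illustration.

```python
import math
import random

def guided_walk(log_target, x0, n_iter, step, rng):
    """Guided random walk: move along the velocity v, flip v at each rejection."""
    x, v = x0, 1.0
    chain = []
    for _ in range(n_iter):
        y = x + v * abs(rng.gauss(0.0, step))  # proposal always in the direction v
        log_ratio = log_target(y) - log_target(x)
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = y      # accepted: keep going the same way
        else:
            v = -v     # rejected: stay put and reverse the direction
        chain.append(x)
    return chain

if __name__ == "__main__":
    chain = guided_walk(lambda x: -0.5 * x * x, 0.0, 20000, 1.0, random.Random(1998))
    mean = sum(chain) / len(chain)
    var = sum((c - mean) ** 2 for c in chain) / len(chain)
    print(mean, var)
```

The pair (x, ν) is a Markov chain that leaves the product of the target and the uniform distribution on {-1, +1} invariant, while the x component alone is non-reversible, which is what suppresses part of the random walk behaviour.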

Bayesian program synthesis

Posted in Books, pictures, Statistics, University life on April 7, 2017 by xi'an

Last week, I (along with Jean-Michel Marin) got an email from a journalist working for Science & Vie, a French science magazine that published a few years ago a special issue on Bayes’ theorem. (With the insane title of “the formula that deciphers the World!”) The reason for this call was the preparation of a paper on Gamalon, a new AI company that relies on (Bayesian) probabilistic programming to devise predictive tools. And we spent an hour skyping with him about Bayesian inference, probabilistic programming and machine-learning, at the general level since we had not heard previously of this company or of its central tool.

“the Gamalon BPS system learns from only a few examples, not millions. It can learn using a tablet processor, not hundreds of servers. It learns right away while we play with it, not over weeks or months. And it learns from just one person, not from thousands.”

Gamalon claims to do much better than deep learning at those tasks. Not that I have reasons to doubt that claim, quite the opposite, an obvious reason being that incorporating rules and probabilistic models in the predictor is going to help if these rules and models are even moderately realistic, another major one being that handling uncertainty and learning by Bayesian tools is usually a good idea (!), and yet another significant one being that David Blei is a member of their advisory committee. But it is hard to get a feeling for such claims when the only element in the open is the use of probabilistic programming, which is an advanced and efficient manner of conducting model building and updating and handling (posterior) distributions as objects, but which does not enjoy higher predictive abilities by default. Unless I live with a restricted definition of what probabilistic programming stands for! In any case, the video provided by Gamalon and the presentation given by its CEO do not help in my understanding of the principles behind this massive gain in efficiency. Which makes sense given that the company would not want to give up their edge on the competition.

Incidentally, the video in this presentation comparing the predictive abilities of the four major astronomical explanations of the solar system is great. If not particularly connected with the difference between deep learning and Bayesian probabilistic programming.