## Archive for probabilistic programming

## probabilistic programming au collège [de France]

Posted in Statistics with tags Andrew, Collège de France, Galton Board, ISBA 2022, Montréal, probabilistic programming, quincunx on June 24, 2022 by xi'an

## prior elicitation

Posted in Books, Kids, Statistics, University life with tags ABC, Bayesian methods and expert elicitation, cognitive biases, conflicting prior, consensus prior, curse of dimensionality, prior elicitation, prior predictive, probabilistic programming, STAN, startup, summary statistics, whales, xkcd on January 13, 2022 by xi'an

“We believe that an elicitation method should support elicitation both in the parameter and observable space, should be model-agnostic, and should be sample-efficient since human effort is costly.”

**P**etrus Mikkola *et al.* arXived a long paper on prior elicitation addressing the (most relevant) question: *Why are we not widely using prior elicitation?* With a massive bibliography that could be (partly) commented (and corrected, as some references are incomplete, as e.g. my book chapter on priors!). I think the paper would make a terrific discussion paper.

The absence of a general procedure for prior elicitation is indeed hindering the adoption of Bayesian methods outside our core community and is thus eventually detrimental to their wider development. It also carries the dangers of misled or misleading prior choices. The authors put forward the absence of “software that integrates well with the current probabilistic programming tools used for other parts of the modelling workflow.” This requires setting principles that avoid “just-press-key” solutions. (Aside: This reminds me of my very first prospective PhD student, who was then working in a startup [although the name was not yet in use in the early 1990’s!] and had built such software in a discretised, low-dimension, conjugate prior environment, by returning a form of decision-theoretic impact of the chosen hyperparameters. He alas aborted his PhD attempt due to the short-term pressing matters in the under-staffed company…)

“We inspect prior elicitation from the perspectives of (1) properties of the prior distribution itself, (2) the model family and the prior elicitation method’s dependence on it, (3) the underlying elicitation space, (4) how the method interprets the information provided by the expert, (5) computation, (6) the form and quantity of interaction with the expert(s), and (7) the assumed capability of the expert (…)”

Prior elicitation is indeed a delicate balance between incorporating expert opinion(s) and avoiding over-standardisation. In my limited experience, experts tend to be over-confident about their own opinion and unwilling to attach uncertainty to their assessments. Even when being inconsistent. When several experts are involved (as, very briefly, in Section 3.6), building a common prior quickly becomes a challenge, esp. if their interests (or utility functions) diverge. As illustrated in the case of the whaling commission analysed by Adrian Raftery in the late 1990’s. (The above quote involves a single expert.) Actually, I dislike the term *expert* altogether, as it comes without any grading of the reliability of the person.

To hit (!) at an early statement in the paper (p.5), should the prior elicitation *always* depend on the (sampling) model, as experts may ignore or misapprehend the model? The posterior already accounts for the likelihood and the parameter may pre-exist wrt the model, as e.g. cosmological constants or vaccine efficiency… In a sense, the model should be involved as little as possible in the elicitation as the expert could confuse her beliefs about the parameter with those about the accuracy of the model. (I realise this is not necessarily a mainstream position as illustrated by this paper by Andrew and friends!)

And isn’t the first stumbling block the inability of most to represent one’s prior knowledge in probabilistic terms? Innumeracy is a shared shortcoming in the general population (and since everyone’s an expert!), as repeatedly demonstrated since the start of the Covid-19 pandemic. (See also the above point about inconsistency. Accounting for such inconsistencies in a Bayesian way is a natural answer, albeit requiring the degree of expertise and reliability to be tested.)

Is prior elicitation feasible beyond a few dimensions? Even when using the constrictive tool of copulas one hits a wall after a few dimensions, assuming the expert is willing to set a prior correlation matrix. Most of the methods described in Section 3.1 only apply to textbook examples. In their third dimension (!), the authors mention neural network parameters but later fail to cover this type of issue. (This was the example I had in mind indeed.) And they move from parameter space to observable space. Distinguishing *predictive* elicitation from *observational* elicitation, the former being what I would have suggested from scratch. Obviously, the curse of dimensionality strikes again unless one considers summary statistics (like in ABC).

While I am glad conjugate priors do not get the lion’s share, using, as in Section 3.3, non-parametric or machine learning solutions to construct the prior sounds unrealistic. (And including maximum entropy priors into that category seems wrong since they are definitely parametric.)

The proposed Bayesian treatment of the expert’s “data” (Section 4.1) is rational but requires an additional model construct to link the expert’s data with the parameter to reach a Bayes formula like (4.1). Plus a primary prior (which could then be one of the reference priors). Reducing the expert’s input to imaginary observations may prove too narrow, though. The notion of an iterative elicitation is most appealing and its sequential aspect may not be particularly problematic, in opposition to posteriors relying on using the data twice or more. I am much less sold on the hierarchical constructs of Section 4.3 because they imply a return to conjugate priors and hyperpriors, are not necessarily correctly understood by experts, do not always cater to observational elicitation, and are not an answer to high-dimension challenges.

Given the state of the art, it sounds like we are still far from seeing prior elicitation as a natural part of Bayesian software and probabilistic programming. Even when using a modular, model-agnostic strategy. But this is most certainly a worthy prospect!

## scalable Metropolis-Hastings, nested Monte Carlo, and normalising flows

Posted in Books, pictures, Statistics, University life with tags Bayesian neural networks, Bernstein-von Mises theorem, CIF, computing cost, conferences, density approximation, dissertation, doubly intractable posterior, evidence, ICML 2019, ICML 2020, image analysis, International Conference on Machine Learning, L¹ convergence, logistic regression, nesting Monte Carlo, normalising flow, PhD, probabilistic programming, quarantine, SAME algorithm, scalable MCMC, thesis defence, University of Oxford, variational autoencoders, viva on June 16, 2020 by xi'an

**O**ver a sunny if quarantined Sunday, I started reading the PhD dissertation of Rob Cornish, Oxford University, as I am the external member of his viva committee. Ending up in a highly pleasant afternoon discussing this thesis over a (remote) viva yesterday. (If bemoaning a lost opportunity to visit Oxford!) The introduction to the viva was most helpful and set the results within the different time and geographical zones of the Ph.D since Rob had to switch from one group of advisors in Engineering to another group in Statistics. Plus an encompassing prospective discussion, expressing pessimism at exact MCMC for complex models and looking forward to further advances in probabilistic programming.

Made of three papers, the thesis includes this ICML 2019 [remember the era when there were conferences?!] paper on scalable Metropolis-Hastings, by Rob Cornish, Paul Vanetti, Alexandre Bouchard-Côté, George Deligiannidis, and Arnaud Doucet, which I commented on last year. Which achieves a remarkable and paradoxical O(1/√n) cost per iteration, provided (global) lower bounds are found on the (local) Metropolis-Hastings acceptance probabilities since they allow for Poisson thinning à la Devroye (1986) and second order Taylor expansions constructed for all components of the target, with the third order derivatives providing bounds. However, the variability of the acceptance probability gets higher, which induces a longer running time, still manageable if the concentration of the posterior is in tune with the Bernstein-von Mises asymptotics. I had not paid enough attention on my first read to the strong theoretical justification for the method, relying on the convergence of MAP estimates in well- and (some) mis-specified settings. Now, I would have liked to see the paper dealing with a more complex problem than logistic regression.
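To make the thinning idea concrete, here is a minimal sketch of how an accept/reject decision with probability equal to a product of n factors can be simulated while touching only a handful of them. It is not the algorithm of the paper: the function names, the common bound shared by all factors, and the toy factors are assumptions made for illustration. The only ingredient taken from the discussion above is that each factor α_i admits a lower bound, i.e. that −log α_i is bounded above.

```python
import math
import random

def poisson(rate, rng):
    """Draw from Poisson(rate) by counting unit-rate exponential arrivals in [0, rate)."""
    count, t = 0, rng.expovariate(1.0)
    while t < rate:
        count += 1
        t += rng.expovariate(1.0)
    return count

def thinned_accept(neg_log_alpha, bound, n, rng):
    """Return True with probability prod_i alpha_i, assuming
    0 <= neg_log_alpha(i) = -log(alpha_i) <= bound for every i (hypothetical setup).
    Only about n * bound factors are evaluated on average, instead of all n."""
    n_candidates = poisson(n * bound, rng)       # events of the dominating process
    for _ in range(n_candidates):
        i = rng.randrange(n)                     # uniform index, since all bounds are equal
        if rng.random() < neg_log_alpha(i) / bound:
            return False                         # a retained event means rejection
    return True                                  # no retained event: accept

# Toy factors: -log(alpha_i) is 5e-5 for even i and 1e-4 for odd i.
n, bound = 10_000, 1e-4
neg_log_alpha = lambda i: bound * (0.5 + 0.5 * (i % 2))
target = math.exp(-sum(neg_log_alpha(i) for i in range(n)))   # exact product of the alpha_i

rng = random.Random(0)
n_trials = 20_000
frequency = sum(thinned_accept(neg_log_alpha, bound, n, rng) for _ in range(n_trials)) / n_trials
print(frequency, target)
```

The retained events form a Poisson process with total intensity Σ −log α_i, so the probability of seeing none is exactly Π α_i, while the expected number of factor evaluations per decision is n × bound (one, in this toy case) rather than n.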

The second paper in the thesis is an ICML 2018 proceeding by Tom Rainforth, Robert Cornish, Hongseok Yang, Andrew Warrington, and Frank Wood, which considers Monte Carlo problems involving several *nested* expectations in a non-linear manner, meaning that (a) several levels of Monte Carlo approximations are required, with associated asymptotics, and (b) the resulting overall estimator is biased. This includes common doubly intractable posteriors, obviously, as well as (Bayesian) design and control problems. [And it has nothing to do with nested sampling.] The resolution chosen by the authors is strictly plug-in, in that they replace each level in the nesting with a Monte Carlo substitute and do not attempt to reduce the bias. Which means a wide range of solutions (other than the plug-in one) could have been investigated, including bootstrap maybe. For instance, Bayesian design is presented as an application of the approach, but since it relies on the log-evidence, there exist several versions for estimating (unbiasedly) this log-evidence. Similarly, the Forsythe-von Neumann technique applies to arbitrary transforms of a primary integral. The central discussion dwells on the optimal choice of the volume of simulations at each level, optimal in terms of asymptotic MSE. Or rather asymptotic bound on the MSE. The interesting result being that the outer expectation requires the square of the number of simulations for the other expectations. Which all need to go to infinity. A trick in finding an estimator for a polynomial transform reminded me of the SAME algorithm in that it duplicated the simulations as many times as the highest power of the polynomial. (The ‘Og briefly reported on this paper… four years ago.)
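A toy version of the plug-in estimator may help fix ideas. The sketch below is not taken from the paper: the Gaussian model, the functions g and f, and the budgets are assumptions picked so that the bias is available in closed form. With y, z ~ N(0,1), g(y,z) = y + z and f(u) = u², the target E_y[f(E_z[g(y,z)])] equals 1, while the plug-in estimator with M inner draws has expectation 1 + 1/M.

```python
import random

def nested_plug_in(n_outer, n_inner, rng):
    """Plug-in estimate of E_y[ f( E_z[g(y, z)] ) ] for the toy problem
    y ~ N(0, 1), z ~ N(0, 1), g(y, z) = y + z, f(u) = u ** 2."""
    total = 0.0
    for _ in range(n_outer):
        y = rng.gauss(0.0, 1.0)
        # Inner level: Monte Carlo substitute for E_z[g(y, z)], whose exact value is y.
        inner = sum(y + rng.gauss(0.0, 1.0) for _ in range(n_inner)) / n_inner
        # Outer level: the non-linear f is applied to the noisy inner estimate.
        total += inner ** 2
    return total / n_outer

rng = random.Random(0)
n_inner = 25
n_outer = n_inner ** 2            # outer budget taken as the square of the inner one
estimates = [nested_plug_in(n_outer, n_inner, rng) for _ in range(20)]
mean_estimate = sum(estimates) / len(estimates)

# The exact value is 1; the plug-in estimator is centred at 1 + 1/n_inner instead.
print(mean_estimate, 1.0 + 1.0 / n_inner)
```

In this example the squared bias is 1/M² and the variance is of order 1/N, which is why balancing the two terms of the MSE leads to an outer sample size N growing like M².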

The third and last part of the thesis is a proposal [to appear in ICML 2020] on relaxing bijectivity constraints in normalising flows with continuously indexed flows. (Or CIF. As Rob made a joke about this cleaning brand, let me add (?) to that joke by mentioning that looking at CIF and *bijections* is less dangerous in a Trump cum COVID era than looking at CIF and *injections*!) With Anthony Caterini, George Deligiannidis and Arnaud Doucet as co-authors. I am much less familiar with this area and hence a wee bit puzzled at the purpose of removing what I understand to be an appealing side of normalising flows, namely to produce a manageable representation of density functions as a combination of bijective and differentiable functions of a baseline random vector, like a standard Normal vector. The argument made in the paper is that imposing this representation of the density imposes a constraint on the topology of its support since said support is homeomorphic to the support of the baseline random vector. While the supporting theoretical argument is a mathematical theorem that shows the Lipschitz bound on the transform should be infinity in the case the supports are topologically different, these arguments may be overly theoretical when faced with the practical implications of the replacement strategy. I somewhat miss its overall strength given that the whole point seems to be in approximating a density function, based on a finite sample.
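For reference, the "manageable representation" alluded to above is the usual change-of-variables formula for a bijective and differentiable transform. The notation (f for the flow, p_Z for the baseline density, p_X for the resulting density) is chosen here for illustration and is not taken from the thesis.

```latex
x = f(z), \quad z \sim p_Z
\qquad\Longrightarrow\qquad
p_X(x) \;=\; p_Z\bigl(f^{-1}(x)\bigr)\,
\left|\det \frac{\partial f^{-1}(x)}{\partial x}\right| .
```

Since f is continuous with a continuous inverse, the support of p_X is the image under f of the support of p_Z, which is where the topological constraint discussed in the paper comes from.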

## Bayesian conjugate gradients [open for discussion]

Posted in Books, pictures, Statistics, University life with tags Bayesian Analysis, Bayesian methods for hackers, discussion paper, probabilistic numerics, probabilistic programming, University of Warwick on June 25, 2019 by xi'an

**W**hen fishing for an illustration for this post on Google, I came upon this Bayesian methods for hackers cover, a book about which I have no clue whatsoever (!) but that mentions probabilistic programming. Which serves as a perfect (?!) introduction to the call for discussion in Bayesian Analysis of the incoming Bayesian conjugate gradient method by Jon Cockayne, Chris Oates (formerly Warwick), Ilse Ipsen and Mark Girolami (still partially Warwick!). Since indeed the paper is about probabilistic numerics à la Mark and co-authors. Surprisingly dealing with solving the deterministic equation Ax=b by Bayesian methods. The method produces a posterior distribution on the solution x⁰, given a fixed computing effort, which makes it an anytime algorithm. It also relates to an earlier 2015 paper by Philipp Hennig where the posterior is on A⁻¹ rather than x⁰ (which is quite a surprising if valid approach to the problem!). The computing effort is translated here in computations of projections of random projections of Ax, which can be made compatible with conjugate gradient steps. Interestingly, the choice of the prior on x is quite important, including setting a low or high convergence rate… **Deadline is August 04!**
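As a rough illustration of what "a posterior distribution on the solution" means here, the sketch below puts a Gaussian prior on x and conditions it, one direction at a time, on the exact linear observations sᵀAx = sᵀb. This is plain Gaussian conditioning and not the algorithm of the paper: the 2×2 system, the N(0, I) prior, and the coordinate search directions are assumptions chosen for illustration, whereas the paper selects its directions to match conjugate gradient steps.

```python
def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))

def condition_on_projection(mean, cov, a_mat, b, s):
    """Condition x ~ N(mean, cov) on the noise-free observation s^T A x = s^T b
    (A square, stored as a list of rows)."""
    n = len(mean)
    # Observation functional a = A^T s, so that the observation reads a^T x = s^T b.
    a = [sum(a_mat[i][j] * s[i] for i in range(n)) for j in range(n)]
    cov_a = [dot(row, a) for row in cov]          # Sigma a
    scale = dot(a, cov_a)                         # a^T Sigma a, prior variance of a^T x
    residual = dot(s, b) - dot(a, mean)           # observed value minus prior prediction
    new_mean = [m + c * residual / scale for m, c in zip(mean, cov_a)]
    new_cov = [[cov[i][j] - cov_a[i] * cov_a[j] / scale for j in range(n)]
               for i in range(n)]
    return new_mean, new_cov

# Hypothetical system, with exact solution x = (1/11, 7/11).
a_mat = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
mean, cov = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]   # prior N(0, I) on x

# After one direction: a partial answer that still carries uncertainty.
mean_1, cov_1 = condition_on_projection(mean, cov, a_mat, b, [1.0, 0.0])
# After two independent directions: the posterior collapses onto the solution.
mean_2, cov_2 = condition_on_projection(mean_1, cov_1, a_mat, b, [0.0, 1.0])
print(mean_1, cov_1)
print(mean_2, cov_2)
```

Stopping after the first direction returns a mean together with a non-degenerate covariance, which is the anytime aspect mentioned above, and both depend on the prior on x.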

## Elves to the ABC rescue!

Posted in Books, Kids, Statistics with tags ABC, ELFI, Finnish Elves, gaussian process, Mauri Kunnas, probabilistic programming, software on November 7, 2018 by xi'an

Marko Järvenpää, Michael Gutmann, Arijus Pleska, Aki Vehtari, and Pekka Marttinen have written a paper on Efficient Acquisition Rules for Model-Based Approximate Bayesian Computation, soon to appear in Bayesian Analysis, that gives me the right nudge to mention the ELFI software they have been contributing to for a while. Where the acronym stands for engine for likelihood-free inference. Written in Python, DAG based, and covering methods like the

- ABC rejection sampler
- Sequential Monte Carlo ABC sampler
- Bayesian Optimization for Likelihood-Free Inference (BOLFI) framework
- Bayesian Optimization (not likelihood-free)
- No-U-Turn-Sampler (not likelihood-free)

[Warning: I did not experiment with the software! Feel free to share.]
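For readers unfamiliar with the first item in the list, here is a bare-bones ABC rejection sampler in plain Python. It does not use ELFI and makes no attempt at mimicking its interface: the Normal model, the uniform prior, the sample mean as summary statistic, and the tolerance are all assumptions chosen for illustration.

```python
import random

def abc_rejection(observed_summary, n_obs, n_accept, tolerance, rng):
    """Keep prior draws whose simulated summary falls within `tolerance`
    of the observed one (toy model: data ~ N(theta, 1), theta ~ U(-5, 5))."""
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.uniform(-5.0, 5.0)                       # draw from the prior
        simulated = [rng.gauss(theta, 1.0) for _ in range(n_obs)]
        summary = sum(simulated) / n_obs                     # summary statistic: sample mean
        if abs(summary - observed_summary) <= tolerance:     # distance below the tolerance
            accepted.append(theta)
    return accepted

rng = random.Random(1)
observed = [rng.gauss(2.0, 1.0) for _ in range(20)]          # pseudo-observed data
observed_summary = sum(observed) / len(observed)

sample = abc_rejection(observed_summary, n_obs=20, n_accept=300, tolerance=0.1, rng=rng)
abc_mean = sum(sample) / len(sample)
print(observed_summary, abc_mean)
```

Only about two percent of the prior draws are accepted in this setting, which gives a feel for why smarter ways of proposing the next parameter value matter when each simulation is costly.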

“…little work has focused on trying to quantify the amount of uncertainty in the estimator of the ABC posterior density under the chosen modelling assumptions. This uncertainty is due to a finite computational budget to perform the inference and could be thus also called as computational uncertainty.”

The paper is about looking at the “real” ABC distribution, that is, the one resulting from a realistic perspective of a finite number of simulations and acceptances. By acquisition, the authors mean an efficient way to propose the next value of the parameter θ, towards minimising the uncertainty in the ABC density estimate. Note that this involves a loss function that must be chosen by the analyst and then available for the minimisation program. If this sounds complicated…

“…our interest is to design the evaluations to minimise the uncertainty in a quantity that itself describes the uncertainty of the parameters of a costly simulation model.”

it indeed is and it requires modelling choices. As in Gutmann and Corander (2016), which was also concerned with designing the location of the learning parameters, the modelling is based here on a Gaussian process for the discrepancy between the observed and the simulated data. Which provides an estimate of the likelihood, later used for selecting the next sampling value of θ. The final ABC sample is however produced by a GP estimation of the ABC distribution.

As noted by the authors, the method may prove quite time consuming: for instance, one involved model required one minute of computation time for selecting the next evaluation location. (I had a bit of a difficulty when reading the paper as I kept hitting notions that are local to the paper but not immediately or precisely defined. As “adequation function” [p.11] or “discrepancy”. Maybe correlated with short nights while staying at CIRM for the Masterclass, always waking up around 4am for unknown reasons!)