## frontier of simulation-based inference

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on June 11, 2020 by xi'an

“This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, `The Science of Deep Learning,’ held March 13–14, 2019, at the National Academy of Sciences in Washington, DC.”

A paper by Kyle Cranmer, Johann Brehmer, and Gilles Louppe just appeared in PNAS on the frontier of simulation-based inference. Sounding more like a tribune than a research paper producing new input. Or at least like a review. Providing a quick introduction to simulators, inference, ABC. Stating the shortcomings of simulation-based inference as three-folded:

1. costly, since required a large number of simulated samples
2. loosing information through the use of insufficient summary statistics or poor non-parametric approximations of the sampling density.
3. wasteful as requiring new computational efforts for new datasets, primarily for ABC as learning the likelihood function (as a function of both the parameter θ and the data x) is only done once.

And the difficulties increase with the dimension of the data. While the points made above are correct, I want to note that ideally ABC (and Bayesian inference as a whole) only depends on a single dimension observation, which is the likelihood value. Or more practically that it only depends on the distance from the observed data to the simulated data. (Possibly the Wasserstein distance between the cdfs.) And that, somewhat unrealistically, that ABC could store the reference table once for all. Point 3 can also be debated in that the effort of learning an approximation can only be amortized when exactly the same model is re-employed with new data, which is likely in industrial applications but less in scientific investigations, I would think. About point 2, the paper misses part of the ABC literature on selecting summary statistics, e.g., the culling afforded by random forests ABC, or the earlier use of the score function in Martin et al. (2019).

The paper then makes a case for using machine-, active-, and deep-learning advances to overcome those blocks. Recouping other recent publications and talks (like Dennis on One World ABC’minar!). Once again presenting machine-learning techniques such as normalizing flows as more efficient than traditional non-parametric estimators. Of which I remain unconvinced without deeper arguments [than the repeated mention of powerful machine-learning techniques] on the convergence rates of these estimators (rather than extolling the super-powers of neural nets).

“A classifier is trained using supervised learning to discriminate two sets of data, although in this case both sets come from the simulator and are generated for different parameter points θ⁰ and θ¹. The classifier output function can be converted into an approximation of the likelihood ratio between θ⁰ and θ¹ (…) learning the likelihood or posterior is an unsupervised learning problem, whereas estimating the likelihood ratio through a classifier is an example of supervised learning and often a simpler task.”

The above comment is highly connected to the approach set by Geyer in 1994 and expanded in Gutmann and Hyvärinen in 2012. Interestingly, at least from my narrow statistician viewpoint!, the discussion about using these different types of approximation to the likelihood and hence to the resulting Bayesian inference never engages into a quantification of the approximation or even broaches upon the potential for inconsistent inference unlocked by using fake likelihoods. While insisting on the information loss brought by using summary statistics.

“Can the outcome be trusted in the presence of imperfections such as limited sample size, insufficient network capacity, or inefficient optimization?”

Interestingly [the more because the paper is classified as statistics] the above shows that the statistical question is set instead in terms of numerical error(s). With proposals to address it ranging from (unrealistic) parametric bootstrap to some forms of GANs.

## Naturally amazed at non-identifiability

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on May 27, 2020 by xi'an

A Nature paper by Stilianos Louca and Matthew W. Pennell,  Extant time trees are consistent with a myriad of diversification histories, comes to the extraordinary conclusion that birth-&-death evolutionary models cannot distinguish between several scenarios given the available data! Namely, stem ages and daughter lineage ages cannot identify the speciation rate function λ(.), the extinction rate function μ(.)  and the sampling fraction ρ inherently defining the deterministic ODE leading to the number of species predicted at any point τ in time, N(τ). The Nature paper does not seem to make a point beyond the obvious and I am rather perplexed at why it got published [and even highlighted]. A while ago, under the leadership of Steve, PNAS decided to include statistician reviewers for papers relying on statistical arguments. It could time for Nature to move there as well.

“We thus conclude that two birth-death models are congruent if and only if they have the same rp and the same λp at some time point in the present or past.” [S.1.1, p.4]

Or, stated otherwise, that a tree structured dataset made of branch lengths are not enough to identify two functions that parameterise the model. The likelihood looks like

$\frac{\rho^{n-1}\Psi(\tau_1,\tau_0)}{1-E(\tau)}\prod_{i=1}^n \lambda(\tau_i)\Psi(s_{i,1},\tau_i)\Psi(s_{i,2},\tau_i)$\$

where E(.) is the probability to survive to the present and ψ(s,t) the probability to survive and be sampled between times s and t. Sort of. Both functions depending on functions λ(.) and  μ(.). (When the stem age is unknown, the likelihood changes a wee bit, but with no changes in the qualitative conclusions. Another way to write this likelihood is in term of the speciation rate λp

$e^{-\Lambda_p(\tau_0)}\prod_{i=1}^n\lambda_p(\tau_I)e^{-\Lambda_p(\tau_i)}$

where Λp is the integrated rate, but which shares the same characteristic of being unable to identify the functions λ(.) and μ(.). While this sounds quite obvious the paper (or rather the supplementary material) goes into fairly extensive mode, including “abstract” algebra to define congruence.

“…we explain why model selection methods based on parsimony or “Occam’s razor”, such as the Akaike Information Criterion and the Bayesian Information Criterion that penalize excessive parameters, generally cannot resolve the identifiability issue…” [S.2, p15]

As illustrated by the above quote, the supplementary material also includes a section about statistical model selections techniques failing to capture the issue, section that seems superfluous or even absurd once the fact that the likelihood is constant across a congruence class has been stated.

## value of a chess game

Posted in pictures, Statistics, University life with tags , , , , , , , , , , , , on April 15, 2020 by xi'an

In our (internal) webinar at CEREMADE today, Miguel Oliu Barton gave a talk on the recent result his student Luc Attia and himself obtained, namely a tractable way of finding the value of a game (when minimax equals maximin), result that got recently published in PNAS:

“Stochastic games were introduced by the Nobel Memorial Prize winner Lloyd Shapley in 1953 to model dynamic interactions in which the environment changes in response to the players’ behavior. The theory of stochastic games and its applications have been studied in several scientific disciplines, including economics, operations research, evolutionary biology, and computer science. In addition, mathematical tools that were used and developed in the study of stochastic games are used by mathematicians and computer scientists in other fields. This paper contributes to the theory of stochastic games by providing a tractable formula for the value of finite competitive stochastic games. This result settles a major open problem which remained unsolved for nearly 40 years.”

While I did not see a direct consequence of this result in regular statistics, I found most interesting the comment made at one point that chess (with forced nullity after repetitions) had a value, by virtue of Zermelo’s theorem. As I had never considered the question (contrary to Shannon!). This value remains unknown.

## mining gold [ABC in PNAS]

Posted in Books, Statistics with tags , , , , , , , , , , , on March 13, 2020 by xi'an

Johann Brehmer and co-authors have just published a paper in PNAS entitled “Mining gold from implicit models to improve likelihood-free inference”. (Besides the pun about mining gold, the paper also involves techniques named RASCAL and SCANDAL, respectively! For Ratio And SCore Approximate Likelihood ratio and SCore-Augmented Neural Density Approximates Likelihood.) This setup is not ABC per se in that their simulator is used both to generate training data and construct a tractable surrogate model. Exploiting Geyer’s (1994) classification trick of expressing the likelihood ratio as the optimal classification ratio when facing two equal-size samples from one density and the other.

“For all these inference strategies, the augmented data is particularly powerful for enhancing the power of simulation-based inference for small changes in the parameter θ.”

Brehmer et al. argue that “the most important novel contribution that differentiates our work from the existing methods is the observation that additional information can be extracted from the simulator, and the development of loss functions that allow us to use this “augmented” data to more efficiently learn surrogates for the likelihood function.” Rather than starting from a statistical model, they also seem to use a scientific simulator made of multiple layers of latent variables z, where

x=F⁰(u⁰,z¹,θ), z¹=G¹(u¹,z²), z²=G¹(u²,z³), …

although they also call the marginal of x, p(x|θ), an (intractable) likelihood.

“The integral of the log is not the log of the integral!”

The central notion behind the improvement is a form of Rao-Blackwellisation, exploiting the simulated z‘s. Joint score functions and joint likelihood ratios are then available. Ignoring biases, the authors demonstrate that the closest approximation to the joint likelihood ratio and the joint score function that only depends on x is the actual likelihood ratio and the actual score function, respectively. Which sounds like an older EM result, except that the roles of estimate and target quantity are somehow inverted: one is approximating the marginal with the joint, while the marginal is the “best” approximation of the joint. But in the implementation of the method, an estimate of the (observed and intractable) likelihood ratio is indeed produced towards minimising an empirical loss based on two simulated samples. Learning this estimate ê(x) then allows one to use it for the actual data. It however requires fitting a new ê(x) for each pair of parameters. Providing as well an estimator of the likelihood p(x|θ). (Hence the SCANDAL!!!) A second type of approximation of the likelihood starts from the approximate value of the likelihood p(x|θ⁰) at a fixed value θ⁰ and expands it locally as an exponential family shift, with the score t(x|θ⁰) as sufficient statistic.

I find the paper definitely interesting even though it requires the representation of the (true) likelihood as a marginalisation over multiple layers of latent variables z. And does not provide an evaluation of the error involved in the process when the model is misspecified. As a minor supplementary appeal of the paper, the use of an asymmetric Galton quincunx to illustrate an intractable array of latent variables will certainly induce me to exploit it in projects and courses!

[Disclaimer: I was not involved in the PNAS editorial process at any point!]