## Archive for image analysis

## how to count excess deaths?

Posted in Books, Kids, pictures, Statistics with tags bad graph, capture-recapture, COVID-19, death certificate, entity resolution, excess mortality, image analysis, infographics, INSEE, Japan, Jeune economist, Nature, official statistics, record linkage, satellite, The Economist on February 17, 2022 by xi'an

Another terrible graph from Nature… With vertical bars meaning nothing. Nothing more than the list of three values and both confidence intervals. But the associated article is quite interesting in investigating the difficulties in assessing the number of deaths due to COVID-19, when official death statistics are (almost) as shaky as the official COVID-19 death counts. Even in countries with sound mortality statistics and trustworthy official statistics institutes. This article contrasts the prediction models run by the Institute for Health Metrics and Evaluation and by The Economist. The latter being a machine-learning prediction procedure based on a large number of covariates. Without looking under the hood, it is unclear to me how poor entries across the array of covariates can be corrected to return a meaningful prediction. It is also striking that the model predicts far fewer excess deaths than those due to COVID-19 in a developed country like Japan. Survey methods are briefly mentioned at the end of the article, with interesting attempts to use satellite images of burial grounds, but no further techniques like capture-recapture or record linkage and entity resolution.

## statistical analysis of GANs

Posted in Books, Statistics with tags Annals of Statistics, asymptotics, GANs, generative discriminative networks, image analysis, Kullback-Leibler divergence, machine learning, neural network, Université Paris-La Sorbonne on May 24, 2021 by xi'an

**M**y friend Gérard Biau and his coauthors published a paper in the Annals of Statistics last year on the theoretical [statistical] analysis of GANs, which I had missed and recently read with a definite interest in the issues. (With no image example!)

If the discriminator is unrestricted, the unique optimal solution is the Bayes posterior probability when the model density is everywhere positive. And the optimal parameter θ corresponds to the closest model in terms of Kullback-Leibler divergence. The pseudo-true value of the parameter. This is however the ideal situation, while in practice D is restricted to a parametric family. In this case, if the family is wide enough to approximate the ideal discriminator in the sup norm, with error of order ε, and if the parameter space Θ is compact, the optimal parameter found under the restricted family approximates the pseudo-true value in the sense of the GAN loss, at the order ε². With a stronger assumption on the family's ability to approximate any discriminator, the same property holds for the empirical version (and in expectation). (As an aside, the figure illustrating this property confusingly uses a histogram-like rectangle to indicate the expectation of the discriminator loss!) And both parameter (θ and α) estimators converge to the optimal ones as the sample size grows. An interesting foray from statisticians into a method whose statistical properties are rarely if ever investigated. Missing a comparison with alternative approaches, like MLE, though.
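For readers wanting the formula behind the "Bayes posterior probability", here is the standard form from the GAN literature (Goodfellow et al., 2014), written in notation I am assuming rather than quoting from the paper: p⋆ is the true data density, p_θ the model (generator) density, and D a discriminator. Plugging the optimal discriminator back into the criterion gives, up to constants, a Jensen-Shannon divergence, that is, an average of two Kullback-Leibler divergences to the mixture m.

```latex
% Assumed notation: p^\star = data density, p_\theta = model density.
D^\star_\theta(x) = \frac{p^\star(x)}{p^\star(x) + p_\theta(x)},
\qquad
\sup_{D}\Big\{ \mathbb{E}_{p^\star}[\log D(X)]
             + \mathbb{E}_{p_\theta}[\log(1 - D(X))] \Big\}
  = 2\,\mathrm{JS}(p^\star, p_\theta) - \log 4,

\mathrm{JS}(p^\star, p_\theta)
  = \tfrac12\,\mathrm{KL}(p^\star \,\|\, m)
  + \tfrac12\,\mathrm{KL}(p_\theta \,\|\, m),
\qquad m = \tfrac12\,(p^\star + p_\theta).
```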

## scalable Metropolis-Hastings, nested Monte Carlo, and normalising flows

Posted in Books, pictures, Statistics, University life with tags Bayesian neural networks, Bernstein-von Mises theorem, CIF, computing cost, conferences, density approximation, dissertation, doubly intractable posterior, evidence, ICML 2019, ICML 2020, image analysis, International Conference on Machine Learning, L¹ convergence, logistic regression, nesting Monte Carlo, normalising flow, PhD, probabilistic programming, quarantine, SAME algorithm, scalable MCMC, thesis defence, University of Oxford, variational autoencoders, viva on June 16, 2020 by xi'an

**O**ver a sunny if quarantined Sunday, I started reading the PhD dissertation of Rob Cornish, Oxford University, as I am the external member of his viva committee. Ending up in a highly pleasant afternoon discussing this thesis over a (remote) viva yesterday. (If bemoaning a lost opportunity to visit Oxford!) The introduction to the viva was most helpful and set the results within the different time and geographical zones of the Ph.D since Rob had to switch from one group of advisors in Engineering to another group in Statistics. Plus an encompassing prospective discussion, expressing pessimism at exact MCMC for complex models and looking forward to further advances in probabilistic programming.

Made of three papers, the thesis includes this ICML 2019 [remember the era when there were conferences?!] paper on scalable Metropolis-Hastings, by Rob Cornish, Paul Vanetti, Alexandre Bouchard-Côté, George Deligiannidis, and Arnaud Doucet, which I commented on last year. Which achieves a remarkable and paradoxical O(1/√n) cost per iteration, provided (global) lower bounds are found on the (local) Metropolis-Hastings acceptance probabilities, since they allow for Poisson thinning à la Devroye (1986), and second order Taylor expansions are constructed for all components of the target, with the third order derivatives providing bounds. However, the variability of the acceptance probability gets higher, which induces a longer but still manageable run if the concentration of the posterior is in tune with the Bernstein-von Mises asymptotics. I had not paid enough attention on my first read to the strong theoretical justification for the method, relying on the convergence of MAP estimates in well- and (some) mis-specified settings. Now, I would have liked to see the paper dealing with a more complex problem than logistic regression.

The second paper in the thesis is an ICML 2018 proceedings paper by Tom Rainforth, Robert Cornish, Hongseok Yang, Andrew Warrington, and Frank Wood, which considers Monte Carlo problems involving several *nested* expectations in a non-linear manner, meaning that (a) several levels of Monte Carlo approximations are required, with associated asymptotics, and (b) the resulting overall estimator is biased. This includes common doubly intractable posteriors, obviously, as well as (Bayesian) design and control problems. [And it has nothing to do with nested sampling.] The resolution chosen by the authors is strictly plug-in, in that they replace each level in the nesting with a Monte Carlo substitute and do not attempt to reduce the bias. Which means a wide range of solutions (other than the plug-in one) could have been investigated, including bootstrap maybe. For instance, Bayesian design is presented as an application of the approach, but since it relies on the log-evidence, there exist several versions for estimating (unbiasedly) this log-evidence. Similarly, the Forsythe-von Neumann technique applies to arbitrary transforms of a primary integral. The central discussion dwells on the optimal choice of the volume of simulations at each level, optimal in terms of asymptotic MSE. Or rather of an asymptotic bound on the MSE. The interesting result being that the outer expectation requires the square of the number of simulations used for the other expectations. Which all need to go to infinity. A trick in finding an estimator for a polynomial transform reminded me of the SAME algorithm in that it duplicated the simulations as many times as the highest power of the polynomial. (The ‘Og briefly reported on this paper… four years ago.)
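To make the plug-in idea concrete, here is a minimal toy sketch of mine, not taken from the paper: it estimates E_y[f(E_z[g(y,z)])] by replacing the inner expectation with an inner Monte Carlo average. The function `nested_mc`, the Gaussian toy model, and the sample sizes are all illustrative assumptions. With f(u)=u² and g(y,z)=y+z, the true value is 1, while the plug-in estimator has expectation 1+1/n_inner, which exhibits the bias mentioned above. The outer sample size is set to the square of the inner one, in the spirit of the rate discussed in the post.

```python
import random


def nested_mc(f, sample_y, g, sample_z, n_outer, n_inner, rng):
    """Plug-in nested Monte Carlo estimate of E_y[ f( E_z[ g(y, z) ] ) ]."""
    total = 0.0
    for _ in range(n_outer):
        y = sample_y(rng)
        # Inner level: replace E_z[g(y, z)] by an average over n_inner draws.
        inner = sum(g(y, sample_z(rng)) for _ in range(n_inner)) / n_inner
        # Outer level: apply the non-linear f to the noisy inner average.
        total += f(inner)
    return total / n_outer


# Toy model (an assumption for illustration): y, z ~ N(0, 1), g(y, z) = y + z,
# f(u) = u^2, so the target is E[y^2] = 1 and the plug-in bias is 1 / n_inner.
rng = random.Random(0)
n_inner = 50
n_outer = n_inner ** 2  # outer size taken as the square of the inner size
estimate = nested_mc(
    f=lambda u: u * u,
    sample_y=lambda r: r.gauss(0.0, 1.0),
    g=lambda y, z: y + z,
    sample_z=lambda r: r.gauss(0.0, 1.0),
    n_outer=n_outer,
    n_inner=n_inner,
    rng=rng,
)
print(estimate)
```

No bias correction is attempted here, exactly as in a strict plug-in scheme; any debiasing device would be a different estimator.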

The third and last part of the thesis is a proposal [to appear in ICML 20] on relaxing bijectivity constraints in normalising flows with continuously indexed flows. (Or CIF. As Rob made a joke about this cleaning brand, let me add (?) to that joke by mentioning that looking at CIF and *bijections* is less dangerous in a Trump cum COVID era than at CIF and *injections*!) With Anthony Caterini, George Deligiannidis and Arnaud Doucet as co-authors. I am much less familiar with this area and hence a wee bit puzzled at the purpose of removing what I understand to be an appealing side of normalising flows, namely to produce a manageable representation of density functions as a combination of bijective and differentiable functions of a baseline random vector, like a standard Normal vector. The argument made in the paper is that imposing this representation of the density imposes a constraint on the topology of its support since said support is homeomorphic to the support of the baseline random vector. While the supporting theoretical argument is a mathematical theorem that shows the Lipschitz bound on the transform should be infinity in the case the supports are topologically different, these arguments may be overly theoretical when faced with the practical implications of the replacement strategy. I somewhat miss its overall strength given that the whole point seems to be in approximating a density function, based on a finite sample.
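As a reminder of what the bijective representation buys, here is a small sketch of a one-dimensional flow, under assumptions of mine: the transform x = sinh(A·z) with a hypothetical scale `A`, applied to a standard Normal z. It is not the CIF construction of the paper, only the plain change-of-variables formula that CIFs relax. The density of x is the base density at the inverse image, divided by the absolute Jacobian of the transform, and the last lines check numerically that it integrates to about one.

```python
import math

A = 0.5  # hypothetical scale parameter of the toy flow


def base_logpdf(z):
    """Log-density of the standard Normal baseline."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def forward(z):
    """Bijective, differentiable map from the baseline to the target space."""
    return math.sinh(A * z)


def inverse(x):
    """Inverse of forward."""
    return math.asinh(x) / A


def flow_logpdf(x):
    """Change of variables: log p(x) = log p_base(z) - log |d forward / dz|."""
    z = inverse(x)
    # d forward / dz = A * cosh(A z) = A * sqrt(1 + x^2) since A z = asinh(x)
    log_abs_jac = math.log(A) + 0.5 * math.log1p(x * x)
    return base_logpdf(z) - log_abs_jac


# Riemann-sum check over [-30, 30] that the flow density has total mass near 1.
step = 0.01
total = sum(math.exp(flow_logpdf(-30.0 + i * step)) * step for i in range(6001))
print(total)
```

Since the map is a homeomorphism of the real line, the support of x is necessarily the whole line, which is the topological constraint the paper argues against.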

## indecent exposure

Posted in Statistics with tags ABC, Bayesian optimisation, Bretagne, Brittany, exponential families, image analysis, image processing, inference, Lugano, maximum likelihood estimation, MCqMC 2018, pre-processing, Rennes on July 27, 2018 by xi'an

**W**hile attending my last session at MCqMC 2018, in Rennes, before taking a train back to Paris, I was confronted by this radical opinion upon our previous work with Matt Moores (Warwick) and other coauthors from QUT, where the speaker, Maksym Byshkin from Lugano, defended a new approach for maximum likelihood estimation using novel MCMC methods. Based on the fixed-point equation characterising maximum likelihood estimators for exponential families, when theoretical and empirical moments of the natural statistic are equal. Using a Markov chain whose stationary distribution is the said exponential family, the fixed-point equation can be turned into a zero divergence equation, requiring simulation of pseudo-data from the model, which depends on the unknown parameter. Breaking this circular argument, the authors note that simulating pseudo-data that reproduce the observed value of the sufficient statistic is enough. Which is related to Geyer and Thompson's (1992) famous paper about Monte Carlo maximum likelihood estimation. From there I was and remain lost as I cannot see why a derivative of the expected divergence with respect to the parameter θ can be computed when this divergence is found by Monte Carlo rather than exhaustive enumeration. And later used in a stochastic gradient move on the parameter θ… Especially when the null divergence is imposed on the parameter. In any case, the final slide shows an application to a large image and an Ising model, solving the problem (?) in 140 seconds and suggesting indecency, when our much slower approach is intended to produce a complete posterior simulation in this context.
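For completeness, the fixed-point equation alluded to above is the classical likelihood equation of an exponential family. The notation is my own assumption, not the speaker's: s is the natural sufficient statistic, ψ the log-normalising constant, h the base measure density, and x_obs the observed data.

```latex
% Assumed notation: natural exponential family with statistic s and log-partition \psi.
p_\theta(x) = h(x)\,\exp\{\theta^{\top} s(x) - \psi(\theta)\},
\qquad
\nabla\psi(\theta) = \mathbb{E}_\theta[s(X)],

\nabla_\theta \log p_\theta(x_{\mathrm{obs}})
  = s(x_{\mathrm{obs}}) - \mathbb{E}_\theta[s(X)] = 0
\quad\Longleftrightarrow\quad
\mathbb{E}_{\hat\theta}[s(X)] = s(x_{\mathrm{obs}}).
```

For an Ising model the expectation on the right is intractable, hence the need to approximate it by simulation from the model.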