## Archive for the University life Category

## re-reading Halmos (1946)

Posted in Books, Kids, Statistics, University life with tags central moments, cross validated, Paul Halmos, unbiasedness on May 16, 2020 by xi'an

**B**ased on a (basic) question on X validated, I re-read Halmos' (1946) famous paper on the non-existence of unbiased estimators of centred moments of order larger than the sample size. While the exposition may sound a wee bit daunting, the reasoning is essentially based on a recursion and the binomial theorem, since expanding the kth power leads to lower-order moments, all of which can be estimated from a subsample.
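
For the record, here is a hedged sketch of that binomial-expansion step, in my own notation rather than Halmos': the k-th central moment unfolds into products of raw moments, and each such product can be estimated without bias by averaging over tuples of distinct observations, which is only possible once the sample size reaches k.

```latex
% sketch of the expansion (my notation, not Halmos'): the k-th central moment
% decomposes into products of raw moments,
\[
  \mu_k \;=\; \mathbb{E}\big[(X - \mathbb{E}X)^k\big]
        \;=\; \sum_{j=0}^{k} \binom{k}{j} (-1)^{k-j}\,
              \mathbb{E}[X^j]\,\big(\mathbb{E}X\big)^{k-j},
\]
% and each product admits an unbiased estimator built on distinct observations,
% e.g. for the j = 0 term,
\[
  \mathbb{E}\big[X_{i_1} X_{i_2} \cdots X_{i_k}\big] \;=\; \big(\mathbb{E}X\big)^k
  \qquad (i_1,\dots,i_k \text{ all distinct}),
\]
% which requires at least k observations, hence the constraint on the sample size.
```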

## stratified ABC [One World ABC webinar]

Posted in Books, Statistics, University life with tags Ascension, Australia, BayesComp 2020, bootstrap likelihood, Gainesville, groundhog day, Monash University, One World ABC Seminar, stratified sampling on May 15, 2020 by xi'an

**T**he third episode of the One World ABC seminar (Season 1!) was kindly delivered by Umberto Picchini on Stratified sampling and bootstrapping for ABC, which I had already, if briefly, discussed after BayesComp 2020. Which sounds like a million years ago… His introduction on the importance of estimating the likelihood using a kernel, while 600% justified wrt his talk, made the One World ABC seminar sound almost like groundhog day! The central argument lies in the computational gain brought by simulating a single θ-dependent [expensive] dataset followed by [cheaper] bootstrap replicates. Which turns de facto into bootstrapping the summary statistics.

If I understand correctly, the post-stratification approach of Art Owen (2013?, I cannot find the reference) corrects a misrepresentation of mine. Indeed, defining a partition with unknown probability weights seemed to me to annihilate the appeal of stratification, because the Bernoulli variance of the estimated probabilities brought back the same variability as the mother estimator. But with bootstrap, this requires only two simulations, one for the weights and one for the target. And further allows for a larger ABC tolerance *in fine*. Free lunch?!
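
As a purely illustrative sketch of this two-batch logic (not Picchini's or Owen's actual scheme, and with a toy Normal simulator standing in for an expensive ABC model), one batch estimates the strata weights and an independent batch the within-stratum means:

```python
# Generic post-stratification sketch (my illustration only): estimate E[g(X)]
# with strata probabilities that are themselves estimated from an independent
# batch of simulations.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Stand-in simulator; in ABC this would be the (expensive) model."""
    return rng.normal(size=n)

def g(x):
    return x ** 2

strata = [(-np.inf, -1.0), (-1.0, 1.0), (1.0, np.inf)]  # fixed partition

# Batch 1: estimate the (unknown) strata probabilities.
x1 = simulate(10_000)
weights = np.array([np.mean((lo < x1) & (x1 <= hi)) for lo, hi in strata])

# Batch 2: estimate the within-stratum means of g, independently of batch 1.
x2 = simulate(10_000)
means = np.array([g(x2[(lo < x2) & (x2 <= hi)]).mean() for lo, hi in strata])

post_stratified = float(np.sum(weights * means))
crude = float(g(simulate(10_000)).mean())
print(post_stratified, crude)  # both estimate E[X^2] = 1
```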

The speaker in two weeks (21 May, or Ascension Thursday!) is my friend and co-author Gael Martin from Monash University, who will speak on Focused Bayesian prediction, at quite a late hour down under…!

## sequential neural likelihood estimation as ABC substitute

Posted in Books, Kids, Statistics, University life with tags ABC, AISTATS 2019, AMIS, autoregressive flow, Bayesian inference, Gaussian copula, Gaussian processes, indirect inference, JMLR, Kullback-Leibler divergence, MCMC, neural density estimator, neural network, noise-contrastive estimation, normalizing flow, Scotland, synthetic likelihood, University of Edinburgh, variational Bayes methods on May 14, 2020 by xi'an

**A** JMLR paper by Papamakarios, Sterratt, and Murray (Edinburgh), first presented at the AISTATS 2019 meeting, on a new form of likelihood-free inference, away from non-zero tolerance and from the distance-based versions of ABC, following earlier papers by Iain Murray and co-authors in the same spirit. Which I got pointed to during the ABC workshop in Vancouver. At the time I had no idea as to what autoregressive flows meant. We were supposed to hold a reading group on this paper at Paris-Dauphine last week, unfortunately cancelled as a coronaviral precaution… Here are some notes I had prepared for the meeting that did not take place.

“A simulator model is a computer program, which takes a vector of parameters θ, makes internal calls to a random number generator, and outputs a data vector x.”

Just the usual generative model then.

“A conditional neural density estimator is a parametric model q(.|φ) (such as a neural network) controlled by a set of parameters φ, which takes a pair of datapoints (u,v) and outputs a conditional probability density q(u|v,φ).”

Less usual, in that the outcome is guaranteed to be a probability density.

“For its neural density estimator, SNPE uses a Mixture Density Network, which is a feed-forward neural network that takes x as input and outputs the parameters of a Gaussian mixture over θ.”

In which theoretical sense would it improve upon classical or Bayesian density estimators? Where are the error evaluation, the optimal rates, the sensitivity to the dimension of the data? of the parameter?
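
For concreteness, here is a bare-bones PyTorch-style sketch of such a mixture density network (my own toy rendering, not the SNPE code; layer sizes and component count are arbitrary):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDN(nn.Module):
    """Feed-forward net mapping x to a Gaussian mixture over theta."""
    def __init__(self, x_dim, theta_dim, n_components=5, hidden=64):
        super().__init__()
        self.theta_dim, self.K = theta_dim, n_components
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.logits = nn.Linear(hidden, n_components)                   # mixture weights
        self.means = nn.Linear(hidden, n_components * theta_dim)        # component means
        self.log_scales = nn.Linear(hidden, n_components * theta_dim)   # diagonal scales

    def log_prob(self, theta, x):
        h = self.body(x)
        log_w = F.log_softmax(self.logits(h), dim=-1)                   # (batch, K)
        mu = self.means(h).view(-1, self.K, self.theta_dim)
        sigma = self.log_scales(h).view(-1, self.K, self.theta_dim).exp()
        comp = torch.distributions.Normal(mu, sigma)
        log_comp = comp.log_prob(theta.unsqueeze(1)).sum(-1)            # (batch, K)
        return torch.logsumexp(log_w + log_comp, dim=-1)

# training would maximise sum_i log q(theta_i | x_i) over simulated (theta_i, x_i) pairs
```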

“Our new method, Sequential Neural Likelihood (SNL), avoids the bias introduced by the proposal, by opting to learn a model of the likelihood instead of the posterior.”

I do not quite get the argument, in that the final outcome (of using the approximation within an MCMC scheme) remains biased since the likelihood is not the exact likelihood. Where is the error evaluation? Note that in the associated Algorithm 1, the learning set is enlarged on each round, as in AMIS, rather than reset to the empty set ∅.
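
My reading of that loop, as a heavily simplified and hedged sketch: a crude Gaussian regression stands in for the neural conditional density estimator, a random-walk Metropolis step explores the surrogate posterior prior(θ) × q̂(x⁰|θ), and the training set only ever grows, as in AMIS. All names and settings below are mine, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
x_obs = 1.5
prior_sd = 3.0
log_prior = lambda th: -0.5 * th ** 2 / prior_sd ** 2      # N(0, prior_sd^2), up to a constant

def simulator(theta):                                       # toy model: x | theta ~ N(theta, 1)
    return theta + rng.normal()

def fit_conditional(thetas, xs):
    """Stand-in for the neural estimator: fit x ~ a + b*theta + N(0, s^2)."""
    b, a = np.polyfit(thetas, xs, 1)
    s = np.std(xs - (a + b * thetas)) + 1e-6
    return lambda x, th: -0.5 * ((x - a - b * th) / s) ** 2 - np.log(s)

def metropolis(log_target, n, start=0.0, step=1.0):
    th, chain = start, []
    for _ in range(n):
        prop = th + step * rng.normal()
        if np.log(rng.uniform()) < log_target(prop) - log_target(th):
            th = prop
        chain.append(th)
    return np.array(chain)

thetas = list(rng.normal(0.0, prior_sd, 50))                # round-0 proposals: the prior
xs = [simulator(t) for t in thetas]
for r in range(4):
    q_log = fit_conditional(np.array(thetas), np.array(xs))
    chain = metropolis(lambda th: log_prior(th) + q_log(x_obs, th), 2000)
    new_thetas = rng.choice(chain[1000:], size=50)           # next round's proposals
    thetas += list(new_thetas)                               # the training set only grows
    xs += [simulator(t) for t in new_thetas]

print(np.mean(chain[1000:]), np.std(chain[1000:]))           # crude posterior summaries
```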

“…given enough simulations, a sufficiently flexible conditional neural density estimator will eventually approximate the likelihood in the support of the proposal, regardless of the shape of the proposal. In other words, as long as we do not exclude parts of the parameter space, the way we propose parameters does not bias learning the likelihood asymptotically. Unlike when learning the posterior, no adjustment is necessary to account for our proposing strategy.”

This is a rather vague statement, with the only support being that the Monte Carlo approximation to the Kullback-Leibler divergence does converge to its actual value, i.e., a direct application of the Law of Large Numbers! But an interesting point I informally made a (long) while ago is that all that matters is the estimate of the density at x⁰. Or at the value of the statistic at x⁰. The masked auto-encoder density estimator is based on a sequence of bijections with a lower-triangular Jacobian matrix, meaning the conditional density estimate is available in closed form. Which makes it sound like a form of neurotic variational Bayes solution.
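
To make the closed-form claim concrete, here is a tiny two-dimensional illustration of mine (nothing learned and nothing masked): each coordinate is a shifted and rescaled Gaussian given the previous ones, the Jacobian is lower triangular, and the log-density only involves the inverse map and the sum of the log diagonal terms.

```python
import numpy as np

def forward(u):
    """x = T(u): autoregressive, x1 depends on u1, x2 on (x1, u2)."""
    x1 = 2.0 + 0.5 * u[0]
    x2 = np.sin(x1) + np.exp(0.3 * x1) * u[1]
    return np.array([x1, x2])

def log_density(x):
    """q(x) via u = T^{-1}(x) and the triangular-Jacobian correction."""
    u1 = (x[0] - 2.0) / 0.5
    scale2 = np.exp(0.3 * x[0])
    u2 = (x[1] - np.sin(x[0])) / scale2
    log_base = -0.5 * (u1**2 + u2**2) - np.log(2 * np.pi)   # standard normal base
    log_det = np.log(0.5) + np.log(scale2)                   # product of diagonal terms
    return log_base - log_det

# evaluate the closed-form density at a pushed-forward point
print(log_density(forward(np.array([0.3, -1.2]))))
```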

The paper also links with ABC (too costly?), other parametric approximations to the posterior (like Gaussian copulas and variational likelihood-free inference), synthetic likelihood, Gaussian processes, noise contrastive estimation… With experiments involving some of the above. But the experiments involve rather smooth models with relatively few parameters.

“A general question is whether it is preferable to learn the posterior or the likelihood (…) Learning the likelihood can often be easier than learning the posterior, and it does not depend on the choice of proposal, which makes learning easier and more robust (…) On the other hand, methods such as SNPE return a parametric model of the posterior directly, whereas a further inference step (e.g. variational inference or MCMC) is needed on top of SNL to obtain a posterior estimate”

A fair point in the conclusion. Which also mentions the curse of dimensionality (both for parameters and observations) and the possibility to work directly with summaries.

Getting back to the earlier and connected Masked autoregressive flow for density estimation paper, by Papamakarios, Pavlakou and Murray:

“Viewing an autoregressive model as a normalizing flow opens the possibility of increasing its flexibility by stacking multiple models of the same type, by having each model provide the source of randomness for the next model in the stack. The resulting stack of models is a normalizing flow that is more flexible than the original model, and that remains tractable.”

Which makes it sound like a sort of neural network in the density space. Optimised by Kullback-Leibler minimisation to get asymptotically close to the likelihood. But a form of Bayesian indirect inference in the end, namely an MLE on a pseudo-model, using the estimated model as a proxy in Bayesian inference…
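
A trivially small illustration of the stacking point (mine, with plain affine layers that add no flexibility but show the bookkeeping): each layer in the stack contributes one log-Jacobian term to the overall density, so the composition stays exactly tractable.

```python
import numpy as np

layers = [(0.5, 1.0), (2.0, -0.3), (0.8, 0.7)]   # (scale, shift) per affine layer

def push_forward(u):
    log_det = 0.0
    for s, b in layers:               # each layer feeds the next one
        u = s * u + b
        log_det += np.log(abs(s))
    return u, log_det

def log_density(x):                   # invert the stack, then change of variables
    log_det = 0.0
    for s, b in reversed(layers):
        x = (x - b) / s
        log_det += np.log(abs(s))
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi) - log_det

x, _ = push_forward(0.4)
print(log_density(x))
```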

## neural importance sampling

Posted in Books, Kids, pictures, Statistics, University life with tags autoencoder, Bayesian neural networks, bias, coupling layer, Dennis Prangle, Disney, importance sampling, light transport, normalizing flow, path sampling, Université Paris Dauphine on May 13, 2020 by xi'an

**D**ennis Prangle signaled this paper during his talk of last week, the first of our ABC ‘minars now rechristened as The One World ABC Seminar to join the “One World xxx Seminar” franchise! The paper is written by Thomas Müller and co-authors, all from Disney research [hence the illustration], and we discussed it in our internal reading seminar at Dauphine. The authors propose to parameterise the importance sampling density via neural networks, just like Dennis is using auto-encoders. Starting with the goal of approximating the integral ℑ = ∫ f(x) dx (where they should assume *f* to be non-negative for the following), the authors aim at simulating from an approximation of f(x)/ℑ, since this “ideal” pdf would give zero variance.
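
A toy numerical check of that zero-variance remark (my own sketch, with a closed-form proposal rather than the paper's neural ones): the importance weights f(X)/q(X) are exactly constant when q ∝ f, and fluctuate otherwise.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
f = lambda x: 3.0 * stats.norm.pdf(x, loc=1.0, scale=0.7)   # unnormalised target, I = 3

def is_estimate(q_sample, q_pdf, n=5000):
    x = q_sample(n)
    w = f(x) / q_pdf(x)
    return w.mean(), w.std()

# mismatched proposal: the weights fluctuate
print(is_estimate(lambda n: rng.normal(0, 2, n), lambda x: stats.norm.pdf(x, 0, 2)))
# "ideal" proposal q = f / I: every weight equals I = 3 exactly, zero variance
print(is_estimate(lambda n: rng.normal(1.0, 0.7, n), lambda x: stats.norm.pdf(x, 1.0, 0.7)))
```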

“Unfortunately, the above integral is often not solvable in closed form, necessitating its estimation with another Monte Carlo estimator.”

Among the discussed solutions, the Latent-Variable Model one is based on a pdf represented as a marginal. A mostly intractable integral, which the authors surprisingly seem to deem an issue as they do not mention the standard solution of simulating from the joint and using the conditional in the importance weight. (Or even more surprisingly and obviously wrongly see the latter as a biased approximation to the weight.)
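
And a quick check of that standard solution (again my own sketch, not the paper's setting): with a proposal defined as the marginal q(x) = ∫ q(x|z)p(z)dz, simulating (z,x) jointly and weighting by f(x)/q(x|z) remains unbiased for ℑ, since E[f(X)/q(X|Z)] = ∫∫ p(z)q(x|z) f(x)/q(x|z) dx dz = ℑ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
f = lambda x: 2.0 * stats.norm.pdf(x, 0.5, 1.0)     # unnormalised target, I = 2
n = 200_000

z = rng.normal(size=n)                               # z ~ p(z) = N(0,1)
x = z + rng.normal(size=n)                           # x | z ~ q(x|z) = N(z,1)
w_conditional = f(x) / stats.norm.pdf(x, z, 1.0)     # weight with the conditional
w_marginal = f(x) / stats.norm.pdf(x, 0.0, np.sqrt(2.0))  # marginal, tractable here only

print(w_conditional.mean(), w_marginal.mean())       # both ≈ 2, the former merely noisier
```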

“These “autoregressive flows” offer the desired exact evaluation of q(x;θ). Unfortunately, they generally only permit either efficient sample generation or efficient evaluation of q(x;θ), which makes them prohibitively expensive for our application to Monte Carlo integration.”

The paper then presents normalizing flows, namely the representation of the simulation output as the result of an invertible mapping of a standard (e.g., Gaussian or Uniform) random variable, x=h(u,θ), which can itself be decomposed into a composition of such elementary mappings. And I am thus surprised this cannot be done in an efficient manner if the transforms are well chosen…

“The key proposition of Dinh et al. (2014) is to focus on a specific class of mappings—referred to as coupling layers—that admit Jacobian matrices where determinants reduce to the product of diagonal terms.”

Using a transform with a triangular Jacobian at each stage has the appeal of keeping the change of variable simple while allowing for non-linear transforms. Namely piecewise polynomials. When reading about the one-blob (!) encoding, I am however uncertain the approach is more than the choice of a particular functional basis, as for instance wavelets (which may prove more costly to handle, granted!).
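
To fix ideas, here is a bare-bones affine coupling layer in the spirit of Dinh et al. (a numpy sketch of mine, with a hand-coded map standing in for the neural scale-and-shift networks): half of the coordinates pass through untouched and drive an invertible scale-and-shift of the other half, so the Jacobian is triangular and its log-determinant is just the sum of the log-scales.

```python
import numpy as np

def scale_and_shift(x_a):
    """Stand-in for the neural nets s(.) and t(.) acting on the passive half."""
    return np.tanh(x_a), 0.5 * x_a            # (log-scales, shifts)

def coupling_forward(x):
    x_a, x_b = x[:2], x[2:]                    # split coordinates in two halves
    log_s, t = scale_and_shift(x_a)
    y = np.concatenate([x_a, x_b * np.exp(log_s) + t])
    return y, log_s.sum()                      # log|det J| = sum of the log-scales

def coupling_inverse(y):
    y_a, y_b = y[:2], y[2:]
    log_s, t = scale_and_shift(y_a)            # y_a == x_a, so this is recoverable
    return np.concatenate([y_a, (y_b - t) * np.exp(-log_s)])

x = np.array([0.1, -0.4, 1.3, 0.7])
y, log_det = coupling_forward(x)
print(np.allclose(coupling_inverse(y), x), log_det)   # invertibility check
```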

“Given that NICE scales well to high-dimensional problems…”

It is always unclear to me why almost *every* ML paper feels the urge to redefine & motivate the KL divergence. And to recall that it avoids bothering about the normalising constant. Looking at the variance of the MC estimator & seeking minimal values is praiseworthy, but *only* when the variance exists. What are the guarantees on the density estimate for this to happen? And where are the arguments for NICE scaling nicely to high dimensions? Interesting intrusion of path sampling, but is it of any use outside image analysis—I had forgotten Eric Veach’s original work was on light transport—?

## Monte Carlo Markov chains

Posted in Books, Statistics, University life with tags Andrei Kolmogorov, Bayesian Analysis, Bayesian model comparison, book review, CHANCE, Gregor Mendel, iris data, irreducibility, JASA, Jeffreys priors, Kolmogorov axioms, Kolmogorov-Smirnov distance, MCMC, physics, population genetics, pot-pourri, recurrence, Richard Price, Ronald Fisher, Springer-Verlag, textbook, Thomas Bayes, W. Gosset on May 12, 2020 by xi'an

**D**arren Wraith pointed out this (currently free access) Springer book by Massimiliano Bonamente [whose family name means *good spirit* in Italian] to me for its use of the unusual *Monte Carlo Markov chain* rendering of MCMC. (Google Trends seems to restrict its use to California!) This is a graduate text for physicists, but one could nonetheless expect more rigour in the processing of the topics. Particularly of the Bayesian topics. Here is a pot-pourri of memorable quotes:

*“Two major avenues are available for the assignment of probabilities. One is based on the repetition of the experiments a large number of times under the same conditions, and goes under the name of the frequentist or classical method. The other is based on a more theoretical knowledge of the experiment, but without the experimental requirement, and is referred to as the Bayesian approach.”*

*“The Bayesian probability is assigned based on a quantitative understanding of the nature of the experiment, and in accord with the Kolmogorov axioms. It is sometimes referred to as *empirical probability*, in recognition of the fact that sometimes the probability of an event is assigned based upon a practical knowledge of the experiment, although without the classical requirement of repeating the experiment for a large number of times. This method is named after the Rev. Thomas Bayes, who pioneered the development of the theory of probability.”*

*“The likelihood P(B/A) represents the probability of making the measurement B given that the model A is a correct description of the experiment.”*

*“…a uniform distribution is normally the logical assumption in the absence of other information.”*

*“The Gaussian distribution can be considered as a special case of the binomial, when the number of tries is sufficiently large.”*

*“This clearly does not mean that the Poisson distribution has no variance—in that case, it would not be a random variable!”*

*“The method of moments therefore returns unbiased estimates for the mean and variance of every distribution in the case of a large number of measurements.”*

*“The great advantage of the Gibbs sampler is the fact that the acceptance is 100 %, since there is no rejection of candidates for the Markov chain, unlike the case of the Metropolis–Hastings algorithm.”*

Let me then point out (or just whine about!) the book using "statistical independence" for plain independence, the use of / rather than Jeffreys' | for conditioning (and sometimes forgetting \ in some LaTeX formulas), the confusion between events and random variables, esp. when computing the posterior distribution, and between models and parameter values, the reliance on discrete probability for continuous settings, as in the Markov chain chapter, confusing density and probability, using Mendel's pea data without mentioning the unlikely fit to the expected values (or, as put more subtly by Fisher (1936), "the data of most, if not all, of the experiments have been falsified so as to agree closely with Mendel's expectations"), presenting Fisher's and Anderson's *Iris data* [a motive for rejection when George was JASA editor!] as "a new classic experiment", mentioning Pearson but not Lee for the data in the 1903 Biometrika paper "On the laws of inheritance in man" (and woman!), and not accounting for the discrete nature of this data in the linear regression chapter, the three-page derivation of the Gaussian distribution from a Taylor expansion of the Binomial pmf obtained by differentiating in the integer argument, spending endless pages on deriving standard properties of classical distributions, an appalling mess of adding over the conditioning atoms with no normalisation in a Poisson experiment, botching the proof of the CLT, which is treated *before* the Law of Large Numbers, restricting maximum likelihood estimation to the Gaussian and Poisson cases and muddling its meaning by discussing unbiasedness, confusing a drifted Poisson random variable with a drift on its parameter, as well as using the pmf of the Poisson to define an area under the curve (Fig. 5.2), sweeping the impropriety of a constant prior under the carpet, defining a null hypothesis as a range of values for a summary statistic, no mention of Bayesian perspectives in the hypothesis testing, model comparison, and regression chapters, having one-dimensional case chapters followed by two-dimensional case chapters, reducing model comparison to the use of the Kolmogorov-Smirnov test, processing bootstrap and jackknife in the Monte Carlo chapter without a mention of importance sampling, stating recurrence results without assuming irreducibility, motivating MCMC by the intractability of the evidence, resorting to the term *link* to designate the current value of a Markov *chain*, incorporating the need for a prior distribution in a terrible description of the Metropolis-Hastings algorithm, including a discrete proof for its stationarity, spending many pages on early 1990s MCMC convergence tests rather than discussing the adaptive scaling of proposal distributions, the inclusion of numerical tables [in a 2017 book], and turning Bayes (1763) into Bayes and Price (1763), or Student (1908) into Gosset (1908).

*[Usual disclaimer about potential self-plagiarism: this post or an edited version of it could possibly appear later in my Books Review section in CHANCE. Unlikely, though!]*

## politics coming [too close to] statistics [or the reverse]

Posted in Books, pictures, Statistics, University life with tags Bayesian decision theory, Boris Johnson, coronavirus epidemics, COVID-19, data, David Spiegelhalter, death certificate, Jim Smith, official statistics, The Guardian, UK Parliament, University of Warwick on May 9, 2020 by xi'an

**O**n 30 April, David Spiegelhalter wrote an opinion column in The Guardian, Coronavirus deaths: how does Britain compare with other countries?, where he pointed out the difficulty, even “for a bean-counting statistician to count deaths”, as the reported figures are undercounts, and stated that “many feel that excess deaths give a truer picture of the impact of an epidemic”. Which, as an aside, I indeed believe is more objective material, as also reported by INSEE and INED in France.

“…my cold, statistical approach is to wait until the end of the year, and the years after that, when we can count the excess deaths. Until then, this grim contest won’t produce any league tables we can rely on.” D. Spiegelhalter

My understanding of the column is that the quick accumulation of raw numbers, even for deaths, and their use in comparing procedures and countries does not help in understanding the impact of policies, and of the actions and reactions from a week ago. Starting with the delays in reporting death certificates, as again illustrated by the ten-day lag in the INSEE reports. Not to mention the need to account for covariates such as population density, and economic and health indicators. (The graph below for instance relies on deaths so far *attributed* to COVID-19 rather than on excess deaths, while these attributions depend on the country's policy and its official statistics capacities.)

*“Polite request to PM and others: please stop using my Guardian article to claim we cannot make any international comparisons yet. I refer only to detailed league tables—of course we should now use other countries to try and learn why our numbers are high.” *D. Spiegelhalter

However, when on 6 May Boris Johnson used this Guardian article during Prime Minister's Questions in the UK Parliament to deflect a question from the Labour leader, Keir Starmer, David Spiegelhalter reacted with the above tweet, the point being that, even with poor and undercounted data, the total number of cases is much worse than predicted by the earlier models and deadlier than in neighbouring countries. Anyway, three other fellow statisticians, Phil Brown, Jim Smith (Warwick), and Henry Wynn, also reacted to David's column by complaining about the lack of statistical modelling behind it and the fatalistic message it carries, advocating for model-based decision-making, which would be fine if the data were not so unreliable… or if the proposed models were equipped with uncertainty bumpers accounting for misspecification and erroneous data.