Archive for Kullback-Leibler divergence

simulation as optimization [by kernel gradient descent]

Posted in Books, pictures, Statistics, University life on April 13, 2024 by xi'an

Yesterday, which proved an unseasonably bright, warm day, I biked (with a new wheel!) to the east of Paris—in the Gare de Lyon district where I lived for three years in the 1980’s—to attend a Mokaplan seminar at INRIA Paris, where Anna Korba (CREST, to which I am also affiliated) talked about sampling through optimization of discrepancies.
This proved a most formative hour as I had not seen this perspective earlier (or had possibly forgotten about it), except through some of the talks at the Flatiron Institute on Transport, Diffusions, and Sampling last year, including Marilou Gabrié’s and Arnaud Doucet’s.
The concept behind it remains attractive to me, at least conceptually, since it consists in approximating the target distribution, known either up to a constant (a setting I have always felt standard simulation techniques were not exploiting to the maximum) or through a sample (a less convincing setting since a sample from the target is already there), via a sequence of (particle-approximated) distributions, using the discrepancy between the current distribution and the target, or a gradient thereof, to move the particles. (With no randomness involved in the Kernel Stein Discrepancy Descent algorithm.)
Anna Korba spoke about practically running the algorithm, as well as about convexity properties and some convergence results (with mixed performances for the Stein kernel, as opposed to SVGD). I remain definitely curious about aspects of the method, like the (ergodic) distribution of the endpoints, the actual gain over an MCMC sample when accounting for computing time, the improvement over the empirical distribution when using a sample from π and its ecdf as a substitute for π, and the meaning of error estimation in this context.

“exponential convergence (of the KL) for the SVGD gradient flow does not hold whenever π has exponential tails and the derivatives of ∇ log π and k grow at most at a polynomial rate”
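For concreteness, here is a minimal numpy sketch of the (randomness-free) particle update in the SVGD flavour of these discrepancy descents, assuming a standard Gaussian target (so the score grad_log_pi is available in closed form), an RBF kernel with fixed bandwidth h, and a fixed step size eps; all of these choices, and the helper names, are mine and not the seminar's.

import numpy as np

def grad_log_pi(x):
    # score of the target; here a standard Gaussian, the only case spelled out
    return -x

def svgd_step(x, h=1.0, eps=0.1):
    # one deterministic SVGD update of the n particles in x (an n x d array)
    diff = x[:, None, :] - x[None, :, :]                       # diff[j, i] = x_j - x_i
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h ** 2))     # RBF kernel matrix k(x_j, x_i)
    grad_k = -diff / h ** 2 * k[:, :, None]                    # gradient of k(x_j, x_i) in x_j
    phi = (k[:, :, None] * grad_log_pi(x)[:, None, :] + grad_k).mean(axis=0)
    return x + eps * phi

x = np.random.randn(100, 2) + 3.0                              # particles started away from the target
for _ in range(500):
    x = svgd_step(x)

The KSD descent discussed in the talk, as I understand it, instead follows the gradient of the (squared) kernel Stein discrepancy with respect to the particle locations, but the overall structure of particles moving along a discrepancy gradient is the same.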

Bayesian differential privacy for free?

Posted in Books, pictures, Statistics on September 24, 2023 by xi'an

“We are interested in the question of how we can build differentially-private algorithms within the Bayesian framework. More precisely, we examine when the choice of prior is sufficient to guarantee differential privacy for decisions that are derived from the posterior distribution (…) we show that the Bayesian statistician’s choice of prior distribution ensures a base level of data privacy through the posterior distribution; the statistician can safely respond to external queries using samples from the posterior.”

Recently I came across this 2016 JMLR paper by Christos Dimitrakakis et al. on “how Bayesian inference itself can be used directly to provide private access to data, with no modification.” Which comes as a surprise since it implies that Bayesian sampling would be enough, per se, to keep both the data private and the information it conveys available. The main assumption on which this result is based is a Lipschitz continuity condition on the log model density, namely that, for a specific (pseudo-)distance ρ,

|\log f(x|\theta)-\log f(y|\theta)|\le L\rho(x,y)

uniformly in θ over a set Θ with enough prior mass

\pi(\Theta)\ge 1-e^{-\epsilon}

for some ε>0. In this case, the Kullback-Leibler divergence between the posteriors π(θ|x) and π(θ|y) is bounded by a constant times ρ(x,y), the constant being 2L when Θ is the entire parameter space. This condition ensures differential privacy for the posterior distribution (and even more so for an associated MCMC sample); more precisely, the posterior is (2L,0)-differentially private when Θ is the entire parameter space. While there is an efficiency issue linked with the result, since the bound L is set by the model and hence immovable, this remains a fundamental result for the field (as shown by its high number of citations).
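To fix ideas, here is a toy numerical check of the bound (my own illustration, not taken from the paper): for a Laplace likelihood f(x|θ) ∝ exp(−|x−θ|/b), the log-density is Lipschitz in x with constant L=1/b when ρ is the absolute distance, so the Kullback-Leibler divergence between the posteriors attached to two observations x and y should stay below 2L|x−y| when Θ is the whole real line.

import numpy as np

b, L = 1.0, 1.0                                     # Laplace scale and Lipschitz constant 1/b
theta = np.linspace(-10, 10, 4001)                  # grid over the (entire) parameter space
prior = np.exp(-theta ** 2 / 2)                     # unnormalised N(0,1) prior

def posterior(x):
    # grid approximation of pi(theta | x) under the Laplace likelihood
    post = prior * np.exp(-np.abs(x - theta) / b)
    return post / np.trapz(post, theta)

x, y = 0.3, 0.8
px, py = posterior(x), posterior(y)
kl = np.trapz(px * np.log(px / py), theta)
print(kl, "<=", 2 * L * abs(x - y))                 # KL(pi(.|x) || pi(.|y)) versus 2 L rho(x, y)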

learning optimal summary statistics

Posted in Books, pictures, Statistics on July 27, 2022 by xi'an

“Despite the pursuit of the holy grail of sufficient statistics, most applications will have to settle for the weakest concept of optimal statistics.”

Quiz #1: How does Bayes sufficiency [which preserves the posterior density] differ from sufficiency [which preserves the likelihood function]?

Quiz #2: How does Fisher-information sufficiency [which preserves the information matrix] differ from standard sufficiency [which preserves the likelihood function]?

Read a recent arXival by Till Hoffmann and Jukka-Pekka Onnela that I frankly found most puzzling… Maybe due to the Norman train I was traveling on being particularly noisy.

The argument in the paper is to find a summary statistic that minimises the [empirical] expected posterior entropy, which equivalently means minimising the expected Kullback-Leibler divergence to the full posterior. And maximising the mutual information between the parameters θ and the summaries t(.). And maximising the expected surprise. Which obviously requires breaking the sample into iid components and hence considering the gain brought by a specific transform of a single observation. The paper also contains a long comparison with other criteria for choosing summaries.
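The equivalence is essentially a one-line identity (my shorthand, not the paper's notation), writing H(·|·) for the conditional entropy and I(·;·) for the mutual information:

\mathbb{E}_x\big[\mathrm{KL}\big(\pi(\theta|x)\,\|\,\pi(\theta|t(x))\big)\big]=H(\theta|t(x))-H(\theta|x)=I(\theta;x)-I(\theta;t(x))

so that, since H(θ|x) and I(θ;x) do not depend on t, minimising the expected posterior entropy H(θ|t(x)), minimising the expected Kullback-Leibler divergence to the full posterior, and maximising I(θ;t(x)) are one and the same programme.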

“Minimizing the posterior entropy would discard the sufficient statistic t such that the posterior is equal to the prior–we have not learned anything from the data.”

Furthermore, the expected aspect of the criterion takes us away from a proper Bayes analysis (and exhibits artifacts such as the one above), which somehow makes me question the relevance of comparing entropies under different distributions. It took me a long while to realise that the collection of summaries was set by the user and quite limited, like a neural network representation of the posterior mean. And that the intractable posterior is further approximated by a closed-form function of the parameter θ and of the summary t(.), using there a neural density estimator, or a mixture density network.
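As a (heavily simplified) sketch of what optimising this criterion looks like in practice, assuming access to prior-predictive simulations (θᵢ, xᵢ), a toy summary network t_net, and a Gaussian conditional density q standing in for the paper's mixture density network, one maximises the empirical average of log q(θ|t(x)), i.e. minimises the empirical expected posterior entropy; every name and dimension below is an assumption of mine.

import torch
import torch.nn as nn

t_net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))  # summary t(x), dimension 2
q_net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))   # mean and log-sd of a Gaussian q(theta | t)
opt = torch.optim.Adam(list(t_net.parameters()) + list(q_net.parameters()), lr=1e-3)

def neg_log_q(theta, x):
    # empirical expected posterior entropy (up to an additive constant)
    mean, log_sd = q_net(t_net(x)).chunk(2, dim=-1)
    return (0.5 * ((theta - mean) / log_sd.exp()) ** 2 + log_sd).mean()

theta = torch.randn(256, 1)                    # draws from a N(0,1) prior
x = theta + torch.randn(256, 10)               # hypothetical simulator: ten noisy copies of theta
opt.zero_grad()
neg_log_q(theta, x).backward()
opt.step()                                     # one gradient step; iterate over fresh batches in practice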

statistical analysis of GANs

Posted in Books, Statistics on May 24, 2021 by xi'an

My friend Gérard Biau and his coauthors published a paper in the Annals of Statistics last year on the theoretical [statistical] analysis of GANs, which I had missed and recently read with definite interest in the issues. (With no image example!)

If the discriminator is unrestricted, the unique optimal solution is the Bayes posterior probability

\dfrac{p^\star(x)}{p^\star(x)+p_\theta(x)}

when the model density is everywhere positive. And the optimal parameter θ corresponds to the closest model in terms of Kullback-Leibler divergence, that is, the pseudo-true value of the parameter. This is however the ideal situation, while in practice D is restricted to a parametric family. In this case, if the family is wide enough to approximate the ideal discriminator in the sup norm, with an error of order ε, and if the parameter space Θ is compact, the optimal parameter found under the restricted family approximates the pseudo-true value in the sense of the GAN loss, at the order ε². With a stronger assumption on the family’s ability to approximate any discriminator, the same property holds for the empirical version (and in expectation). (As an aside, the figure illustrating this property confusingly uses a histogramesque rectangle to indicate the expectation of the discriminator loss!) And both parameter (θ and α) estimators converge to the optimal ones with the sample size. An interesting foray by statisticians into a method whose statistical properties are rarely if ever investigated. Missing a comparison with alternative approaches, like MLE, though.
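For the record, the expression for the unrestricted optimal discriminator follows from pointwise maximisation of the GAN criterion, since d ↦ a log d + b log(1−d) is maximised at d = a/(a+b):

D^\star(x)=\arg\max_{d\in(0,1)}\;p^\star(x)\log d+p_\theta(x)\log(1-d)=\dfrac{p^\star(x)}{p^\star(x)+p_\theta(x)}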

sequential neural likelihood estimation as ABC substitute

Posted in Books, Kids, Statistics, University life on May 14, 2020 by xi'an

A JMLR paper by Papamakarios, Sterratt, and Murray (Edinburgh), first presented at the AISTATS 2019 meeting, on a new form of likelihood-free inference, away from non-zero tolerance and from the distance-based versions of ABC, following earlier papers by Iain Murray and co-authors in the same spirit. Which I got pointed to during the ABC workshop in Vancouver. At the time I had no idea as to what autoregressive flows meant. We were supposed to hold a reading group in Paris-Dauphine on this paper last week, unfortunately cancelled as a coronaviral precaution… Here are some notes I had prepared for the meeting that did not take place.

“A simulator model is a computer program, which takes a vector of parameters θ, makes internal calls to a random number generator, and outputs a data vector x.”

Just the usual generative model then.

“A conditional neural density estimator is a parametric model q(.|φ) (such as a neural network) controlled by a set of parameters φ, which takes a pair of datapoints (u,v) and outputs a conditional probability density q(u|v,φ).”

Less usual, in that the outcome is guaranteed to be a probability density.

“For its neural density estimator, SNPE uses a Mixture Density Network, which is a feed-forward neural network that takes x as input and outputs the parameters of a Gaussian mixture over θ.”

In which theoretical sense would it improve upon classical or Bayesian density estimators? Where are the error evaluation, the optimal rates, the sensitivity to the dimension of the data? of the parameter?

“Our new method, Sequential Neural Likelihood (SNL), avoids the bias introduced by the proposal, by opting to learn a model of the likelihood instead of the posterior.”

I do not get the argument, in that the final outcome (of using the approximation within an MCMC scheme) remains biased since the likelihood is not the exact likelihood. Where is the error evaluation? Note that in the associated Algorithm 1, the learning set is enlarged at each round, as in AMIS, rather than reset to the empty set ∅.
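To make the structure of Algorithm 1 concrete, here is a toy and deliberately naïve rendering of the loop, with a Gaussian linear regression standing in for the neural conditional density estimator and a crude random-walk Metropolis step targeting the surrogate posterior; the simulator, prior, and all tuning constants are invented for the illustration.

import numpy as np

rng = np.random.default_rng(0)
x0 = 1.5                                                # the observed data point
simulate = lambda th: th + rng.normal()                 # toy simulator: x | theta ~ N(theta, 1)
log_prior = lambda th: -0.5 * th ** 2                   # N(0,1) prior, up to a constant

def fit_conditional(thetas, xs):
    # fit x | theta ~ N(a + b*theta, s2) by least squares: the "learned likelihood"
    A = np.column_stack([np.ones_like(thetas), thetas])
    coef, *_ = np.linalg.lstsq(A, xs, rcond=None)
    s2 = np.mean((xs - A @ coef) ** 2)
    return lambda x, th: -0.5 * (x - coef[0] - coef[1] * th) ** 2 / s2 - 0.5 * np.log(s2)

D_theta, D_x = [], []
proposal_sample = lambda: rng.normal()                  # round-one proposal is the prior
for r in range(5):                                      # rounds
    new_thetas = [proposal_sample() for _ in range(200)]
    D_theta += new_thetas                               # the training set keeps growing,
    D_x += [simulate(th) for th in new_thetas]          # it is never reset between rounds
    log_lik = fit_conditional(np.array(D_theta), np.array(D_x))
    log_post = lambda th, ll=log_lik: log_prior(th) + ll(x0, th)
    chain, th = [], 0.0                                 # random-walk Metropolis on the surrogate posterior
    for _ in range(2000):
        prop = th + 0.5 * rng.normal()
        if np.log(rng.uniform()) < log_post(prop) - log_post(th):
            th = prop
        chain.append(th)
    proposal_sample = lambda c=chain: rng.choice(c[1000:])   # next-round proposal: surrogate posterior draws

In the paper, fit_conditional is of course a masked autoregressive flow and the Metropolis step a proper MCMC run, but the growing training set is the point here.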

“…given enough simulations, a sufficiently flexible conditional neural density estimator will eventually approximate the likelihood in the support of the proposal, regardless of the shape of the proposal. In other words, as long as we do not exclude parts of the parameter space, the way we propose parameters does not bias learning the likelihood asymptotically. Unlike when learning the posterior, no adjustment is necessary to account for our proposing strategy.”

This is a rather vague statement, with the only support being that the Monte Carlo approximation to the Kullback-Leibler divergence does converge to its actual value, i.e., a direct application of the Law of Large Numbers! But an interesting point, which I informally made a (long) while ago, is that all that matters is the estimate of the density at x⁰, or at the value of the statistic at x⁰. The masked auto-encoder density estimator is based on a sequence of bijections with a lower-triangular Jacobian matrix, meaning the conditional density estimate is available in closed form. Which makes it sound like a form of neurotic variational Bayes solution.

The paper also links with ABC (too costly?), other parametric approximations to the posterior (like Gaussian copulas and variational likelihood-free inference), synthetic likelihood, Gaussian processes, noise contrastive estimation… With experiments involving some of the above. But the experiments involve rather smooth models with relatively few parameters.

“A general question is whether it is preferable to learn the posterior or the likelihood (…) Learning the likelihood can often be easier than learning the posterior, and it does not depend on the choice of proposal, which makes learning easier and more robust (…) On the other hand, methods such as SNPE return a parametric model of the posterior directly, whereas a further inference step (e.g. variational inference or MCMC) is needed on top of SNL to obtain a posterior estimate”

A fair point in the conclusion. Which also mentions the curse of dimensionality (both for parameters and observations) and the possibility of working directly with summaries.

Getting back to the earlier and connected Masked autoregressive flow for density estimation paper, by Papamakarios, Pavlakou and Murray:

“Viewing an autoregressive model as a normalizing flow opens the possibility of increasing its flexibility by stacking multiple models of the same type, by having each model provide the source of randomness for the next model in the stack. The resulting stack of models is a normalizing flow that is more flexible than the original model, and that remains tractable.”

Which makes it sound like a sort of neural network in the density space, optimised by Kullback-Leibler minimisation to get asymptotically close to the likelihood. But a form of Bayesian indirect inference in the end, namely an MLE on a pseudo-model, using the estimated model as a proxy in Bayesian inference…
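For the record, the stacking rests on nothing more than iterating the change-of-variable formula: with u₀ drawn from a base density p₀ and x obtained by passing u₀ through the bijections f₁,…,f_K,

\log p(x)=\log p_0(u_0)-\sum_{k=1}^K \log\left|\det\frac{\partial f_k}{\partial u_{k-1}}\right|,\qquad u_k=f_k(u_{k-1}),\quad x=u_K,

and the autoregressive structure makes each Jacobian lower triangular, so each determinant reduces to the product of its diagonal terms.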