Archive for Gaussian processes

sample-efficient inference for simulators: complex noise models and time-series [One World ABC seminar]

Posted in Statistics on February 18, 2023 by xi'an

The next One World ABC seminar will take place next Thursday, 23 Feb, at 9:30 UK time, with a talk by Alexander Aushev, on the above, based on a paper with Tran, Pesonen, Howes, and Kaski:

Simulators are becoming more complex, with their parameter inference requiring as few simulations as possible. This talk will go over two likelihood-free inference (LFI) challenges for computationally intensive simulators. The first challenge is modeling complex simulator noise, which is frequently oversimplified by existing methods or needs far too many simulations. I will discuss how LFI can handle multimodal, non-stationary, and heteroscedastic noise distributions in Bayesian Optimization by using deep Gaussian processes as surrogate models. The second challenge involves simulators in time-series settings, in which the observed time-series data is generated by an unknown stochastic process of simulator parameters. Modern LFI methods, in such cases, either require an accurate model of parameter transition dynamics (e.g. available for sampling) or assume it to be linear. In the last part of the talk, I will discuss the challenges and solutions for performing LFI in such time-series settings, which involve learning the unknown transition dynamics of simulator parameters.
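For concreteness, here is a minimal sketch of the Bayesian-optimisation side of this programme, with a plain scikit-learn GP standing in for the deep Gaussian process surrogate of the paper; the simulator, discrepancy, bounds and acquisition rule below are all illustrative stand-ins, not the authors' implementation.

```python
# Minimal sketch of BO-based likelihood-free inference with a GP surrogate on
# the discrepancy (the paper replaces this plain GP with a deep GP to handle
# multimodal / heteroscedastic simulator noise). All names are illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def simulator(theta, rng):                 # hypothetical expensive simulator
    return rng.normal(theta, 1.0, size=50)

def discrepancy(x_sim, x_obs):             # distance between simulated and observed data
    return abs(x_sim.mean() - x_obs.mean())

rng = np.random.default_rng(0)
x_obs = rng.normal(2.0, 1.0, size=50)      # "observed" data
thetas = list(rng.uniform(-5, 5, size=10)) # initial design
discs = [discrepancy(simulator(t, rng), x_obs) for t in thetas]

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(30):                        # BO loop: one simulation per iteration
    gp.fit(np.array(thetas)[:, None], discs)
    cand = np.linspace(-5, 5, 500)[:, None]
    mu, sd = gp.predict(cand, return_std=True)
    t_next = float(cand[np.argmin(mu - 1.96 * sd), 0])  # lower-confidence-bound acquisition
    thetas.append(t_next)
    discs.append(discrepancy(simulator(t_next, rng), x_obs))

# Small predicted discrepancy ~ high approximate likelihood, as in BOLFI.
```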

Basque thesis defence [Bayes almost on the beach]

Posted in Books, Kids, pictures, Statistics, Travel, University life on October 21, 2021 by xi'an

Yesterday morning I took part in a thesis defence (as a jury member) in the coastal city of Anglet, in the (French part of the) Basque Country. The PhD candidate was Sébastien Coube-Sisqueille, whom I did not know directly (although we had crossed paths at CIRM years ago and he had attended my MCMC course at ENSAE even more years ago). As it happened all other members of the committee, apart from Sébastien’s advisor, Benoît Liquet, were on Teams, being unable to travel to the Basque Country. Sébastien’s thesis is about MCMC strategies to accelerate convergence in spatial models represented as nearest neighbor Gaussian processes (NNGP), which relates to the earlier works of (X)XL on interweaving. (Unsurprisingly, the defence was successful and the candidate awarded his PhD!) Icing on the cake, I managed to take a dip in the Atlantic Ocean, before flying back to Paris for dinner, on a very warm afternoon (and slightly cooler water), thanks to Sébastien driving me to a nearby beach!

sequential neural likelihood estimation as ABC substitute

Posted in Books, Kids, Statistics, University life on May 14, 2020 by xi'an

A JMLR paper by Papamakarios, Sterratt, and Murray (Edinburgh), first presented at the AISTATS 2019 meeting, on a new form of likelihood-free inference, away from non-zero tolerance and from the distance-based versions of ABC, following earlier papers by Iain Murray and co-authors in the same spirit. Which I got pointed to during the ABC workshop in Vancouver. At the time I had no idea as to what autoregressive flows meant. We were supposed to hold a reading group in Paris-Dauphine on this paper last week, unfortunately cancelled as a coronaviral precaution… Here are some notes I had prepared for the meeting that did not take place.

"A simulator model is a computer program, which takes a vector of parameters θ, makes internal calls to a random number generator, and outputs a data vector x."

Just the usual generative model then.

“A conditional neural density estimator is a parametric model q(.|φ) (such as a neural network) controlled by a set of parameters φ, which takes a pair of datapoints (u,v) and outputs a conditional probability density q(u|v,φ).”

Less usual, in that the outcome is guaranteed to be a probability density.

“For its neural density estimator, SNPE uses a Mixture Density Network, which is a feed-forward neural network that takes x as input and outputs the parameters of a Gaussian mixture over θ.”

In which theoretical sense would it improve upon classical or Bayesian density estimators? Where are the error evaluation, the optimal rates, the sensitivity to the dimension of the data? of the parameter?
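To fix ideas, here is a minimal and purely illustrative PyTorch version of such a mixture density network, mapping x to the weights, means and scales of a Gaussian mixture over θ; layer sizes and the single training pass are arbitrary, not those of SNPE.

```python
# Minimal mixture density network: maps data x to the parameters of a
# Gaussian mixture over the parameter theta (illustrative sizes only).
import torch
import torch.nn as nn

class MDN(nn.Module):
    def __init__(self, x_dim, theta_dim, n_comp=5, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.logits = nn.Linear(hidden, n_comp)              # mixture weights
        self.means = nn.Linear(hidden, n_comp * theta_dim)   # component means
        self.log_sd = nn.Linear(hidden, n_comp * theta_dim)  # component scales
        self.n_comp, self.theta_dim = n_comp, theta_dim

    def log_prob(self, theta, x):
        h = self.body(x)
        log_w = torch.log_softmax(self.logits(h), dim=-1)
        mu = self.means(h).view(-1, self.n_comp, self.theta_dim)
        sd = self.log_sd(h).view(-1, self.n_comp, self.theta_dim).exp()
        comp = torch.distributions.Normal(mu, sd)
        log_p = comp.log_prob(theta.unsqueeze(1)).sum(-1)    # per-component density
        return torch.logsumexp(log_w + log_p, dim=-1)        # q(theta | x)

# one illustrative forward/backward pass on simulated pairs (theta, x)
mdn = MDN(x_dim=10, theta_dim=3)
theta, x = torch.randn(128, 3), torch.randn(128, 10)
loss = -mdn.log_prob(theta, x).mean()
loss.backward()
```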

“Our new method, Sequential Neural Likelihood (SNL), avoids the bias introduced by the proposal, by opting to learn a model of the likelihood instead of the posterior.”

I do not get the argument, in that the final outcome (of using the approximation within an MCMC scheme) remains biased, since the likelihood is not the exact likelihood. Where is the error evaluation? Note that in the associated Algorithm 1, the learning set is enlarged on each round, as in AMIS, rather than reset to the empty set ∅.
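For the record, a toy sketch of that round structure, with crude stand-ins (a linear-Gaussian "density estimator" and a random-walk Metropolis kernel) replacing the neural components; the only point being made is that the training set D keeps growing across rounds.

```python
# Sketch of the round structure of SNL (Algorithm 1): the training set D grows
# across rounds, it is never reset. All components below are toy stand-ins.
import numpy as np

rng = np.random.default_rng(1)
x_obs = rng.normal(2.0, 1.0, size=20)

def prior_sample(n):                 # toy prior on a scalar theta
    return rng.uniform(-5, 5, size=n)

def simulate(theta):                 # toy simulator standing in for the real one
    return rng.normal(theta, 1.0, size=20)

def fit_conditional_density(D):      # stand-in for training q_phi(x | theta);
    thetas = np.array([t for t, _ in D])            # here: a crude Gaussian fit
    means = np.array([x.mean() for _, x in D])      # of summary(x) given theta
    coefs = np.polyfit(thetas, means, 1)
    return lambda x, t: -0.5 * (x.mean() - np.polyval(coefs, t)) ** 2

def mcmc_sample(logpost, n, start=0.0, scale=0.5):  # random-walk Metropolis
    th, out = start, []
    for _ in range(n):
        prop = th + scale * rng.normal()
        if np.log(rng.uniform()) < logpost(prop) - logpost(th):
            th = prop
        out.append(th)
    return np.array(out)

D = []                                              # training set, shared across rounds
proposal = prior_sample(50)
for r in range(5):                                  # SNL rounds
    D += [(t, simulate(t)) for t in proposal]       # enlarge D, as in AMIS
    loglik = fit_conditional_density(D)             # (re)train on *all* pairs so far
    logpost = lambda t: loglik(x_obs, t) if -5 < t < 5 else -np.inf
    proposal = mcmc_sample(logpost, 500)[::10]      # next round's proposal: current posterior
```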

"…given enough simulations, a sufficiently flexible conditional neural density estimator will eventually approximate the likelihood in the support of the proposal, regardless of the shape of the proposal. In other words, as long as we do not exclude parts of the parameter space, the way we propose parameters does not bias learning the likelihood asymptotically. Unlike when learning the posterior, no adjustment is necessary to account for our proposing strategy."

This is a rather vague statement, with the only support being that the Monte Carlo approximation to the Kullback-Leibler divergence does converge to its actual value, i.e. a direct application of the Law of Large Numbers! But an interesting point I informally made a (long) while ago is that all that matters is the estimate of the density at x⁰. Or at the value of the statistic at x⁰. The masked auto-encoder density estimator is based on a sequence of bijections with a lower-triangular Jacobian matrix, meaning the conditional density estimate is available in closed form. Which makes it sound like a form of neurotic variational Bayes solution.
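A tiny numerical illustration of this closed-form property, with fixed linear conditioners in place of the masked networks of MADE/MAF (all matrices and dimensions being arbitrary):

```python
# Why the density is available in closed form: each autoregressive layer maps
# x -> u with u_i depending only on x_1..x_i, so the Jacobian is lower
# triangular and its log-determinant is just a sum of diagonal terms.
# Illustration with fixed linear conditioners instead of masked networks.
import numpy as np
from scipy.stats import norm

def maf_layer(x, W_mu, W_alpha):
    """x -> (u, log|det du/dx|), with mu_i, alpha_i functions of x_{<i} only."""
    mu = np.tril(W_mu, k=-1) @ x          # strictly lower triangular: uses x_{<i}
    alpha = np.tril(W_alpha, k=-1) @ x    # log-scale, also a function of x_{<i}
    u = (x - mu) * np.exp(-alpha)
    return u, -alpha.sum()                # diagonal of the Jacobian is exp(-alpha_i)

def log_density(x, layers):
    logdet = 0.0
    for W_mu, W_alpha in layers:          # stacking layers = composing bijections
        x, ld = maf_layer(x, W_mu, W_alpha)
        logdet += ld
    return norm.logpdf(x).sum() + logdet  # standard normal base density

rng = np.random.default_rng(0)
d = 4
layers = [(0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d))) for _ in range(3)]
print(log_density(rng.normal(size=d), layers))
```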

The paper also links with ABC (too costly?), other parametric approximations to the posterior (like Gaussian copulas and variational likelihood-free inference), synthetic likelihood, Gaussian processes, noise contrastive estimation… With experiments involving some of the above. But the experiments involve rather smooth models with relatively few parameters.

“A general question is whether it is preferable to learn the posterior or the likelihood (…) Learning the likelihood can often be easier than learning the posterior, and it does not depend on the choice of proposal, which makes learning easier and more robust (…) On the other hand, methods such as SNPE return a parametric model of the posterior directly, whereas a further inference step (e.g. variational inference or MCMC) is needed on top of SNL to obtain a posterior estimate”

A fair point in the conclusion. Which also mentions the curse of dimensionality (both for parameters and observations) and the possibility to work directly with summaries.

Getting back to the earlier and connected Masked autoregressive flow for density estimation paper, by Papamakarios, Pavlakou and Murray:

“Viewing an autoregressive model as a normalizing flow opens the possibility of increasing its flexibility by stacking multiple models of the same type, by having each model provide the source of randomness for the next model in the stack. The resulting stack of models is a normalizing flow that is more flexible than the original model, and that remains tractable.”

Which makes it sound like a sort of neural network in the density space. Optimised by Kullback-Leibler minimisation to get asymptotically close to the likelihood. But a form of Bayesian indirect inference in the end, namely an MLE on a pseudo-model, using the estimated model as a proxy in Bayesian inference…

BayesComp 2020 at a glance

Posted in Statistics, Travel, University life on December 18, 2019 by xi'an

delayed-acceptance. ADA boosted

Posted in Statistics on August 11, 2019 by xi'an

Samuel Wiqvist and co-authors from Scandinavia have recently arXived a paper on a new version of delayed acceptance MCMC. The ADA in the novel algorithm stands for approximate and accelerated, where the approximation in the first stage is to use a Gaussian process as a surrogate for the likelihood. In our own approach to delayed acceptance, we instead used partial likelihoods built on data subsets, ordering them so that the most varying sub-likelihoods were evaluated first. Furthermore, when a proposal reaches the second stage here, the full likelihood is not necessarily evaluated, the decision relying instead on the global probability that the second stage accepts or rejects. Which of course creates a further approximation. Even when using a local predictor of this probability. The outcome of a comparison on two complex models is that the delayed approach does not necessarily do better than particle MCMC in terms of effective sample size per second, since it rejects significantly more. Using various types of surrogate likelihoods and assessing the approximation effect could boost the appeal of the method. Maybe using ABC first could suggest another surrogate?
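For illustration, a bare-bones version of the underlying two-stage delayed-acceptance step (à la Christen and Fox, 2005), with an arbitrary cheap stand-in for the GP surrogate and a flat prior; the ADA-specific skipping of second-stage evaluations is not reproduced here.

```python
# Sketch of a two-stage delayed-acceptance Metropolis step: a cheap surrogate
# screens proposals first, and only survivors pay for the full likelihood,
# with a second-stage ratio that corrects for the surrogate.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(1.5, 1.0, size=200)

def full_loglik(theta):                      # expensive target (toy here)
    return -0.5 * np.sum((y - theta) ** 2)

def cheap_loglik(theta):                     # surrogate, e.g. a GP mean prediction
    return -0.5 * len(y) * (y.mean() - theta) ** 2   # (stand-in approximation)

def da_step(theta, scale=0.1):
    prop = theta + scale * rng.normal()      # symmetric random-walk proposal
    # stage 1: accept/reject on the surrogate only (cheap)
    a1 = min(1.0, np.exp(cheap_loglik(prop) - cheap_loglik(theta)))
    if rng.uniform() > a1:
        return theta, 0                      # stage-1 rejections never touch full_loglik
    # stage 2: correct with the full likelihood (expensive, evaluated less often)
    a2 = min(1.0, np.exp(full_loglik(prop) - full_loglik(theta)
                         + cheap_loglik(theta) - cheap_loglik(prop)))
    return (prop, 1) if rng.uniform() < a2 else (theta, 1)

theta, n_full = 0.0, 0
for _ in range(5000):
    theta, used_full = da_step(theta)
    n_full += used_full
print(theta, n_full)                         # n_full counts full-likelihood evaluations
```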
