Archive for University of Warwick

distilling importance

Posted in Books, Statistics, University life with tags , , , , , , , , , , on November 13, 2019 by xi'an

As I was about to leave Warwick at the end of last week, I noticed a new arXival by Dennis Prangle, distilling importance sampling. In connection with [our version of] population Monte Carlo, “each step of [Dennis’] distilled importance sampling method aims to reduce the Kullback Leibler (KL) divergence from the distilled density to the current tempered posterior.”  (The introduction of the paper points out various connections with ABC, conditional density estimation, adaptive importance sampling, X entropy, &tc.)

“An advantage of [distilled importance sampling] over [likelihood-free] methods is that it performs inference on the full data, without losing information by using summary statistics.”

A notion used therein I had not heard before is the one of normalising flows, apparently more common in machine learning and in particular with GANs. (The slide below is from Shakir Mohamed and Danilo Rezende.) The  notion is to represent an arbitrary variable as the bijective transform of a standard variate like a N(0,1) variable or a U(0,1) variable (calling the inverse cdf transform). The only link I can think of is perfect sampling where the representation of all simulations as a function of a white noise vector helps with coupling.

I read a blog entry by Eric Jang on the topic (who produced this slide among other things) but did not emerge much the wiser. As the text instantaneously moves from the Jacobian formula to TensorFlow code… In Dennis’ paper, it appears that the concept is appealing for quickly producing samples and providing a rich family of approximations, especially when neural networks are included as transforms. They are used to substitute for a tempered version of the posterior target, validated as importance functions and aiming at being the closest to this target in Kullback-Leibler divergence. With the importance function interpretation, unbiased estimators of the gradient [in the parameter of the normalising flow] can be derived, with potential variance reduction. What became clearer to me from reading the illustration section is that the prior x predictive joint can also be modeled this way towards producing reference tables for ABC (or GANs) much faster than with the exact model. (I came across several proposals of that kind in the past months.) However, I deem mileage should vary depending on the size and dimension of the data. I also wonder at the connection between the (final) distribution simulated by distilled importance [the least tempered target?] and the ABC equivalent.

postgraduate open day at Warwick [4 Dec]

Posted in pictures, Statistics, University life with tags , , , , , , , , , , on November 12, 2019 by xi'an

The department of Statistics at the University of Warwick is holding an open day for prospective PhD students on 4 December 2019, starting at 2pm (with free lunch at 1pm). In the Mathematical Sciences Building common room (room MB1.02). The Director of Graduate Studies, Professor Mark Steel, and the PhD admissions tutors Professors Martyn Plummer and Barbel Finkelstadt Rand will give short presentations about what it means to do a PhD, what it means to do it at Warwick, the benefits of a PhD degree, and the application process.

Subsequently there will be an informal meeting, during which students have the possibility to ask questions and find out more about the different PhD opportunities at Warwick Statistics; in fact, we offer a very broad range of possibilities, giving a lot of choice for potential applicants. Current members of staff will be invited to participate, to discuss potential projects.

UK travel expenses will be covered by the Department of Statistics (standard class travel by public transport with pre-booked tickets). Please register if interested in this event.

revisiting the balance heuristic

Posted in Statistics with tags , , , , , , , on October 24, 2019 by xi'an

Last August, Felipe Medina-Aguayo (a former student at Warwick) and Richard Everitt (who has now joined Warwick) arXived a paper on multiple importance sampling (for normalising constants) that goes “exploring some improvements and variations of the balance heuristic via a novel extended-space representation of the estimator, leading to straightforward annealing schemes for variance reduction purposes”, with the interesting side remark that Rao-Blackwellisation may prove sub-optimal when there are many terms in the proposal family, in the sense that not every term in the mixture gets sampled. As already noticed by Victor Elvira and co-authors, getting rid of the components that are not used being an improvement without inducing a bias. The paper also notices that the loss due to using sample sizes rather than expected sample sizes is of second order, compared with the variance of the compared estimators. It further relates to a completion or auxiliary perspective that reminds me of the approaches we adopted in the population Monte Carlo papers and in the vanilla Rao-Blackwellisation paper. But it somewhat diverges from this literature when entering a simulated annealing perspective, in that the importance distributions it considers are freely chosen as powers of a generic target. It is quite surprising that, despite the normalising weights being unknown, a simulated annealing approach produces an unbiased estimator of the initial normalising constant. While another surprise therein is that the extended target associated to their balance heuristic does not admit the right density as marginal but preserves the same normalising constant… (This paper will be presented at BayesComp 2020.)


Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , , , on October 8, 2019 by xi'an

In connection with the recent PhD thesis defence of Juliette Chevallier, in which I took a somewhat virtual part for being physically in Warwick, I read a paper she wrote with Stéphanie Allassonnière on stochastic approximation versions of the EM algorithm. Computing the MAP estimator can be done via some adapted for simulated annealing versions of EM, possibly using MCMC as for instance in the Monolix software and its MCMC-SAEM algorithm. Where SA stands sometimes for stochastic approximation and sometimes for simulated annealing, originally developed by Gilles Celeux and Jean Diebolt, then reframed by Marc Lavielle and Eric Moulines [friends and coauthors]. With an MCMC step because the simulation of the latent variables involves an untractable normalising constant. (Contrary to this paper, Umberto Picchini and Adeline Samson proposed in 2015 a genuine ABC version of this approach, paper that I thought I missed—although I now remember discussing it with Adeline at JSM in Seattle—, ABC is used as a substitute for the conditional distribution of the latent variables given data and parameter. To be used as a substitute for the Q step of the (SA)EM algorithm. One more approximation step and one more simulation step and we would reach a form of ABC-Gibbs!) In this version, there are very few assumptions made on the approximation sequence, except that it converges with the iteration index to the true distribution (for a fixed observed sample) if convergence of ABC-SAEM is to happen. The paper takes as an illustrative sequence a collection of tempered versions of the true conditionals, but this is quite formal as I cannot fathom a feasible simulation from the tempered version and not from the untempered one. It is thus much more a version of tempered SAEM than truly connected with ABC (although a genuine ABC-EM version could be envisioned).

from here to infinity

Posted in Books, Statistics, Travel with tags , , , , , , , , , , , , , on September 30, 2019 by xi'an

“Introducing a sparsity prior avoids overfitting the number of clusters not only for finite mixtures, but also (somewhat unexpectedly) for Dirichlet process mixtures which are known to overfit the number of clusters.”

On my way back from Clermont-Ferrand, in an old train that reminded me of my previous ride on that line that took place in… 1975!, I read a fairly interesting paper published in Advances in Data Analysis and Classification by [my Viennese friends] Sylvia Früwirth-Schnatter and Gertrud Malsiner-Walli, where they describe how sparse finite mixtures and Dirichlet process mixtures can achieve similar results when clustering a given dataset. Provided the hyperparameters in both approaches are calibrated accordingly. In both cases these hyperparameters (scale of the Dirichlet process mixture versus scale of the Dirichlet prior on the weights) are endowed with Gamma priors, both depending on the number of components in the finite mixture. Another interesting feature of the paper is to witness how close the related MCMC algorithms are when exploiting the stick-breaking representation of the Dirichlet process mixture. With a resolution of the label switching difficulties via a point process representation and k-mean clustering in the parameter space. [The title of the paper is inspired from Ian Stewart’s book.]

Bayesian webinar: Bayesian conjugate gradient

Posted in Statistics with tags , , , , , , on September 25, 2019 by xi'an

Bayesian Analysis is launching its webinar series on discussion papers! Meaning the first 90 registrants will be able to participate interactively via the Zoom Conference platform while additional registrants will be able to view the Webinar on a dedicated YouTube Channel. This fantastic initiative is starting with the Bayesian conjugate gradient method of Jon Cockayne (University of Warwick) et al., on October 2 at 4pm Greenwich time. (With available equivalences for other time zones!) I strongly support this initiative and wish it the widest possible success, as it could bring a new standard for conferences, having distant participants gathering in a nearby location to present talks and attend other talks from another part of the World, while effectively participating. An dense enough network could even see the non-stop conference emerging!

unimaginable scale culling

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , , , , , , on September 17, 2019 by xi'an

Despite the evidence brought by ABC on the inefficiency of culling in massive proportions the British Isles badger population against bovine tuberculosis, the [sorry excuse for a] United Kingdom government has permitted a massive expansion of badger culling, with up to 64,000 animals likely to be killed this autumn… Since the cows are the primary vectors of the disease, what about starting with these captive animals?!