practical PDMP

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on December 9, 2021 by xi'an

While in Warwick, last month, I attended a reading group on PDMPs where Filippo Pagani talked about practical PDMP, connected with a recent arXival by Bertazzi, Bierkens and Dobson. The central question when implementing PDMP is to find a realistic way of solving

$\int_0^\tau \lambda(x+tv,v)\text dt = \epsilon\quad\epsilon\sim\mathcal Exp(1)$

to decide on the stopping time (when the process ceases to be deterministic). The usual approach is to use Poisson thinning by finding an upper bound on λ, but this is either difficult or potentially inefficient (and sometimes both).

“finding a sharp bound M(s) [for Poisson thinning] can be an extremely challenging problem in most practical settings (…) In order to overcome this problem, we introduce discretisation schemes for PDMPs which make their
approximate simulation possible.”

Some of the solutions proposed in Bertazzi et al. are relying on

1. using a frozen (fixed) λ
2. discretising time and the integral (first order scheme)
3. allowing for more than a jump over a time interval (higher order schemes)
4. going through control variates (when gradient is Lipschitz and Hessian bounded, with known constants) as it produces a linear rate λ
5. subsampling (at least for Zig Zag)

with theoretical guarantees that the approximations are convergent, as the time step goes to zero. They (almost obviously) remain model dependent solutions (with illustrations for the Zig Zag and bouncy particle versions), with little worse case scenarios, but this is an extended investigation into making PDMPs more manageable!

ISBA 2021.3

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life, Wines with tags , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , on July 1, 2021 by xi'an

Now on the third day which again started early with a 100% local j-ISBA session. (After a group run to and around Mont Puget, my first real run since 2020!!!) With a second round of talks by junior researchers from master to postdoc level. Again well-attended. A talk about Bayesian non-parametric sequential taxinomy by Alessandro Zito used the BayesANT acronym, which reminded me of the new vave group Adam and the Ants I was listening to forty years ago, in case they need a song as well as a logo! (Note that BayesANT is also used for a robot using Bayesian optimisation!) And more generally a wide variety in the themes. Thanks to the j-organisers of this 100% live session!

The next session was on PDMPs, which I helped organise, with Manon Michel speaking from Marseille, exploiting the symmetry around the gradient, which is distribution-free! Then, remotely, Kengo Kamatani, speaking from Tokyo, who expanded the high-dimensional scaling limit to the Zig-Zag sampler, exhibiting an argument against small refreshment rates, and Murray Pollock, from Newcastle, who exposed quite clearly the working principles of the Restore algorithm, including why coupling from the past was available in this setting. A well-attended session despite the early hour (in the USA).

Another session of interest for me [which I attended by myself as everyone else was at lunch in CIRM!] was the contributed C16 on variational and scalable inference that included a talk on hierarchical Monte Carlo fusion (with my friends Gareth and Murray as co-authors), Darren’s call to adopt functional programming in order to save Bayesian computing from extinction, normalising flows for modularisation, and Dennis’ adversarial solutions for Bayesian design, avoiding the computation of the evidence.

Wes Johnson’s lecture was about stories with setting prior distributions based on experts’ opinions. Which reminded me of the short paper Kaniav Kamary and myself wrote about ten years ago, in response to a paper on the topic in the American Statistician. And could not understand the discrepancy between two Bayes factors based on Normal versus Cauchy priors, until I was told they were mistakenly used repeatedly.

Rushing out of dinner, I attended both the non-parametric session (live with Marta and Antonio!) and the high-dimension computational session on Bayesian model choice (mute!). A bit of a schizophrenic moment, but allowing to get a rough picture in both areas. At once. Including an adaptive MCMC scheme for selecting models by Jim Griffin. Which could be run directly over the model space. With my ever-going wondering at the meaning of neighbour models.

Posted in Books, Mountains, pictures, Statistics with tags , , , , , , , , , , , , , on April 6, 2020 by xi'an

“…expensive in these terms, as for each root, Λ(x(s),v) (at the cost of one epoch) has to be evaluated for each root finding iteration, for each node of the numerical integral

When using the ZigZag sampler, the main (?) difficulty is in producing velocity switch as the switches are produced as interarrival times of an inhomogeneous Poisson process. When the rate of this process cannot be integrated out in an analytical manner, the only generic approach I know is in using Poisson thinning, obtained by finding an integrable upper bound on this rate, generating from this new process and subsampling. Finding the bound is however far from straightforward and may anyway result in an inefficient sampler. This new paper by Simon Cotter, Thomas House and Filippo Pagani makes several proposals to simplify this simulation, Nuzz standing for numerical ZigZag. Even better (!), their approach is based on what they call the Sellke construction, with Tom Sellke being a probabilist and statistician at Purdue University (trivia: whom I met when spending a postdoctoral year there in 1987-1988) who also wrote a fundamental paper on the opposition between Bayes factors and p-values with Jim Berger.

“We chose as a measure of algorithm performance the largest Kolmogorov-Smirnov (KS) distance between the MCMC sample and true distribution amongst all the marginal distributions.”

The practical trick is rather straightforward in that it sums up as the exponentiation of the inverse cdf method, completed with a numerical resolution of the inversion. Based on the QAGS (Quadrature Adaptive Gauss-Kronrod Singularities) integration routine. In order to save time Kingman’s superposition trick only requires one inversion rather than d, the dimension of the variable of interest. This nuzzled version of ZIgZag can furthermore be interpreted as a PDMP per se. Except that it retains a numerical error, whose impact on convergence is analysed in the paper. In terms of Wasserstein distance between the invariant measures. The paper concludes with a numerical comparison between Nuzz and random walk Metropolis-Hastings, HMC, and manifold MALA, using the number of evaluations of the likelihood as a measure of time requirement. Tuning for Nuzz is described, but not for the competition. Rather dramatically the Nuzz algorithm performs worse than this competition when counting one epoch for each likelihood computation and better when counting one epoch for each integral inversion. Which amounts to perfect inversion, unsurprisingly. As a final remark, all models are more or less Normal, with very smooth level sets, maybe not an ideal range

BayesComp’20

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , on January 10, 2020 by xi'an

First, I really have to congratulate my friend Jim Hobert for a great organisation of the meeting adopting my favourite minimalist principles (no name tag, no “goodies” apart from the conference schedule, no official talks). Without any pretense at objectivity, I also appreciated very much the range of topics and the sweet frustration of having to choose between two or three sessions each time. Here are some notes taken during some talks (with no implicit implication for the talks no mentioned, re. above frustration! as well as very short nights making sudden lapse in concentration highly likely).

On Day 1, Paul Fearnhead’s inaugural plenary talk was on continuous time Monte Carlo methods, mostly bouncy particle and zig-zag samplers, with a detailed explanation on the simulation of the switching times which likely brought the audience up to speed even if they had never heard of them. And an opening on PDMPs used as equivalents to reversible jump MCMC, reminding me of the continuous time (point process) solutions of Matthew Stephens for mixture inference (and of Preston, Ripley, Møller).

The same morn I heard of highly efficient techniques to handle very large matrices and p>n variables selections by Akihiko Nishimura and Ruth Baker on a delayed acceptance ABC, using a cheap proxy model. Somewhat different from indirect inference. I found the reliance on ESS somewhat puzzling given the intractability of the likelihood (and the low reliability of the frequency estimate) and the lack of connection with the “real” posterior. At the same ABC session, Umberto Picchini spoke on a joint work with Richard Everitt (Warwick) on linking ABC and pseudo-marginal MCMC by bootstrap. Actually, the notion of ABC likelihood was already proposed as pseudo-marginal ABC by Anthony Lee, Christophe Andrieu and Arnaud Doucet in the discussion of Fearnhead and Prangle (2012) but I wonder at the focus of being unbiased when the quantity is not the truth, i.e. the “real” likelihood. It would seem more appropriate to attempt better kernel estimates on the distribution of the summary itself. The same session also involved David Frazier who linked our work on ABC for misspecified models and an on-going investigation of synthetic likelihood.

Later, there was a surprise occurrence of the Bernoulli factory in a talk by Radu Herbei on Gaussian process priors with accept-reject algorithms, leading to exact MCMC, although the computing implementation remains uncertain. And several discussions during the poster session, incl. one on the planning of a 2021 workshop in Oaxaca centred on objective Bayes advances as we received acceptance of our proposal by BIRS today!

On Day 2, David Blei gave a plenary introduction to variational Bayes inference and latent Dirichlet allocations, somewhat too introductory for my taste although other participants enjoyed this exposition. He also mentioned a recent JASA paper on the frequentist consistency of variational Bayes that I should check. Speaking later with PhD students, they really enjoyed this opening on an area they did not know that well.

A talk by Kengo Kamatani (whom I visited last summer) on improved ergodicity rates for heavy tailed targets and Crank-NIcholson modifications to the random walk proposal (which uses an AR(1) representation instead of the random walk). With the clever idea of adding the scale of the proposal as an extra parameter with a prior of its own. Gaining one order of magnitude in the convergence speed (i.e. from d to 1 and from d² to d, where d is the dimension), which is quite impressive (and just published in JAP).Veronica Rockova linked Bayesian variable selection and machine learning via ABC, with conditions on the prior for model consistency. And a novel approach using part of the data to learn an ABC partial posterior, which reminded me of the partial  Bayes factors of the 1990’s although it is presumably unrelated. And a replacement of the original rejection ABC via multi-armed bandits, where each variable is represented by an arm, called ABC Bayesian forests. Recalling the simulation trick behind Thompson’s approach, reproduced for the inclusion or exclusion of variates and producing a fixed estimate for the (marginal) inclusion probabilities, which makes it sound like a prior-feeback form of empirical Bayes. Followed by a talk of Gregor Kastner on MCMC handling of large time series with specific priors and a massive number of parameters.

The afternoon also had a wealth of exciting talks and missed opportunities (in the other sessions!). Which ended up with a strong if unintended French bias since I listened to Christophe Andrieu, Gabriel Stolz, Umut Simsekli, and Manon Michel on different continuous time processes, with Umut linking GANs, multidimensional optimal transport, sliced-Wasserstein, generative models, and new stochastic differential equations. Manon Michel gave a highly intuitive talk on creating non-reversibility, getting rid of refreshment rates in PDMPs to kill any form of reversibility.

non-reversibility in discrete spaces

Posted in Books, Statistics, University life with tags , , , , , , , , , on January 3, 2020 by xi'an

Following a recent JASA paper by Giacomo Zanella (which I have not yet read but is discussed on this blog), Sam Power and Jacob Goldman have recently arXived a paper on Accelerated sampling on discrete spaces with non-reversible Markov processes, where they use continuous-time, non-reversible algorithms à la PDMP, even though differential equations do not exist on discrete spaces. More specifically, they devise discrete versions of the coordinate sampler and of the Zig-Zag sampler, using Markov jump processes instead of differential equations, with detailed balance on the jump rate rather than the Markov kernel. A use of jump processes originating at least from Peskun (1973) and connected with MCMC algorithms in Matthew Stephens‘ 1999 PhD thesis. A neat thing about discrete settings is that the jump process can be implemented with no discretisation! However, as we noticed when working on birth-and-death processes with Olivier Cappé and Tobias Rydèn, there is a potential for disastrous implementation if an infinite sequence of instantaneous moves (out of zero probability states) is proposed.

The authors make the further assumption(s) that the discrete space is endowed with a graphical structure with a group G acting upon this graph, with an involution keeping the target (or a completion of the original target) invariant. In this framework, reversibility amounts to repeatedly using (group) generators þ with a low order (as in Bayesian variable selection, binary spin systems, where þ.þ=id, and other permutation problems), since they bring the chain back to its starting point. Their first sampler is called a Tabu sampler for avoiding such behaviour, forcing the next step to use other generators þ in the generator set Þ thanks to a binary auxiliary variable that partitions Þ into forward vs backward moves. For high order generators, the discrete coordinate and Zig-Zag samplers are instead repeatedly using the same generator (although it is unclear to me why this is beneficial, given that neither graph nor generator is not necessarily linked with the target). With the coordinate sampler being again much cheaper since it only looks at one direction in the generator group.

The paper contains a range of comparisons with (only) Zanella’s sampler, some presenting heavy gains in terms of ESS. Including one on hundreds of sensors in a football stadium. As I am not particularly familiar with these examples, except for the Bayesian variable selection one, I found it rather hard to determine whether or not the compared samplers were indeed exploring the entirety of the (highly complex and highly dimensional) target. The collection of examples is however quite rich and support the use of such non-reversible schemes. It may also be that the discrete nature of the target could facilitate the theoretical study of their convergence properties.