## Martin Hairer gets Breakthrough Prize (and \$3M)

Posted in Books, University life on September 14, 2020 by xi'an

Just heard the news that Fields Medallist Martin Hairer (formerly U of Warwick) got the 2021 Breakthrough Prize in Mathematics for his unification theory of stochastic partial differential equations, which he likens to a form of Taylor expansion in the massive Inventiones paper describing this breakthrough. (Looking at the previous winners of the prize, who also made its selection committee, this represents a break from focussing primarily on algebraic geometry! If not from sticking to male recipients…)

We introduce a new notion of “regularity structure” that provides an algebraic framework allowing to describe functions and/or distributions via a kind of “jet” or local Taylor expansion around each point. The main novel idea is to replace the classical polynomial model which is suitable for describing smooth functions by arbitrary models that are purpose-built for the problem at hand. In particular, this allows to describe the local behaviour not only of functions but also of large classes of distributions. We then build a calculus allowing to perform the various operations (multiplication, composition with smooth functions, integration against singular kernels) necessary to formulate fixed point equations for a very large class of semi-linear PDEs driven by some very singular (typically random) input. This allows, for the first time, to give a mathematically rigorous meaning to many interesting stochastic PDEs arising in physics. The theory comes with convergence results that allow to interpret the solutions obtained in this way as limits of classical solutions to regularised problems, possibly modified by the addition of diverging counterterms. These counterterms arise naturally through the action of a “renormalisation group” which is defined canonically in terms of the regularity structure associated to the given class of PDEs. Our theory also allows to easily recover many existing results on singular stochastic PDEs (KPZ equation, stochastic quantisation equations, Burgers-type equations) and to understand them as particular instances of a unified framework. One surprising insight is that in all of these instances local solutions are actually “smooth” in the sense that they can be approximated locally to arbitrarily high degree as linear combinations of a fixed family of random functions/distributions that play the role of “polynomials” in the theory. 
As an example of a novel application, we solve the long-standing problem of building a natural Markov process that is symmetric with respect to the (finite volume) measure describing the $\Phi^4_3$ Euclidean quantum field theory. It is natural to conjecture that the Markov process built in this way describes the Glauber dynamic of 3-dimensional ferromagnets near their critical temperature.

## scalable Metropolis-Hastings

Posted in Books, Statistics, Travel on February 12, 2019 by xi'an

Among the flurry of arXived papers of last week (414!), including a fair chunk of papers submitted to ICML 2019, I spotted one entry by Cornish et al. on scalable Metropolis-Hastings, which Arnaud Doucet had mentioned to me yesterday when in Oxford. The paper builds on the delayed acceptance paper we wrote with Marco Banterlé, Clara Grazian and Anthony Lee, itself relying on a factorisation decomposition of the likelihood, combined with control variate accelerating techniques. The factorisation of both the target and the proposal allows for a (less efficient) Metropolis-Hastings acceptance ratio that is the product

$\prod_{i=1}^m \alpha_i(\theta,\theta')$

of individual Metropolis-Hastings acceptance ratios, but which allows for quicker rejection if one of the probabilities in the product is small, because the corresponding Bernoulli draw is zero with high probability. One advance made in Michel et al. (2017) [which I doubly missed] is that subsampling is achievable by thinning (as in PDMPs, where these authors have been quite active) through an algorithm of Shanthikumar (1985) [described in Devroye’s bible]. Provided each Metropolis-Hastings probability can be lower bounded:

$\alpha_i(\theta,\theta') \ge \exp\{-\psi_i \phi(\theta,\theta')\}$

by a term where the transition φ does not depend on the index i in the product. The computing cost of the thinning process thus depends on the efficiency of the subsampling, namely whether or not the (Poisson) number of terms is much smaller than m, the number of terms in the product. A neat trick in the current paper that extends the Fukui-Todo procedure is to switch to the original Metropolis-Hastings when the overall lower bound is too small, recovering the geometric ergodicity of this original if it holds (Theorem 2.1). Another neat remark is that when using the naïve factorisation as the product of the n individual likelihoods, the resulting algorithm is sort of doomed as n grows, even with an optimal scaling of the proposals. To achieve scalability, the authors introduce a Taylor (i.e., Gaussian) approximation to each local target in the product and start the acceptance decomposition by using the resulting overall Gaussian approximation. Meaning that the remaining product is now made of ratios of targets over their local Taylor approximations, hence most likely close to one. And potentially lower-bounded by the remainder term in the Taylor expansion. Leading to the conclusion that, when everything goes well, meaning that the Taylor expansions can be conducted and the bounds derived for the appropriate expansion, the order of the Poisson scale is O(1/√n)..! The proposal for the Metropolis-Hastings move is actually tuned to the Gaussian approximation, appearing as a variant of the Langevin move or more exactly a discretization of a Hamiltonian move. Obviously, I cannot judge the complexity of implementing this new scheme from just reading the paper, but this development on the split target is definitely an exciting prospect for handling huge datasets and their friends!
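To make the two acceptance mechanisms above concrete, here is a minimal Python sketch, not the authors' code. It assumes the individual probabilities α_i are available through a hypothetical callable `alpha_fn`, that the bounds ψ_i (`psis`) and φ (`phi`) are given, and that every α_i is strictly positive. The first function is the plain factorised acceptance with early rejection; the second gets the same acceptance probability by thinning a Poisson process, so that only a Poisson(φ Σψ_i) number of factors gets evaluated.

```python
import math
import random


def factorised_accept(alphas, rng):
    """Accept iff every factor's Bernoulli draw succeeds; stop at the first failure."""
    for a in alphas:
        if rng.random() >= a:
            return False  # early rejection, remaining factors are never examined
    return True


def thinned_accept(alpha_fn, psis, phi, rng):
    """Accept with probability prod_i alpha_fn(i), evaluating few factors.

    Assumes alpha_fn(i) >= exp(-psis[i] * phi) for every i (the lower bound
    in the text). alpha_fn, psis and phi are illustrative names only.
    """
    total = phi * sum(psis)
    # Draw N ~ Poisson(total) by counting unit-rate exponential arrivals.
    n_events, clock = 0, rng.expovariate(1.0)
    while clock < total:
        n_events += 1
        clock += rng.expovariate(1.0)
    for _ in range(n_events):
        # Pick a factor with probability proportional to its bound psi_i.
        i = rng.choices(range(len(psis)), weights=psis)[0]
        lam = -math.log(alpha_fn(i))  # true rate, at most psis[i] * phi
        # Keep the candidate event with probability (true rate) / (bound).
        if rng.random() < lam / (psis[i] * phi):
            return False  # one surviving event means rejection
    return True  # no surviving event: happens with probability prod_i alpha_i
```

The thinned version accepts when the thinned process, of total rate Σ(−log α_i), has no point, which occurs with probability exp{Σ log α_i}, i.e. the product of the α_i. Its cost is driven by φ Σψ_i rather than by m, which is the whole point of seeking tight bounds.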

## parameter space for mixture models

Posted in Statistics, University life on March 24, 2017 by xi'an

“The paper defines a new solution to the problem of defining a suitable parameter space for mixture models.”

When I received the table of contents of the incoming Statistics & Computing and saw a paper by V. Maroufy and P. Marriott about the above, I was quite excited about a new approach to mixture parameterisation. Especially after our recent reposting of the weakly informative reparameterisation paper. Alas, after reading the paper, I fail to see the (statistical) point of the whole exercise.

Starting from the basic fact that mixtures face many identifiability issues, not only invariance by component permutation, but the possibility of adding spurious components as well, the authors move to an entirely different galaxy by defining mixtures of so-called local mixtures. Developed by one of the authors. The notion is just incomprehensible to me: the object is a weighted sum of the basic component of the original mixture, e.g., a Normal density, and of k of its derivatives wrt its mean, a sort of parameterised Taylor expansion. Which implies the parameter is unidimensional, incidentally. The weights of this strange mixture are furthermore constrained by the positivity of the resulting mixture, a constraint that seems impossible to satisfy in the Normal case when the number of derivatives is odd. And hard to analyse in any case since possibly negative components do not enjoy an interpretation as a probability density. In exponential families, the local mixture is the original exponential family density multiplied by a polynomial. The current paper moves one step further [from the reasonable] by considering mixtures [in the standard sense] of such objects. Whose components are parameterised by their mean parameter and a collection of weights. The authors then restrict the mean parameters to belong to a finite and fixed set, whose elements are constrained by a maximum error rate on any compound distribution derived from this exponential family structure. The remainder of the paper discusses the choice of the mean parameters and an EM algorithm to estimate the parameters, with a confusing lower bound on the mixture weights that impacts the estimation of the weights. And no mention made of the positivity constraint. I remain completely bemused by the paper and its purpose: I do not even fathom how this qualifies as a mixture.
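The positivity issue can be checked numerically in the simplest setting. The sketch below is my own illustration, not taken from the paper: it assumes a unit-variance Normal component and hypothetical weights `lambdas` on the successive derivatives. For that component, the j-th derivative in μ of φ(x−μ) equals He_j(x−μ) φ(x−μ), with He_j the probabilists' Hermite polynomial, so the local mixture is the Normal density times the polynomial 1 + Σ λ_j He_j(x−μ). When the highest order is odd, this polynomial has odd degree and must go negative somewhere.

```python
import math


def hermite_e(n, z):
    """Probabilists' Hermite polynomial He_n(z), via He_{k+1} = z He_k - k He_{k-1}."""
    prev, cur = 1.0, z
    if n == 0:
        return prev
    for k in range(1, n):
        prev, cur = cur, z * cur - k * prev
    return cur


def local_mixture(x, mu, lambdas):
    """phi(x - mu) * (1 + sum_j lambdas[j-1] * He_j(x - mu)), unit variance assumed."""
    z = x - mu
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    poly = 1.0 + sum(lam * hermite_e(j, z) for j, lam in enumerate(lambdas, start=1))
    return phi * poly


def min_on_grid(mu, lambdas, half_width=30.0, steps=6001):
    """Smallest value of the local mixture over an evenly spaced grid around mu."""
    grid = (mu - half_width + 2.0 * half_width * i / (steps - 1) for i in range(steps))
    return min(local_mixture(x, mu, lambdas) for x in grid)


# One derivative (odd order): negative far in the left tail, where 1 + 0.1 z < 0.
print(min_on_grid(0.0, [0.1]) < 0.0)
# Two derivatives with weights (0, 0.1): the polynomial is 0.9 + 0.1 z^2, positive.
print(min_on_grid(0.0, [0.0, 0.1]) >= 0.0)
```

The negative values sit where the Normal density is astronomically small, but they are negative nonetheless, so the object fails to be a density for any non-zero weight on an odd highest-order term.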

## communication-efficient distributed statistical learning

Posted in Books, Statistics, University life on June 10, 2016 by xi'an

Michael Jordan, Jason Lee, and Yun Yang just arXived a paper with their proposal on handling large datasets through distributed computing, thus contributing to the currently very active research topic of approximate solutions in large Bayesian models. The core of the proposal is summarised by the screenshot above, where the approximate likelihood replaces the exact likelihood with a first order Taylor expansion. The first term is the likelihood computed for a given subsample (or a given thread) at a ratio of one to N and the difference of the gradients is only computed once at a good enough guess. While the paper also considers M-estimators and non-Bayesian settings, the Bayesian part thus consists in running a regular MCMC when the log-target is approximated by the above. I first thought this proposal amounted to a Gaussian approximation à la Simon Wood or to an INLA approach but this is not the case: the first term of the approximate likelihood is exact and hence can be of any form, while the scalar product is linear in θ, providing a sort of first order approximation, albeit frozen at the chosen starting value.
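As a toy illustration of this construction, here is a short Python sketch under my own assumptions: I read the description as the surrogate L̃(θ) = L₁(θ) − ⟨θ, ∇L₁(θ̄) − ∇L_N(θ̄)⟩, with L₁ the loss on the first block, L_N the loss on the whole dataset, and θ̄ the initial guess. The loss is a one-dimensional quadratic, blocks have equal sizes, and all function names are hypothetical.

```python
def block_grad(theta, block):
    """Gradient in theta of the average loss 0.5 * (x - theta)^2 over one block."""
    return sum(theta - x for x in block) / len(block)


def surrogate_grad(theta, theta_bar, first_block, blocks):
    """Gradient of the surrogate L_1(theta) - theta * (grad L_1 - grad L_N)(theta_bar).

    Equal block sizes are assumed, so the global gradient is the plain
    average of the block gradients. The correction is computed once, at
    theta_bar, and then stays frozen.
    """
    global_grad = sum(block_grad(theta_bar, b) for b in blocks) / len(blocks)
    correction = block_grad(theta_bar, first_block) - global_grad
    return block_grad(theta, first_block) - correction


blocks = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
theta_bar = 0.0

# At the initial guess, the surrogate gradient matches the full-data gradient.
print(surrogate_grad(theta_bar, theta_bar, blocks[0], blocks))
# The gradient vanishes at the overall mean, here 5.
print(surrogate_grad(5.0, theta_bar, blocks[0], blocks))
```

In this quadratic case the surrogate minimiser coincides with the full-data estimate whatever θ̄ is, which is an artefact of the toy loss: in general the linear correction only matches the full gradient at θ̄, hence the importance of a good enough starting value.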

Assuming that each block of the dataset is stored on a separate machine, I think the approach could further be implemented in parallel, running N MCMC chains and comparing the output. With a post-simulation summary stemming from the N empirical distributions thus produced. I also wonder how the method would perform outside the fairly smooth logistic regression case, where the single sample captures the target well enough. The picture above shows a minor gain in a misclassification rate that is already essentially zero.