## Jana de Wiljes’ colloquium at Warwick

Posted in Statistics with tags , , , , , , on February 25, 2020 by xi'an

## scalable Langevin exact algorithm

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , , , on October 18, 2016 by xi'an

“By employing a modification to existing naïve subsampling techniques we can obtain an algorithm which is still exact but has sub-linear iterative cost as a function of data size.”

A few weeks ago Murray Pollock, Paul Fearnhead, Adam Johansen and Gareth Roberts (all from Warwick except for Paul) arXived a paper The Scalable Langevin Exact Algorithm: Bayesian Inference for Big Data. (This was also the topic of Murray’s talk last year at JSM in Seattle.) One major advance found in the paper is the derivation of an “exact” algorithm that is sub-linear in the data size. As discussed in the introduction, the current approaches to large data problems either suffer from being approximate (like divide-and-conquer methods) or do not achieve significant reduction in the computing time, being of order O(n). The authors mention Teh and Welling (2011) sand their tochastic gradient approximation to the Langevin diffusion, when the gradient is based on a subsample. Without the Metropolis correction that would ensure an exact target but at a cost of order O(n). (Which makes the technique rather difficult to recommend.)

A novel [for me] notion at the core of this paper is the concept of quasi-stationary distribution, which is the limiting distribution of a Markov chain X[t] conditional on a Markov stopping time [being larger than t]. The approach is based on diffusions with appropriate stationary distributions like the Langevin diffusion. (Actually, as in most papers I have read and remember, the current paper only considers the Langevin diffusion.) In order to avoid the issues with unadjusted and Metropolis-adjusted Langevin schemes, a killed Brownian motion is created, which means a Brownian motion conditional of being alive till time T when the instantaneous killing rate is a given function of the chain, Φ(X[t]), related with the stationary measure of the Langevin diffusion ν. Under appropriate conditions, the density of this killed Brownian motion converges [in T] to √ν. Which immediately hints at creating a new Langevin diffusion targeting ν² instead of ν. And killing it with the proper rate, which can be done by thinning a Poisson process. Simulating the original path can be done by path-space rejection sampling, following the technique set by Gareth Roberts and co-authors more than ten years ago. Based on finite dimensional realisations of the path on [0,T]. And including the killing part can be done by importance sampling and checking that the simulated killing time is larger than the current (exponentially simulated) time.

One practical difficulty in the implementation of this neat principle is the derivation of the normalising constant, which evaluation degrades with the time horizon T. The solution adopted in the paper is through a sequential Monte Carlo method, using another discretisation of the time interval [0,T] (relying on the original one would get too costly?). As for subsampling, since the survival probability for the Brownian motion is based on an unbiased estimator, subsampling does not hurt if conducted in a random manner. Although this increases the variance on principle, the use of a control variate computed just once helps in reducing the complexity to O(1).

This is a tough paper and I have not gone through the effort of trying to implement it, but this is an original and innovative construct I would like to monitor in further details on a toy example, maybe next week while in Warwick. Or at least to discuss it with the authors.

## exact, unbiased, what else?!

Posted in Books, Statistics, University life with tags , , , , , , , , on April 13, 2016 by xi'an

Last week, Matias Quiroz, Mattias Villani, and Robert Kohn arXived a paper on exact subsampling MCMC, a paper that contributes to the current literature on approximating MCMC samplers for large datasets, in connection with an earlier paper of Quiroz et al. discussed here last week.

The “exact” in the title is to be understood in the Russian roulette sense. By using Rhee and Glynn debiaising device, the authors achieve an unbiased estimator of the likelihood as in Bardenet et al. (2015). The central tool for the derivation of an unbiased and positive estimator is to find a control variate for each component of the log likelihood that is good enough for the difference between the component and the control to be lower bounded. By the constant a in the screen capture above. When the individual terms d in the product are iid unbiased estimates of the log likelihood difference. And q is the sum of the control variates. Or maybe more accurately of the cheap substitutes to the exact log likelihood components. Thus still of complexity O(n), which makes the application to tall data more difficult to contemplate.

The \$64 question is obviously how to produce cheap and efficient control variates that kill the curse of the tall data. (It still irks to resort to this term of control variate, really!) Section 3.2 in the paper suggests clustering the data and building an approximation for each cluster, which seems to imply manipulating the whole dataset at this early stage. At a cost of O(Knd). Furthermore, because finding a correct lower bound a is close to impossible in practice, the authors use a “soft lower bound”, meaning that it is only an approximation and thus that (3.4) above can get negative from time to time, which cancels the validation of the method as a pseudo-marginal approach. The resolution of this difficulty is to resort to the same proxy as in the Russian roulette paper, replacing the unbiased estimator with its absolute value, an answer I already discussed for the Russian roulette paper. An additional step is proposed by Quiroz et al., namely correlating the random numbers between numerator and denominator in their final importance sampling estimator, via a Gaussian copula as in Deligiannidis et al.

This paper made me wonder (idly wonder, mind!) anew how to get rid of the vexing unbiasedness requirement. From a statistical and especially from a Bayesian perspective, unbiasedness is a second order property that cannot be achieved for most transforms of the parameter θ. And that does not keep under reparameterisation. It is thus vexing and perplexing that unbiased is so central to the validation of our Monte Carlo technique and that any divergence from this canon leaves us wandering blindly with no guarantee of ever reaching the target of the simulation experiment…

## control functionals for Monte Carlo integration

Posted in Books, Statistics, University life with tags , , , , , on October 21, 2014 by xi'an

This new arXival by Chris Oates, Mark Girolami, and Nicolas Chopin (warning: they all are colleagues & friends of mine!, at least until they read those comments…) is a variation on control variates, but with a surprising twist namely that the inclusion of a control variate functional may produce a sub-root-n (i.e., faster than √n) convergence rate in the resulting estimator. Surprising as I did not know one could get to sub-root-n rates..! Now I had forgotten that Anne Philippe and I used the score in an earlier paper of ours, as a control variate for Riemann sum approximations, with faster convergence rates, but this is indeed a new twist, in particular because it produces an unbiased estimator.

The control variate writes

$\psi_\phi (x) = \nabla_x \cdot \phi(x) + \phi(x)\cdot \nabla \pi(x)$

where π is the target density and φ is a free function to be optimised. (Under the constraint that πφ is integrable. Then the expectation of ψφ is indeed zero.) The “explanation” for the sub-root-n behaviour is that ψφ is chosen as an L2 regression. When looking at the sub-root-n convergence proof, the explanation is more of a Rao-Blackwellisation type, assuming a first level convergent (or presistent) approximation to the integrand [of the above form ψφ can be found. The optimal φ is the solution of a differential equation that needs estimating and the paper concentrates on approximating strategies. This connects with Antonietta Mira’s zero variance control variates, but in a non-parametric manner, adopting a Gaussian process as the prior on the unknown φ. And this is where the huge innovation in the paper resides, I think, i.e. in assuming a Gaussian process prior on the control functional and in managing to preserve unbiasedness. As in many of its implementations, modelling by Gaussian processes offers nice features, like ψφ being itself a Gaussian process. Except that it cannot be shown to lead to presistency on a theoretical basis. Even though it appears to hold in the examples of the paper. Apart from this theoretical difficulty, the potential hardship with the method seems to be in the implementation, as there are several parameters and functionals to be calibrated, hence calling for cross-validation which may often be time-consuming. The gains are humongous, so the method should be adopted whenever the added cost in implementing it is reasonable, cost which evaluation is not clearly provided by the paper. In the toy Gaussian example where everything can be computed, I am surprised at the relatively poor performance of a Riemann sum approximation to the integral, wondering at the level of quadrature involved therein. The paper also interestingly connects with O’Hagan’s (1991) Bayes-Hermite [polynomials] quadrature and quasi-Monte Carlo [obviously!].

## a week in Warwick

Posted in Books, Kids, Running, Statistics, University life with tags , , , , , , , , , , , , on October 19, 2014 by xi'an

This past week in Warwick has been quite enjoyable and profitable, from staying once again in a math house, to taking advantage of the new bike, to having several long discussions on several prospective and exciting projects, to meeting with some of the new postdocs and visitors, to attending Tony O’Hagan’s talk on “wrong models”. And then having Simo Särkkä who was visiting Warwick this week discussing his paper with me. And Chris Oates doing the same with his recent arXival with Mark Girolami and Nicolas Chopin (soon to be commented, of course!). And managing to run in dry conditions despite the heavy rains (but in pitch dark as sunrise is now quite late, with the help of a headlamp and the beauty of a countryside starry sky). I also evaluated several students’ projects, two of which led me to wonder when using RJMCMC was appropriate in comparing two models. In addition, I also eloped one evening to visit old (1977!) friends in Northern Birmingham, despite fairly dire London Midlands performances between Coventry and Birmingham New Street, the only redeeming feature being that the connecting train there was also late by one hour! (Not mentioning the weirdest taxi-driver ever on my way back, trying to get my opinion on whether or not he should have an affair… which at least kept me awake the whole trip!) Definitely looking forward my next trip there at the end of November.