## delayed but published!

Posted in Statistics with tags , , , , , , on June 20, 2019 by xi'an

## MCMC importance samplers for intractable likelihoods

Posted in Books, pictures, Statistics with tags , , , , , , , , , , , on May 3, 2019 by xi'an

Jordan Franks just posted on arXiv his PhD dissertation at the University of Jyväskylä, where he discuses several of his works:

1. M. Vihola, J. Helske, and J. Franks. Importance sampling type estimators based on approximate marginal MCMC. Preprint arXiv:1609.02541v5, 2016.
2. J. Franks and M. Vihola. Importance sampling correction versus standard averages of reversible MCMCs in terms of the asymptotic variance. Preprint arXiv:1706.09873v4, 2017.
3. J. Franks, A. Jasra, K. J. H. Law and M. Vihola.Unbiased inference for discretely observed hidden Markov model diffusions. Preprint arXiv:1807.10259v4, 2018.
4. M. Vihola and J. Franks. On the use of ABC-MCMC with inflated tolerance and post-correction. Preprint arXiv:1902.00412, 2019

focusing on accelerated approximate MCMC (in the sense of pseudo-marginal MCMC) and delayed acceptance (as in our recently accepted paper). Comparing delayed acceptance with MCMC importance sampling to the advantage of the later. And discussing the choice of the tolerance sequence for ABC-MCMC. (Although I did not get from the thesis itself the target of the improvement discussed.)

## optimal choice among MCMC kernels

Posted in Statistics with tags , , , , , , , , , , on March 14, 2019 by xi'an

Last week in Siem Reap, Florian Maire [who I discovered originates from a Norman town less than 10km from my hometown!] presented an arXived joint work with Pierre Vandekerkhove at the Data Science & Finance conference in Cambodia that considers the following problem: Given a large collection of MCMC kernels, how to pick the best one and how to define what best means. Going by mixtures is a default exploration of the collection, as shown in (Tierney) 1994 for instance since this improves on both kernels (esp. when each kernel is not irreducible on its own!). This paper considers a move to local weights in the mixture, weights that are not estimated from earlier simulations, contrary to what I first understood.

As made clearer in the paper the focus is on filamentary distributions that are concentrated nearby lower-dimension sets or manifolds Since then the components of the kernel collections can be restricted to directions of these manifolds… Including an interesting case of a 2-D highly peaked target where converging means mostly simulating in x¹ and covering the target means mostly simulating in x². Exhibiting a schizophrenic tension between the two goals. Weight locally dependent means correction by Metropolis step, with cost O(n). What of Rao-Blackwellisation of these mixture weights, from weight x transition to full mixture, as in our PMC paper? Unclear to me as well [during the talk] is the use in the mixture of basic Metropolis kernels, which are not absolutely continuous, because of the Dirac mass component. But this is clarified by Section 5 in the paper. A surprising result from the paper (Corollary 1) is that the use of local weights ω(i,x) that depend on the current value of the chain does jeopardize the stationary measure π(.) of the mixture chain. Which may be due to the fact that all components of the mixture are already π-invariant. Or that the index of the kernel constitutes an auxiliary (if ancillary)  variate. (Algorithm 1 in the paper reminds me of delayed acceptance. Making me wonder if computing time should be accounted for.) A final question I briefly discussed with Florian is the extension to weights that are automatically constructed from the simulations and the target.

## scalable Metropolis-Hastings

Posted in Books, Statistics, Travel with tags , , , , , , , , , on February 12, 2019 by xi'an

Among the flury of arXived papers of last week (414!), including a fair chunk of papers submitted to ICML 2019, I spotted one entry by Cornish et al. on scalable Metropolis-Hastings, which Arnaud Doucet had mentioned to me yesterday when in Oxford. The paper builds on the delayed acceptance paper we wrote with Marco Banterlé, Clara Grazian and Anthony Lee, itself relying on a factorisation decomposition of the likelihood, combined with control variate accelerating techniques. The factorisation of both the target and the proposal allows for a (less efficient) Metropolis-Hastings acceptance ratio that is the product

$\prod_{i=1}^m \alpha_i(\theta,\theta')$

of individual Metropolis-Hastings acceptance ratios, but which allows for quicker rejection if one of the probabilities in the product is small, because the corresponding Bernoulli draw is zero with high probability. One advance made in Michel et al. (2017) [which I doubly missed] is that subsampling is achievable by thinning (as in PDMPs, where these authors have been quite active) through an algorithm of Shantikumar (1985) [described in Devroye’s bible]. Provided each Metropolis-Hastings probability can be lower bounded:

$\alpha_i(\theta,\theta') \ge \exp\{-\psi_i \phi(\theta,\theta')\}$

by a term where the transition φ does not depend on the index i in the product. The computing cost of the thinning process thus depends on the efficiency of the subsampling, namely whether or not the (Poisson) number of terms is much smaller than m, number of terms in the product. A neat trick in the current paper that extends the the Fukui-Todo procedure is to switch to the original Metropolis-Hastings when the overall lower bound is too small, recovering the geometric ergodicity of this original if it holds (Theorem 2.1). Another neat remark is that when using the naïve factorisation as the product of the n individual likelihoods, the resulting algorithm is sort of doomed as n grows, even with an optimal scaling of the proposals. To achieve scalability, the authors introduce a Taylor (i.e., Gaussian) approximation to each local target in the product and start the acceptance decomposition by using the resulting overall Gaussian approximation. Meaning that the remaining product is now made of ratios of targets over their local Taylor approximations, hence most likely close to one. And potentially lower-bounded by the remainder term in the Taylor expansion. Leading to the conclusion that, when everything goes well, meaning that the Taylor expansions can be conducted and the bounds derived for the appropriate expansion, the order of the Poisson scale is O(1/√n)..! The proposal for the Metropolis-Hastings move is actually tuned to the Gaussian approximation, appearing as a variant of the Langevin move or more exactly a discretization of an Hamiltonian move. Obviously, I cannot judge of the complexity in implementing this new scheme from just reading the paper, but this development on the split target is definitely an exciting prospect for handling huge datasets and their friends!

## IMS workshop [day 5]

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , on September 3, 2018 by xi'an

The last day of the starting workshop [and my last day in Singapore] was a day of importance [sampling] with talks by Matti Vihola opposing importance sampling and delayed acceptance and particle MCMC, related to several papers of his that I missed. To be continued in the coming weeks at the IMS, which is another reason to regret having to leave that early [as my Parisian semester starts this Monday with an undergrad class at 8:30!]

And then a talk by Joaquín Miguez on stabilizing importance sampling by truncation which reminded me very much of the later work by Andrew Gelman and Aki Vehtari on Pareto smoothed importance sampling, with further operators adapted to sequential settings and the similar drawback that when the importance sampler is poor, i.e., when the simulated points are all very far from the centre of mass, no amount of fudging with the weights will bring the points closer. AMIS made an appearance as a reference method, to be improved by this truncation of the weights, a wee bit surprising as it should bring the large weights of the earlier stages down.

Followed by an almost silent talk by Nick Whiteley, who having lost his voice to the air conditioning whispered his talk in the microphone. Having once faced a lost voice during an introductory lecture to a large undergraduate audience, I could not but completely commiserate for the hardship of the task. Although this made the audience most silent and attentive. His topic was the Viterbi process and its parallelisation, by using a truncated horizon (presenting connection with overdamped Langevin, eg Durmus and Moulines and Dalalyan).

And due to a pressing appointment with my son and his girlfriend [who were traveling through Singapore on that day] for a chili crab dinner on my way to the airport, I missed the final talk by Arnaud Doucet, where he was to reconsider PDMP algorithms without the continuous time layer, a perspective I find most appealing!

Overall, this was a quite diverse and rich [starting] seminar, backed by the superb organisation of the IMS and the smooth living conditions on the NUS campus [once I had mastered the bus routes], which would have made much more sense for me as part of a longer stay, which is actually what happened the previous time I visited the IMS (in 2005), again clashing with my course schedule at home… And as always, I am impressed with the city-state of Singapore, for the highly diverse food scene in particular, but also this [maybe illusory] impression of coexistence between communities. And even though the ecological footprint could certainly be decreased, measures to curb car ownership (with a 150% purchase tax) and use (with congestion charges).

## MCMC with multiple tries

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , on April 5, 2018 by xi'an

Earlier this year, Luca Martino wrote and arXived a review on multiple try MCMC. As its name suggests, the starting point of this algorithm is to propose N potential moves simultaneously instead of one, possibly according to N different proposal (conditional) densities, and to select one by a normalised importance sampling weight. The move is accepted by a Metropolis-Hastings step based on the ratio of the normalisation constants [at the current and at the one-before-current stages]. Besides the cost of computing the summation and generating the different variates, this method also faces the drawback of requiring N-1 supplementary simulations that are only used for achieving detailed balance and computing a backward summation of importance weights. (A first section of the review is dedicated to independent Metropolis-Hastings proposals, q(θ), which make life simpler, but are less realistic in my opinion since some prior knowledge or experimentation is necessary to build a relevant distribution q(θ).) An alternative covered in the survey is ensemble Monte Carlo (Neal, 2011), which produces a whole sample at each iteration, with target the product of the initial targets. This reminded me of our pinball sampler, which aimed at producing a spread-out sample while keeping the marginal correct. Although the motivation sounds closer to a particle sampler. Especially with this associated notion of an empirical approximation of the target. The next part of the review is about delayed rejection, which is a natural alternative approach to speeding up MCMC by considering several possibilities, if sequentially. Started in Antonietta Mira‘s 1999 PhD thesis. The difficulty with this approach is that the acceptance probability gets increasingly complex as the number of delays grows, which may annihilate its appeal relative to simultaneous multiple tries.

## delayed acceptance ABC-SMC

Posted in pictures, Statistics, Travel with tags , , , , , , , on December 11, 2017 by xi'an

Last summer, during my vacation on Skye,  Richard Everitt and Paulina Rowińska arXived a paper on delayed acceptance associated with ABC. ArXival that I missed, then! In order to decrease the number of simulations from the likelihood. As in our own delayed acceptance paper (without ABC), a cheap alternative generator is used to first reject the least likely parameters values, before possibly continuing to use a full generator. Also as lazy ABC. The first step of this ABC algorithm requires a cheap generator plus a primary tolerance ε¹ to compare the generation with the data or part of it. This may be followed by a second generation with a second tolerance level ε². The paper applies more specifically ABC-SMC as introduced in Sisson, Fan and Tanaka (2007) and reassessed in our subsequent 2009 Biometrika paper with Mark Beaumont, Jean-Marie Cornuet and Jean-Michel Marin. As well as in the ABC-SMC paper by Pierre Del Moral and Arnaud Doucet.

When looking at the version of the algorithm [Algorithm 2] based on two basic acceptance ABC steps, there are two features I find intriguing: (i) the primary step uses a cheap generator to reject early poor values of the parameter, followed by the second step involving a more expensive and exact generator, but I see no impact of the choice of this cheap generator in the acceptance probability; (ii) this is an SMC algorithm with imposed resampling at each iteration but there is no visible step for creating new weights after the resampling step. In the current presentation, it sounds like the weights do not change from the initial step, except for those turning to zero and the renormalisation transforms. Which makes the (unspecified) stratification of little interest if any. I must therefore miss a point in the implementation!

One puzzling sentence in the appendix is that the resampling algorithm used in the SMC step “ensures that every particle that is alive before resampling is represented in the resampled particles”, which reminds me of an argument [possibly a different one] made already in Sisson, Fan and Tanaka (2007) and that we could not validate in our subsequent paper. For resampling to be correct, a form of multinomial sampling must be implemented, even via variance reduction schemes like stratified or systematic sampling.