## ABC-SAEM

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , , , on October 8, 2019 by xi'an

In connection with the recent PhD thesis defence of Juliette Chevallier, in which I took a somewhat virtual part for being physically in Warwick, I read a paper she wrote with Stéphanie Allassonnière on stochastic approximation versions of the EM algorithm. Computing the MAP estimator can be done via some adapted for simulated annealing versions of EM, possibly using MCMC as for instance in the Monolix software and its MCMC-SAEM algorithm. Where SA stands sometimes for stochastic approximation and sometimes for simulated annealing, originally developed by Gilles Celeux and Jean Diebolt, then reframed by Marc Lavielle and Eric Moulines [friends and coauthors]. With an MCMC step because the simulation of the latent variables involves an untractable normalising constant. (Contrary to this paper, Umberto Picchini and Adeline Samson proposed in 2015 a genuine ABC version of this approach, paper that I thought I missed—although I now remember discussing it with Adeline at JSM in Seattle—, ABC is used as a substitute for the conditional distribution of the latent variables given data and parameter. To be used as a substitute for the Q step of the (SA)EM algorithm. One more approximation step and one more simulation step and we would reach a form of ABC-Gibbs!) In this version, there are very few assumptions made on the approximation sequence, except that it converges with the iteration index to the true distribution (for a fixed observed sample) if convergence of ABC-SAEM is to happen. The paper takes as an illustrative sequence a collection of tempered versions of the true conditionals, but this is quite formal as I cannot fathom a feasible simulation from the tempered version and not from the untempered one. It is thus much more a version of tempered SAEM than truly connected with ABC (although a genuine ABC-EM version could be envisioned).

## MCMC importance samplers for intractable likelihoods

Posted in Books, pictures, Statistics with tags , , , , , , , , , , , on May 3, 2019 by xi'an

Jordan Franks just posted on arXiv his PhD dissertation at the University of Jyväskylä, where he discuses several of his works:

1. M. Vihola, J. Helske, and J. Franks. Importance sampling type estimators based on approximate marginal MCMC. Preprint arXiv:1609.02541v5, 2016.
2. J. Franks and M. Vihola. Importance sampling correction versus standard averages of reversible MCMCs in terms of the asymptotic variance. Preprint arXiv:1706.09873v4, 2017.
3. J. Franks, A. Jasra, K. J. H. Law and M. Vihola.Unbiased inference for discretely observed hidden Markov model diffusions. Preprint arXiv:1807.10259v4, 2018.
4. M. Vihola and J. Franks. On the use of ABC-MCMC with inflated tolerance and post-correction. Preprint arXiv:1902.00412, 2019

focusing on accelerated approximate MCMC (in the sense of pseudo-marginal MCMC) and delayed acceptance (as in our recently accepted paper). Comparing delayed acceptance with MCMC importance sampling to the advantage of the later. And discussing the choice of the tolerance sequence for ABC-MCMC. (Although I did not get from the thesis itself the target of the improvement discussed.)

## Bayesian synthetic likelihood

Posted in Statistics with tags , , , , , , , on December 13, 2017 by xi'an

Leah Price, Chris Drovandi, Anthony Lee and David Nott published earlier this year a paper in JCGS on Bayesian synthetic likelihood, using Simon Wood’s synthetic likelihood as a substitute to the exact likelihood within a Bayesian approach. While not investigating the theoretical properties of this approximate approach, the paper compares it with ABC on some examples. In particular with respect to the number n of Monte Carlo replications used to approximate the mean and variance of the Gaussian synthetic likelihood.

Since this approach is most naturally associated with an MCMC implementation, it requires new simulations of the summary statistics at each iteration, without a clear possibility to involve parallel runs, in contrast to ABC. However in the final example of the paper, the authors reach values of n of several thousands, making use of multiple cores relevant, if requiring synchronicity and checks at every MCMC iteration.

The authors mention that “ABC can be viewed as a pseudo-marginal method”, but this has a limited appeal since the pseudo-marginal is a Monte Carlo substitute for the ABC target, not the original target. Similarly, there exists an unbiased estimator of the Gaussian density due to Ghurye and Olkin (1969) that allows to perceive the estimated synthetic likelihood version as a pseudo-marginal, once again wrt a target that differs from the original one. And the bias reappears under mis-specification, that is when the summary statistics are not normally distributed. It seems difficult to assess this normality or absence thereof in realistic situations.

“However, when the distribution of the summary statistic is highly irregular, the output of BSL cannot be trusted, while ABC represents a robust alternative in such cases.”

To make synthetic likelihood and ABC algorithms compatible, the authors chose a Normal kernel for ABC. Still, the equivalence is imperfect in that the covariance matrix need be chosen in the ABC case and is estimated in the synthetic one. I am also lost to the argument that the synthetic version is more efficient than ABC, in general (page 8). As for the examples, the first one uses a toy Poisson posterior with a single sufficient summary statistic, which is not very representative of complex situations where summary statistics are extremes or discrete. As acknowledged by the authors this is a case when the Normality assumption applies. For an integer support hidden process like the Ricker model, normality vanishes and the outcomes of ABC and synthetic likelihood differ, which makes it difficult to compare the inferential properties of both versions (rather than the acceptance rates), while using a 13-dimension statistic for estimating a 3-dimension parameter is not recommended for ABC, as discussed by Li and Fearnhead (2017). The same issue appears in the realistic cell motility example, with 145 summaries versus two parameters. (In the philogenies studied by DIYABC, the number of summary statistics is about the same but we now advocate a projection to the parameter dimension by the medium of random forests.)

Given the similarity between both approaches, I wonder at a confluence between them, where synthetic likelihood could maybe be used to devise PCA on the summary statistics and facilitate their projection on a space with much smaller dimensions. Or estimating the mean and variance functions in the synthetic likelihood towards producing directly simulations of the summary statistics.

## delayed acceptance ABC-SMC

Posted in pictures, Statistics, Travel with tags , , , , , , , on December 11, 2017 by xi'an

Last summer, during my vacation on Skye,  Richard Everitt and Paulina Rowińska arXived a paper on delayed acceptance associated with ABC. ArXival that I missed, then! In order to decrease the number of simulations from the likelihood. As in our own delayed acceptance paper (without ABC), a cheap alternative generator is used to first reject the least likely parameters values, before possibly continuing to use a full generator. Also as lazy ABC. The first step of this ABC algorithm requires a cheap generator plus a primary tolerance ε¹ to compare the generation with the data or part of it. This may be followed by a second generation with a second tolerance level ε². The paper applies more specifically ABC-SMC as introduced in Sisson, Fan and Tanaka (2007) and reassessed in our subsequent 2009 Biometrika paper with Mark Beaumont, Jean-Marie Cornuet and Jean-Michel Marin. As well as in the ABC-SMC paper by Pierre Del Moral and Arnaud Doucet.

When looking at the version of the algorithm [Algorithm 2] based on two basic acceptance ABC steps, there are two features I find intriguing: (i) the primary step uses a cheap generator to reject early poor values of the parameter, followed by the second step involving a more expensive and exact generator, but I see no impact of the choice of this cheap generator in the acceptance probability; (ii) this is an SMC algorithm with imposed resampling at each iteration but there is no visible step for creating new weights after the resampling step. In the current presentation, it sounds like the weights do not change from the initial step, except for those turning to zero and the renormalisation transforms. Which makes the (unspecified) stratification of little interest if any. I must therefore miss a point in the implementation!

One puzzling sentence in the appendix is that the resampling algorithm used in the SMC step “ensures that every particle that is alive before resampling is represented in the resampled particles”, which reminds me of an argument [possibly a different one] made already in Sisson, Fan and Tanaka (2007) and that we could not validate in our subsequent paper. For resampling to be correct, a form of multinomial sampling must be implemented, even via variance reduction schemes like stratified or systematic sampling.

## asymptotically exact inference in likelihood-free models [a reply from the authors]

Posted in R, Statistics with tags , , , , , , , , , , , , , , , , , on December 1, 2016 by xi'an

[Following my post of lastTuesday, Matt Graham commented on the paper with force détails. Here are those comments. A nicer HTML version of the Markdown reply below is also available on Github.]

Thanks for the comments on the paper!

A few additional replies to augment what Amos wrote:

This however sounds somewhat intense in that it involves a quasi-Newton resolution at each step.

The method is definitely computationally expensive. If the constraint function is of the form of a function from an M-dimensional space to an N-dimensional space, with MN, for large N the dominant costs at each timestep are usually the constraint Jacobian (c/u) evaluation (with reverse-mode automatic differentiation this can be evaluated at a cost of O(N) generator / constraint evaluations) and Cholesky decomposition of the Jacobian product (c/u)(c/u) with O(N³) cost (though in many cases e.g. i.i.d. or Markovian simulated data, structure in the generator Jacobian can be exploited to give a significantly reduced cost). Each inner Quasi-Newton update involves a pair of triangular solve operations which have a O(N²) cost, two matrix-vector multiplications with O(MN) cost, and a single constraint / generator function evaluation; the number of Quasi-Newton updates required for convergence in the numerical experiments tended to be much less than N hence the Quasi-Newton iteration tended not to be the main cost.

The high computation cost per update is traded off however with often being able to make much larger proposed moves in high-dimensional state spaces with a high chance of acceptance compared to ABC MCMC approaches. Even in the relatively small Lotka-Volterra example we provide which has an input dimension of 104 (four inputs which map to ‘parameters’, and 100 inputs which map to ‘noise’ variables), the ABC MCMC chains using the coarse ABC kernel radius ϵ=100 with comparably very cheap updates were significantly less efficient in terms of effective sample size / computation time than the proposed constrained HMC approach. This was in large part due to the elliptical slice sampling updates in the ABC MCMC chains generally collapsing down to very small moves even for this relatively coarse ϵ. Performance was even worse using non-adaptive ABC MCMC methods and for smaller ϵ, and for higher input dimensions (e.g. using a longer sequence with correspondingly more random inputs) the comparison becomes even more favourable for the constrained HMC approach. Continue reading

## a simulated annealing approach to Bayesian inference

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , on October 1, 2015 by xi'an

A misleading title if any! Carlos Albert arXived a paper with this title this morning and I rushed to read it. Because it sounded like Bayesian analysis could be expressed as a special form of simulated annealing. But it happens to be a rather technical sequel [“that complies with physics standards”] to another paper I had missed, A simulated annealing approach to ABC, by Carlos Albert, Hans Künsch, and Andreas Scheidegger. Paper that appeared in Statistics and Computing last year, and which is most interesting!

“These update steps are associated with a flow of entropy from the system (the ensemble of particles in the product space of parameters and outputs) to the environment. Part of this flow is due to the decrease of entropy in the system when it transforms from the prior to the posterior state and constitutes the well-invested part of computation. Since the process happens in finite time, inevitably, additional entropy is produced. This entropy production is used as a measure of the wasted computation and minimized, as previously suggested for adaptive simulated annealing” (p.3)

The notion behind this simulated annealing intrusion into the ABC world is that the choice of the tolerance can be adapted along iterations according to a simulated annealing schedule. Both papers make use of thermodynamics notions that are completely foreign to me, like endoreversibility, but aim at minimising the “entropy production of the system, which is a measure for the waste of computation”. The central innovation is to introduce an augmented target on (θ,x) that is

f(x|θ)π(θ)exp{-ρ(x,y)/ε},

where ε is the tolerance, while ρ(x,y) is a measure of distance to the actual observations, and to treat ε as an annealing temperature. In an ABC-MCMC implementation, the acceptance probability of a random walk proposal (θ’,x’) is then

exp{ρ(x,y)/ε-ρ(x’,y)/ε}∧1.

Under some regularity constraints, the sequence of targets converges to

π(θ|y)exp{-ρ(x,y)},

if ε decreases slowly enough to zero. While the representation of ABC-MCMC through kernels other than the Heaviside function can be found in the earlier ABC literature, the embedding of tolerance updating within the modern theory of simulated annealing is rather exciting.

Furthermore, we will present an adaptive schedule that attempts convergence to the correct posterior while minimizing the required simulations from the likelihood. Both the jump distribution in parameter space and the tolerance are adapted using mean fields of the ensemble.” (p.2)

What I cannot infer from a rather quick perusal of the papers is whether or not the implementation gets into the way of the all-inclusive theory. For instance, how can the Markov chain keep moving as the tolerance gets to zero? Even with a particle population and a sequential Monte Carlo implementation, it is unclear why the proposal scale factor [as in equation (34)] does not collapse to zero in order to ensure a non-zero acceptance rate. In the published paper, the authors used the same toy mixture example as ours [from Sisson et al., 2007], where we earned the award of the “incredibly ugly squalid picture”, with improvements in the effective sample size, but this remains a toy example. (Hopefully a post to be continued in more depth…)

## ergodicity of approximate MCMC chains with applications to large datasets

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , on August 31, 2015 by xi'an

Another arXived paper I read on my way to Warwick! And yet another paper written by my friend Natesh Pillai (and his co-author Aaron Smith, from Ottawa). The goal of the paper is to study the ergodicity and the degree of approximation of the true posterior distribution of approximate MCMC algorithms that recently flourished as an answer to “Big Data” issues… [Comments below are about the second version of this paper.] One of the most curious results in the paper is the fact that the approximation may prove better than the original kernel, in terms of computing costs! If asymptotically in the computing cost. There also are acknowledged connections with the approximative MCMC kernel of Pierre Alquier, Neal Friel, Richard Everitt and A Boland, briefly mentioned in an earlier post.

The paper starts with a fairly theoretical part, to follow with an application to austerity sampling [and, in the earlier version of the paper, to the Hoeffding bounds of Bardenet et al., both discussed earlier on the ‘Og, to exponential random graphs (the paper being rather terse on the description of the subsampling mechanism), to stochastic gradient Langevin dynamics (by Max Welling and Yee-Whye Teh), and to ABC-MCMC]. The assumptions are about the transition kernels of a reference Markov kernel and of one associated with the approximation, imposing some bounds on the Wasserstein distance between those kernels, K and K’. Results being generic, there is no constraint as to how K is chosen or on how K’ is derived from K. Except in Lemma 3.6 and in the application section, where the same proposal kernel L is used for both Metropolis-Hastings algorithms K and K’. While I understand this makes for an easier coupling of the kernels, this also sounds like a restriction to me in that modifying the target begs for a similar modification in the proposal, if only because the tails they are a-changin’

In the case of subsampling the likelihood to gain computation time (as discussed by Korattikara et al. and by Bardenet et al.), the austerity algorithm as described in Algorithm 2 is surprising as the average of the sampled data log-densities and the log-transform of the remainder of the Metropolis-Hastings probability, which seem unrelated, are compared until they are close enough.  I also find hard to derive from the different approximation theorems bounding exceedance probabilities a rule to decide on the subsampling rate as a function of the overall sample size and of the computing cost. (As a side if general remark, I remain somewhat reserved about the subsampling idea, given that it requires the entire dataset to be available at every iteration. This makes parallel implementations rather difficult to contemplate.)