## Bayesian synthetic likelihood

Posted in Statistics with tags , , , , , , , on December 13, 2017 by xi'an

Leah Price, Chris Drovandi, Anthony Lee and David Nott published earlier this year a paper in JCGS on Bayesian synthetic likelihood, using Simon Wood’s synthetic likelihood as a substitute to the exact likelihood within a Bayesian approach. While not investigating the theoretical properties of this approximate approach, the paper compares it with ABC on some examples. In particular with respect to the number n of Monte Carlo replications used to approximate the mean and variance of the Gaussian synthetic likelihood.

Since this approach is most naturally associated with an MCMC implementation, it requires new simulations of the summary statistics at each iteration, without a clear possibility to involve parallel runs, in contrast to ABC. However in the final example of the paper, the authors reach values of n of several thousands, making use of multiple cores relevant, if requiring synchronicity and checks at every MCMC iteration.

The authors mention that “ABC can be viewed as a pseudo-marginal method”, but this has a limited appeal since the pseudo-marginal is a Monte Carlo substitute for the ABC target, not the original target. Similarly, there exists an unbiased estimator of the Gaussian density due to Ghurye and Olkin (1969) that allows to perceive the estimated synthetic likelihood version as a pseudo-marginal, once again wrt a target that differs from the original one. And the bias reappears under mis-specification, that is when the summary statistics are not normally distributed. It seems difficult to assess this normality or absence thereof in realistic situations.

“However, when the distribution of the summary statistic is highly irregular, the output of BSL cannot be trusted, while ABC represents a robust alternative in such cases.”

To make synthetic likelihood and ABC algorithms compatible, the authors chose a Normal kernel for ABC. Still, the equivalence is imperfect in that the covariance matrix need be chosen in the ABC case and is estimated in the synthetic one. I am also lost to the argument that the synthetic version is more efficient than ABC, in general (page 8). As for the examples, the first one uses a toy Poisson posterior with a single sufficient summary statistic, which is not very representative of complex situations where summary statistics are extremes or discrete. As acknowledged by the authors this is a case when the Normality assumption applies. For an integer support hidden process like the Ricker model, normality vanishes and the outcomes of ABC and synthetic likelihood differ, which makes it difficult to compare the inferential properties of both versions (rather than the acceptance rates), while using a 13-dimension statistic for estimating a 3-dimension parameter is not recommended for ABC, as discussed by Li and Fearnhead (2017). The same issue appears in the realistic cell motility example, with 145 summaries versus two parameters. (In the philogenies studied by DIYABC, the number of summary statistics is about the same but we now advocate a projection to the parameter dimension by the medium of random forests.)

Given the similarity between both approaches, I wonder at a confluence between them, where synthetic likelihood could maybe be used to devise PCA on the summary statistics and facilitate their projection on a space with much smaller dimensions. Or estimating the mean and variance functions in the synthetic likelihood towards producing directly simulations of the summary statistics.

## simple, scalable and accurate posterior interval estimation

Posted in Statistics with tags , , , , , , , on July 6, 2016 by xi'an

“There is a lack of simple and scalable algorithms for uncertainty quantification.”

A paper by Cheng Li , Sanvesh Srivastava, and David Dunson that I had missed and which was pointed out on Andrew’s blog two days ago. As recalled in the very first sentence of the paper, above, the existing scalable MCMC algorithms somewhat fail to account for confidence (credible) intervals. In the sense that handling parallel samples does not naturally produce credible intervals.Since the approach is limited to one-dimensional quantity of interest, ζ=h(θ), the authors of the paper consider the MCMC approximations of the cdf of the said quantity ζ based on the manageable subsets like as many different approximations of the same genuine posterior distribution of that quantity ζ. (Corrected by a power of the likelihood but dependent on the particular subset used for the estimation.) The estimate proposed in the paper is a Wasserstein barycentre of the available estimations, barycentre that is defined as minimising the sum of the Wasserstein distances to all estimates. (Why should this measure be relevant: the different estimates may be of different quality). Interestingly (at least at a computational level), the solution is such that the quantile function of the Wasserstein barycentre is the average of the estimated quantiles functions. (And is there an alternative loss returning the median cdf?) A confidence interval based on the quantile function can then be directly derived. The paper shows that this Wasserstein barycentre converges to the true (marginal) posterior as the sample size m of each sample grows to infinity (and faster than 1/√m), with the strange side-result that the convergence is in 1/√n when the MLE of the global parameter θ is unbiased. Strange to me because unbiasedness is highly dependent on parametrisation while the performances of this estimator should not be, i.e., should be invariant under reparameterisation. Maybe this is due to ζ being a linear transform of θ in the convergence theorem… In any case, I find this question of merging cdf’s from poorly defined approximations to an unknown cdf of the highest interest and look forward any further proposal to this effect!

## patterns of scalable Bayesian inference

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on February 24, 2016 by xi'an

Elaine Angelino, Matthew Johnson and Ryan Adams just arXived a massive survey of 118 pages on scalable Bayesian inference, which could have been entitled Bayes for Big Data, as this monograph covers state-of-the-art computational approaches to large and complex data structures. I did not read each and every line of it, but I have already recommended it to my PhD students. Some of its material unsurprisingly draws from the recent survey by Rémi Bardenet et al. (2015) I discussed a while ago. It also relates rather frequently to the somewhat parallel ICML paper of Korattikara et al. (2014). And to the firefly Monte Carlo procedure also discussed previously here.

Chapter 2 provides some standard background on computational techniques, Chapter 3 covers MCMC with data subsets, Chapter 4 gives some entries on MCMC with parallel and distributed architectures, Chapter 5 focus on variational solutions, and Chapter 6 is about open questions and challenges.

“Insisting on zero asymptotic bias from Monte Carlo estimates of expectations may leave us swamped in errors from high variance or transient bias.”

One central theme of the paper is the need for approximate solutions, MCMC being perceived as the exact solution. (Somewhat wrongly in the sense that the product of an MCMC is at best an empirical version of the true posterior, hence endowed with a residual and incompressible variation for a given computing budget.) While Chapter 3 stresses the issue of assessing the distance to the true posterior, it does not dwell at all on computing times and budget, which is arguably a much harder problem. Chapter 4 seems to be more aware of this issue since arguing that “a way to use parallel computing resources is to run multiple sequential MCMC algorithms at once [but that this] does not reduce the transient bias in MCMC estimates of posterior expectations” (p.54). The alternatives are to use either prefetching (which was the central theme of Elaine Angelino’s thesis), asynchronous Gibbs with the new to me (?) Hogwild Gibbs algorithms (connected in Terenin et al.’s recent paper, not quoted in the paper), some versions of consensus Monte Carlo covered in earlier posts, the missing links being in my humble opinion an assessment of the worth of those solutions (in the spirit of “here’s the solution, what was the problem again?”) and once again the computing time issue. Chapter 5 briefly discusses some recent developments in variational mean field approximations, which is farther from my interests and (limited) competence, but which appears as a particular class of approximate models and thus could (and should?) relate to likelihood-free methods. Chapter 6 about the current challenges of the field is presumably the most interesting in this monograph in that it produces open questions and suggests directions for future research. For instance, opposing the long term MCMC error with the short term transient part. Or the issue of comparing different implementations in a practical and timely perspective.

## Optimization Monte Carlo: Efficient and embarrassingly parallel likelihood-free inference

Posted in Books, Statistics, Travel with tags , , , , , , , , on December 16, 2015 by xi'an

Ted Meeds and Max Welling have not so recently written about an embarrassingly parallel approach to ABC that they call optimisation Monte Carlo. [Danke Ingmar for pointing out the reference to me.] They start from a rather innocuous rephrasing of the ABC posterior, writing the pseudo-observations as deterministic transforms of the parameter and of a vector of uniforms. Innocuous provided this does not involve an infinite number of uniforms, obviously. Then they suddenly switch to the perspective that, for a given uniform vector u, one should seek the parameter value θ that agrees with the observation y. A sort of Monte Carlo inverse regression: if

y=f(θ,u),

then invert this equation in θ. This is quite clever! Maybe closer to fiducial than true Bayesian statistics, since the prior does not occur directly [only as a weight p(θ)], but if this is manageable [and it all depends on the way f(θ,u) is constructed], this should perform better than ABC! After thinking about it a wee bit more in London, though, I realised this was close to impossible in the realistic examples I could think of. But I still like the idea and want to see if anything at all can be made of this…

“However, it is hard to detect if our optimization succeeded and we may therefore sometimes reject samples that should not have been rejected. Thus, one should be careful not to create a bias against samples u for which the optimization is difficult. This situation is similar to a sampler that will not mix to remote local optima in the posterior distribution.”

Now, the paper does not go that way but keeps the ε-ball approach as in regular ABC, to derive an approximation of the posterior density. For a while I was missing the difference between the centre of the ball and the inverse of the above equation, bottom of page 3. But then I realised the former was an approximation to the latter. When the authors discuss their approximation in terms of the error ε, I remain unconvinced by the transfer of the tolerance to the optimisation error, as those are completely different notions. This also applies to the use of a Jacobian in the weight, which seems out of place since this Jacobian appears in a term associated with (or replacing) the likelihood, f(θ,u), which is then multiplied by the prior p(θ). (Assuming a Jacobian exists, which is unclear when considering most simulation patterns use hard bounds and indicators.) When looking at the toy examples, it however makes sense to have a Jacobian since the selected θ’s are transforms of the u’s. And the p(θ)’s are simply importance weights correcting for the wrong target. Overall, the appeal of the method proposed in the paper remains unclear to me. Most likely because I did not spend enough time over it.

## parallelizing MCMC with random partition trees

Posted in Books, pictures, Statistics, University life with tags , , , , , , , on July 7, 2015 by xi'an

Another arXived paper in the recent series about big or tall data and how to deal with it by MCMC. Which pertains to the embarrassingly parallel category. As in the previously discussed paper, the authors (Xiangyu Wang, Fangjian Guo, Katherine Heller, and David Dunson) chose to break the prior itself into m bits… (An additional point from last week criticism is that, were an unbiased estimator of each term in the product available in an independent manner, the product of the estimators would be the estimator of the product.) In this approach, the kernel estimator of Neiswanger et al. is replaced with a random partition tree histogram. Which uses the same block partition across all terms in the product representation of the posterior. And hence ends up with a smaller number of terms in the approximation, since it does not explode with m. (They could have used Mondrian forests as well! However I think their quantification of the regular kernel method cost as an O(Tm) approach does not account for Neiswanger et al.’s trick in exploiting the product of kernels…) The so-called tree estimate can be turned into a random forest by repeating the procedure several times and averaging. The simulation comparison runs in favour of the current method when compared with other consensus or non-parametric methods. Except in the final graph (Figure 5) which shows several methods achieving the same prediction accuracy against running time.

## variational consensus Monte Carlo

Posted in Books, Statistics, University life with tags , , , , , , on July 2, 2015 by xi'an

“Unfortunately, the factorization does not make it immediately clear how to aggregate on the level of samples without first having to obtain an estimate of the densities themselves.” (p.2)

The recently arXived variational consensus Monte Carlo is a paper by Maxim Rabinovich, Elaine Angelino, and Michael Jordan that approaches the consensus Monte Carlo principle from a variational perspective. As in the embarrassingly parallel version,  the target is split into a product of K terms, each being interpreted as an unnormalised density and being fed to a different parallel processor. The most natural partition is to break the data into K subsamples and to raise the prior to the power 1/K in each term. While this decomposition makes sense from a storage perspective, since each bit corresponds to a different subsample of the data, it raises the question of the statistical pertinence of splitting the prior and my feelings about it are now more lukewarm than when I commented on the embarrassingly parallel version,  mainly for the reason that it is not reparameterisation invariant—getting different targets if one does the reparameterisation before or after the partition—and hence does not treat the prior as the reference measure it should be. I therefore prefer the version where the same original prior is attached to each part of the partitioned likelihood (and even more the random subsampling approaches discussed in the recent paper of Bardenet, Doucet, and Holmes). Another difficulty with the decomposition is that a product of densities is not a density in most cases (it may even be of infinite mass) and does not offer a natural path to the analysis of samples generated from each term in the product. Nor an explanation as to why those samples should be relevant to construct a sample for the original target.

“The performance of our algorithm depends critically on the choice of aggregation function family.” (p.5)

Since the variational Bayes approach is a common answer to complex products models, Rabinovich et al. explore the use of variational Bayes techniques to build the consensus distribution out of the separate samples. As in Scott et al., and Neiswanger et al., the simulation from the consensus distribution is a transform of simulations from each of the terms in the product, e.g., a weighted average. Which determines the consensus distribution as a member of an aggregation family defined loosely by a Dirac mass. When the transform is a sum of individual terms, variational Bayes solutions get much easier to find and the authors work under this restriction… In the empirical evaluation of this variational Bayes approach as opposed to the uniform and Gaussian averaging options in Scott et al., it improves upon those, except in a mixture example with a large enough common variance.

In fine, despite the relevance of variational Bayes to improve the consensus approximation, I still remain unconvinced about the use of the product of (pseudo-)densities and the subsequent mix of simulations from those components, for the reason mentioned above and also because the tail behaviour of those components is not related with the tail behaviour of the target. Still, this is a working solution to a real problem and as such is a reference for future works.

## early rejection MCMC

Posted in Books, Statistics, University life with tags , , , , , , , , on June 16, 2014 by xi'an

In a (relatively) recent Bayesian Analysis paper on efficient MCMC algorithms for climate models, Antti Solonen, Pirkka Ollinaho, Marko Laine, Heikki Haario, Johanna Tamminen and Heikki Järvinen propose an early rejection scheme to speed up Metropolis-Hastings algorithms. The idea is to consider a posterior distribution (proportional to)

$\pi(\theta|y)= \prod_{k=1}^nL_i(\theta|y)$

such that all terms in the product are less than one and to compare the uniform u in the acceptance step of the Metropolis-Hastings algorithm to

$L_1(\theta'|y)/\pi(\theta|y),$

then, if u is smaller than the ratio, to

$L_1(\theta'|y)L_2(\theta'|y)/\pi(\theta|y),$

and so on, until the new value has been rejected or all terms have been evaluated. The scheme obviously stops earlier than the regular Metropolis-Hastings algorithm, at no significant extra cost when the product above does not factor through a sufficient statistic. Solonen et al.  suggest ordering the terms so that the computationally simpler ones are computed first. The upper bound assumption requires and is equivalent to finding the maximum on each term of the product, though, which may be costly in its own for non-standard distributions. With my students Marco Banterle and Clara Grazian, we actually came upon this paper when preparing our delayed acceptance paper as (a) it belongs to the same category of accelerated MCMC methods (delayed acceptance and early rejection are somehow synonymous!) and (b) it mentions the early prefetching papers of Brockwell (2005) and Strid (2009).

“The acceptance probability in ABC is commonly very low, and many proposals are rejected, and ER can potentially help to detect the rejections sooner.”

In the conclusion, Solonen et al. point out a possible link with ABC but, apart from the general idea of rejecting earlier by looking at a subsample or at a proxy simulation of a summary statistics, which is also the idea at the core of Dennis Prangle’s lazy ABC, there is no obvious impact on a likelihood-free method like ABC.