## asymptotically exact inference in likelihood-free models [a reply from the authors]

Posted in R, Statistics on December 1, 2016 by xi'an

[Following my post of last Tuesday, Matt Graham commented on the paper in great detail. Here are those comments. A nicer HTML version of the Markdown reply below is also available on GitHub.]

Thanks for the comments on the paper!

A few additional replies to augment what Amos wrote:

This however sounds somewhat intense in that it involves a quasi-Newton resolution at each step.

The method is definitely computationally expensive. If the constraint function maps an M-dimensional input space to an N-dimensional output space, with M ≥ N, for large N the dominant costs at each timestep are usually the evaluation of the constraint Jacobian ∂c/∂u (with reverse-mode automatic differentiation this can be evaluated at a cost of O(N) generator / constraint evaluations) and the Cholesky decomposition of the Jacobian product (∂c/∂u)(∂c/∂u)ᵀ, with O(N³) cost (though in many cases, e.g. i.i.d. or Markovian simulated data, structure in the generator Jacobian can be exploited to give a significantly reduced cost). Each inner quasi-Newton update involves a pair of triangular solve operations with O(N²) cost, two matrix-vector multiplications with O(MN) cost, and a single constraint / generator function evaluation; the number of quasi-Newton updates required for convergence in the numerical experiments tended to be much less than N, hence the quasi-Newton iteration tended not to be the main cost.
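The linear-algebra costs listed above can be made concrete with a toy sketch (this is an illustration of the operation counts, not the authors' implementation; shapes and names are mine):

```python
# Toy illustration of the per-step costs described above: form the Gram
# matrix of the constraint Jacobian, Cholesky-decompose it once (O(N^3)),
# then each subsequent solve against it costs two O(N^2) triangular solves
# plus O(M N) matrix-vector products with the Jacobian.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
M, N = 50, 20                      # input dimension M >= constraint dimension N

J = rng.normal(size=(N, M))        # stand-in for the Jacobian dc/du (N x M)
G = J @ J.T                        # (dc/du)(dc/du)^T, N x N
chol = cho_factor(G)               # O(N^3), done once per proposed step

def project_onto_tangent(v):
    """Remove the component of v normal to the constraint manifold:
    v - J^T (J J^T)^{-1} J v, reusing the cached Cholesky factor."""
    return v - J.T @ cho_solve(chol, J @ v)   # two triangular solves + O(MN)

v = rng.normal(size=M)
v_tan = project_onto_tangent(v)    # lies in the tangent space: J @ v_tan = 0
```

The point of caching the Cholesky factor is that the O(N³) decomposition is paid once per proposed move, while each of the (typically few) inner quasi-Newton iterations only pays the cheaper O(N²) and O(MN) terms.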

The high computation cost per update is traded off however with often being able to make much larger proposed moves in high-dimensional state spaces with a high chance of acceptance compared to ABC MCMC approaches. Even in the relatively small Lotka-Volterra example we provide which has an input dimension of 104 (four inputs which map to ‘parameters’, and 100 inputs which map to ‘noise’ variables), the ABC MCMC chains using the coarse ABC kernel radius ϵ=100 with comparably very cheap updates were significantly less efficient in terms of effective sample size / computation time than the proposed constrained HMC approach. This was in large part due to the elliptical slice sampling updates in the ABC MCMC chains generally collapsing down to very small moves even for this relatively coarse ϵ. Performance was even worse using non-adaptive ABC MCMC methods and for smaller ϵ, and for higher input dimensions (e.g. using a longer sequence with correspondingly more random inputs) the comparison becomes even more favourable for the constrained HMC approach.

## automatic variational ABC

Posted in pictures, Statistics on July 8, 2016 by xi'an

“Stochastic Variational inference is an appealing alternative to the inefficient sampling approaches commonly used in ABC.”

Moreno et al. [including Ted Meeds and Max Welling] recently arXived a paper merging variational inference and ABC. The argument for turning variational is computational speedup. The traditional (in variational inference) divergence decomposition of the log-marginal likelihood is replaced by an ABC version, parameterised in terms of intrinsic generators (i.e., generators that do not depend on cyber-parameters, like the U(0,1) or the N(0,1) generators). Or simulation code in the authors’ terms. Which leads to the automatic aspect of the approach. In the paper the derivation of the gradient is indeed automated.
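The reparameterisation in terms of intrinsic generators can be illustrated with a generic pathwise-gradient toy example (a sketch of the general idea, not the authors' algorithm; the objective and names are mine):

```python
# Pathwise ("reparameterisation") gradient: writing a N(mu, sigma^2) draw as
# mu + sigma * z with z ~ N(0, 1) an intrinsic generator makes the Monte Carlo
# expectation differentiable in (mu, sigma). Toy objective E[theta^2], whose
# true gradient is d/dmu = 2*mu and d/dsigma = 2*sigma.
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.5, 0.7
z = rng.standard_normal(200_000)        # intrinsic N(0, 1) draws
theta = mu + sigma * z                  # simulated "parameter" draws

grad_mu = np.mean(2 * theta * 1.0)      # chain rule: d theta / d mu = 1
grad_sigma = np.mean(2 * theta * z)     # chain rule: d theta / d sigma = z
```

Because the randomness lives entirely in the parameter-free generator, differentiating through the simulation code is mechanical, which is what makes the gradient derivation automatable.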

“One issue is that even assuming that the ABC likelihood is an unbiased estimator of the true likelihood (which it is not), taking the log introduces a bias, so that we now have a biased estimate of the lower bound and thus biased gradients.”

I wonder how much of an issue this is, since we consider the variational lower bound. To be optimised in terms of the parameters of the variational posterior. Indeed, the endpoint of the analysis is to provide an optimal variational approximation, which remains an approximation whether or not the likelihood estimator is unbiased. A more “severe” limitation may be in the inversion constraint, since it seems to eliminate Beta or Gamma distributions. (Even though calling qbeta(runif(1),a,b) definitely is achievable… And not rejected by a Kolmogorov-Smirnov test.)
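The inversion trick alluded to above translates directly to Python (the analogue of the `qbeta(runif(1),a,b)` call, using SciPy's quantile function):

```python
# A Beta(a, b) draw written as a deterministic transform of a U(0, 1)
# "intrinsic generator": the Python analogue of qbeta(runif(1), a, b) in R.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b = 2.0, 5.0

u = rng.uniform(size=10_000)      # intrinsic U(0, 1) draws
x = stats.beta.ppf(u, a, b)       # quantile transform: exact Beta(a, b) samples
```

The resulting samples are exact Beta(a,b) draws, so the constraint is indeed achievable in principle; the practical issue is rather that the quantile function has no closed form to differentiate through.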

Incidentally, I discovered through the paper the existence of the Kumaraswamy distribution, whose main appeal seems to be the ability to produce a closed-form quantile function, while bearing some resemblance to the Beta distribution. (Another arXival by Baltasar Trancón y Widemann studies some connections between those, but does not tell how to select the parameters to optimise the similarity.)
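The closed-form quantile function is easy to verify from the Kumaraswamy CDF F(x) = 1 − (1 − xᵃ)ᵇ on (0,1), which inverts exactly:

```python
# The Kumaraswamy(a, b) distribution on (0, 1) has CDF F(x) = 1 - (1 - x^a)^b,
# hence a closed-form quantile function -- the feature singled out above,
# which permits exact inverse-CDF sampling from uniforms.
import numpy as np

def kumaraswamy_quantile(u, a, b):
    """Q(u) = (1 - (1 - u)^(1/b))^(1/a), the exact inverse of F."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def kumaraswamy_cdf(x, a, b):
    return 1.0 - (1.0 - x ** a) ** b

rng = np.random.default_rng(1)
a, b = 2.0, 3.0
u = rng.uniform(size=100_000)
x = kumaraswamy_quantile(u, a, b)   # exact inverse-CDF samples
```

This is precisely what the Beta distribution lacks, and why the Kumaraswamy is attractive as a Beta stand-in for reparameterised variational schemes.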

## Bayesian Indirect Inference and the ABC of GMM

Posted in Books, Statistics, University life on February 17, 2016 by xi'an

“The practicality of estimation of a complex model using ABC is illustrated by the fact that we have been able to perform 2000 Monte Carlo replications of estimation of this simple DSGE model, using a single 32 core computer, in less than 72 hours.” (p.15)

Earlier this week, Michael Creel and his coauthors arXived a long paper with the above title, where ABC relates to approximate Bayesian computation. In short, this paper provides deeper theoretical foundations for the local regression post-processing of Mark Beaumont and his coauthors (2002). And some natural extensions. But apparently considering one univariate transform η(θ) of interest at a time. The theoretical validation of the method is that the resulting estimators converge at speed √n under some regularity assumptions. Including the identifiability of the parameter θ in the mean of the summary statistics T, which relates to our consistency result for ABC model choice. And a CLT on an available (?) preliminary estimator of η(θ).
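The local regression post-processing of Beaumont et al. (2002) mentioned above can be sketched as follows (a minimal version with an Epanechnikov kernel; the function and variable names are mine, not the paper's):

```python
# Minimal sketch of the Beaumont et al. (2002) local linear regression
# adjustment: accepted parameter draws are corrected by a kernel-weighted
# regression of theta on the mismatch between simulated and observed
# summary statistics.
import numpy as np

def regression_adjust(theta, s_sim, s_obs, eps):
    """theta: (n,) accepted draws; s_sim: (n, d) summaries; s_obs: (d,)."""
    dist = np.linalg.norm(s_sim - s_obs, axis=1)
    keep = dist < eps
    w = 1.0 - (dist[keep] / eps) ** 2                 # Epanechnikov weights
    X = np.column_stack([np.ones(keep.sum()), s_sim[keep] - s_obs])
    # weighted least squares: theta ~ alpha + beta' (s - s_obs)
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * theta[keep]))
    # correct each draw by the fitted effect of the summary mismatch
    return theta[keep] - (s_sim[keep] - s_obs) @ beta[1:]
```

The adjusted draws behave as if they had been simulated with summaries equal to the observed ones, which is what buys the improved convergence rate studied in the paper.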

The paper also includes a GMM version of ABC whose appeal is less clear to me, as it seems to rely on a preliminary estimator of the univariate transform of interest η(θ), which is then randomised by a normal random walk. While this sounds a wee bit like noisy ABC, it differs from this generic approach as the model is not assumed to be known, but rather available through an asymptotic Gaussian approximation. (When the preliminary estimator is available in closed form, I do not see the appeal of adding this superfluous noise. When it is unavailable, it is unclear why a normal perturbation can be produced.)

“[In] the method we study, the estimator is consistent, asymptotically normal, and asymptotically as efficient as a limited information maximum likelihood estimator. It does not require either optimization, or MCMC, or the complex evaluation of the likelihood function.” (p.3)

Overall, I have trouble relating the paper to (my?) regular ABC in that the outcome of the supported procedures is an estimator rather than a posterior distribution. Those estimators are demonstrably endowed with convergence properties, including quantile estimates that can be exploited for credible intervals, but this does not produce a posterior distribution in the classical Bayesian sense. For instance, how can one run model comparison in this framework? Furthermore, each of those inferential steps requires solving another possibly costly optimisation problem.

“Posterior quantiles can also be used to form valid confidence intervals under correct model specification.” (p.4)

Nitpicking(ly), this statement is not correct in that posterior quantiles produce valid credible intervals and only asymptotically correct confidence intervals!

“A remedy is to choose the prior π(θ) iteratively or adaptively as functions of initial estimates of θ, so that the “prior” becomes dependent on the data, which can be denoted as π(θ|T).” (p.6)

This modification of the basic ABC scheme relying on simulation from the prior π(θ) can be found in many earlier references and the iterative construction of a better fitted importance function rather closely resembles ABC-PMC. Once again nitpicking(ly), the importance weights are defined therein (p.6) as the inverse of what they should be.

## weak convergence (…) in ABC

Posted in Books, Statistics, University life on January 18, 2016 by xi'an

Samuel Soubeyrand and Eric Haon-Lasportes recently published a paper in Statistics and Probability Letters that has some common features with the ABC consistency paper we wrote a few months ago with David Frazier and Gael Martin. And to the recent Li and Fearnhead paper on the asymptotic normality of the ABC distribution. Their approach is however based on a Bernstein-von Mises [CLT] theorem for the MLE or a pseudo-MLE. They assume that the density of this estimator is asymptotically equivalent to a Normal density, in which case the true posterior conditional on the estimator is also asymptotically equivalent to a Normal density centred at the (p)MLE. Which also makes the ABC distribution normal when both the sample size grows to infinity and the tolerance decreases to zero. Which is not completely unexpected. However, in complex settings, establishing the asymptotic normality of the (p)MLE may prove a formidable or even impossible task.

## MCMskv #5 [future with a view]

Posted in Kids, Mountains, R, Statistics, Travel, University life on January 12, 2016 by xi'an

As I am flying back to Paris (with an afternoon committee meeting in München in-between), I am reminiscing on the superlative scientific quality of this MCMski meeting, on the novel directions in computational Bayesian statistics exhibited therein, and on the potential settings for the next meeting. If any.

First, as hopefully obvious from my previous entries, I found the scientific program very exciting, with almost uniformly terrific talks, and a coverage of the field of computational Bayesian statistics that is perfectly tuned to my own interest. In that sense, MCMski is my “top one” conference! Even without considering the idyllic location. While some of the talks were about papers I had already read (and commented here), others brought new vistas and ideas. If one theme is to emerge from this meeting it has to be the one of approximate and noisy algorithms, with a wide variety of solutions and approaches to overcome complexity issues. If anything, I wish the solutions would also incorporate the Boxian fact that the statistical models themselves are approximate. Overall, a fantastic program (says one member of the scientific committee).

Second, as with previous MCMski meetings, I again enjoyed the unique ambience of the meeting, which always feels more relaxed and friendly than other conferences of a similar size, maybe because of the après-ski atmosphere or of the special coziness provided by luxurious mountain hotels. This year's hotel was particularly pleasant, with non-guests like myself able to partake of some of their facilities. A big thank you to Anto for arranging so meticulously all the details of such a large meeting!!! I am even more grateful when realising this is the third time Anto has taken on the heavy load of organising MCMski. Grazie mille!

Since this is a [and even the!] BayesComp conference, the current section program chair and board must decide on the structure and schedule of the next meeting. A few suggestions if I may: I would scrap entirely the name MCMski from the next conference as (a) it may sound like academic tourism for unaware bystanders (who only need to check the program of any of the MCMski conferences to stand reassured!) and (b) its topics go way beyond MCMC. Given the large attendance and equally large proportion of young researchers, I would also advise against hosting the conference in a ski resort for both cost and accessibility reasons [as we had already discussed after MCMskiv], in favour of a large enough town to offer a reasonable range of accommodations and of travel options. Like Chamonix, Innsbruck, Reykjavik, or any place with a major airport about one hour away… If nothing is available with skiing possibilities, so be it! While the outdoor inclinations of the early organisers induced us to pick locations where skiing over lunch break was a perk, any accessible location that allows for a concentration of researchers in a small area and for the ensuing day-long exchange is fine! Among the novelties in the program, the tutorials and the Breaking news! sessions were quite successful (says one member of the scientific committee). And should be continued in one format or another. Maybe a more programming-oriented thread could be added as well… And as we had mentioned earlier, to see a stronger involvement of the Young Bayesian section in the program would be great! (Even though the current meeting already had many young researcher talks.)

## MCMskv #3 [town with a view]

Posted in Statistics on January 8, 2016 by xi'an

Third day at MCMskv, where I took advantage of the gap left by the elimination of the Tweedie Race [second time in a row!] to complete and submit our mixture paper. Despite the nice weather. The rest of the day was quite busy with David Dunson giving a plenary talk on various approaches to approximate MCMC solutions, with a broad overview of the potential methods and of the need for better solutions. (On a personal basis, great line from David: “five minutes or four minutes?”. It almost beat David’s question on the previous day, about the weight of a finch, that sounded suspiciously close to the question about the air-speed velocity of an unladen swallow. I was quite surprised the speaker did not reply with the Arthurian “An African or a European finch?”) In particular, I appreciated the notion that some problems were calling for a reduction in the number of parameters, rather than the number of observations. At which point I wrote down “multiscale approximations required” in my black pad, a requirement David made a few minutes later. (The talk conditions were also much better than during Michael’s talk, in that the man standing between the screen and myself was David rather than the cameraman! Joke apart, it did not really prevent me from reading the slides, except for most of the jokes in small print!)

The first session of the morning involved a talk by Marc Suchard, who used continued fractions to find a closed form likelihood for the SIR epidemiology model (I love continued fractions!), and a talk by Donatello Telesca who studied non-local priors to build a regression tree. While I am somewhat skeptical about non-local testing priors, I found this approach to the construction of a tree quite interesting! In the afternoon, I obviously went to the intractable likelihood session, with talks by Chris Oates on a control variate method for doubly intractable models, Brenda Vo on mixing sequential ABC with Bayesian bootstrap, and Gael Martin on our consistency paper. I was not aware of the Bayesian bootstrap proposal and need to read through the paper, as I fail to see the appeal of the bootstrap part! I later attended a session on exact Monte Carlo methods that was pleasantly homogeneous. With talks by Paul Jenkins (Warwick) on the exact simulation of the Wright-Fisher diffusion, Anthony Lee (Warwick) on designing perfect samplers for chains with atoms, Chang-han Rhee and Sebastian Vollmer on extensions of the Glynn-Rhee debiasing technique I previously discussed on the blog. (Once again, I regretted having to make a choice between the parallel sessions!)

The poster session (after a quick home-made pasta dish with an exceptional Valpolicella!) was almost universally great and with just the right number of posters to go around all of them in the allotted time. With in particular the Breaking News! posters of Giacomo Zanella (Warwick), Beka Steorts and Alexander Terenin. A high quality session that made me regret not touring the previous one due to my own poster presentation.

## scaling the Gibbs posterior credible regions

Posted in Books, Statistics, University life on September 11, 2015 by xi'an

“The challenge in implementation of the Gibbs posterior is that it depends on an unspecified scale (or inverse temperature) parameter.”

A new paper by Nick Syring and Ryan Martin was arXived today on the same topic as the one I discussed last January. The setting is the same as with empirical likelihood, namely that the distribution of the data is not specified, while parameters of interest are defined via moments or, more generally, by minimising a loss function. A pseudo-likelihood can then be constructed as a substitute to the likelihood, in the spirit of Bissiri et al. (2013). It is called a “Gibbs posterior” distribution in this paper. So the “Gibbs” in the title has no link with the “Gibbs” in Gibbs sampler, since inference is conducted with respect to this pseudo-posterior. Somewhat logically (!), as n grows to infinity, the pseudo-posterior concentrates upon the pseudo-true value of θ minimising the expected loss, hence asymptotically resembles the M-estimator associated with this criterion. As I pointed out in the discussion of Bissiri et al. (2013), one major hurdle when turning a loss into a log-likelihood is that it is at best defined up to a scale factor ω. The authors choose ω so that the Gibbs posterior

$\exp\{-\omega n l_n(\theta,x) \}\pi(\theta)$

is well-calibrated. Where ln is the empirical averaged loss. So the Gibbs posterior is part of the matching prior collection. In practice the authors calibrate ω by a stochastic optimisation iterative process, with bootstrap on the side to evaluate coverage. They briefly consider empirical likelihood as an alternative, on a median regression example, where they show that their “Gibbs confidence intervals (…) are clearly the best” (p.12). Apart from the relevance of being “well-calibrated”, and the asymptotic nature of the results. and the dependence on the parameterisation via the loss function, one may also question the possibility of using this approach in large dimensional cases where all of or none of the parameters are of interest.