Archive for likelihood-free methods

likelihood-free inference by ratio estimation

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on September 9, 2019 by xi'an

“This approach for posterior estimation with generative models mirrors the approach of Gutmann and Hyvärinen (2012) for the estimation of unnormalised models. The main difference is that here we classify between two simulated data sets while Gutmann and Hyvärinen (2012) classified between the observed data and simulated reference data.”

A 2018 arXiv posting by Owen Thomas et al. (including my colleague at Warwick, Rito Dutta, CoI warning!) about estimating the likelihood (and the posterior) when it is intractable. Likelihood-free but not ABC, since the ratio of likelihood to marginal is estimated in a non- or semi-parametric (and biased) way. Following Geyer’s fabulous 1994 estimate of an unknown normalising constant via logistic regression, the current paper, which I read in preparation for my discussion on ABC optimal design in Salzburg, uses probabilistic classification and an exponential family representation of the ratio. Opposing data from the density and data from the marginal, assuming both can be readily produced. The logistic regression minimizing the asymptotic classification error is the logistic transform of the log-ratio. For a finite (double) sample, this minimization thus leads to an empirical version of the ratio. Or to a smooth version if the log-ratio is represented as a convex combination of summary statistics, turning the approximation into an exponential family, which is a clever way to come full circle towards ABC notions. And synthetic likelihood. Although with a difference in estimating the exponential family parameters β(θ) by minimizing the classification error, parameters that are indeed conditional on the parameter θ. Actually the paper introduces a further penalisation or regularisation term on those parameters β(θ), which could have been processed by Bayesian Lasso instead. This step is essentially driving the selection of the summaries, except that it is for each value of the parameter θ, at the expense of a cross-validation step. This is quite an original approach, as far as I can tell, but I wonder at the link with more standard density estimation methods, in particular in terms of the precision of the resulting estimate (and the speed of convergence with the sample size, if convergence there is).
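To make the classification idea concrete, here is a minimal Python sketch, not the authors' implementation. Everything specific in it is an assumption chosen for illustration: a toy model x ~ N(θ,1) with prior θ ~ N(0,2²), summaries ψ(x) = (1, x, x²), equal sample sizes from the model and from the marginal, and a plain Newton fit with backtracking instead of the penalised, cross-validated fit of the paper. With balanced samples, the fitted logit β(θ)·ψ(x) is an estimate of log p(x|θ)/p(x).

```python
import math
import random


def summaries(x):
    """Summary vector psi(x) = (1, x, x^2); an illustrative choice."""
    return [1.0, x, x * x]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_loss(beta, feats, labels):
    """Average logistic loss, the empirical classification criterion."""
    total = 0.0
    for f, y in zip(feats, labels):
        z = sum(b * v for b, v in zip(beta, f))
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
    return total / len(labels)


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def fit_log_ratio(theta, n=2000, prior_sd=2.0, iters=15, seed=1):
    """Return beta(theta) with beta . psi(x) estimating log p(x|theta)/p(x)."""
    rng = random.Random(seed)
    from_model = [rng.gauss(theta, 1.0) for _ in range(n)]           # label 1
    from_marginal = [rng.gauss(rng.gauss(0.0, prior_sd), 1.0)        # label 0
                     for _ in range(n)]
    feats = [summaries(x) for x in from_model + from_marginal]
    labels = [1.0] * n + [0.0] * n
    d = len(feats[0])
    beta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        hess = [[1e-8 * (i == j) for j in range(d)] for i in range(d)]
        for f, y in zip(feats, labels):
            p = sigmoid(sum(b * v for b, v in zip(beta, f)))
            for i in range(d):
                grad[i] += (p - y) * f[i]
                for j in range(d):
                    hess[i][j] += p * (1.0 - p) * f[i] * f[j]
        step = solve(hess, grad)
        # Backtracking: halve the Newton step until the loss does not increase.
        t, current = 1.0, logistic_loss(beta, feats, labels)
        while t > 1e-4:
            trial = [b - t * s for b, s in zip(beta, step)]
            if logistic_loss(trial, feats, labels) <= current:
                beta = trial
                break
            t /= 2.0
        if max(abs(t * s) for s in step) < 1e-8:
            break
    return beta
```

In this toy case the true log-ratio is itself quadratic in x, namely −½(x−θ)² + x²/10 + ½ log 5, so `fit_log_ratio(1.0)` can be compared with the coefficients (½ log 5 − ½, 1, −0.4). Note that the fit has to be redone for each value of θ, which is the cost mentioned above.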

likelihood-free Bayesian design [SimStat 2019 discussion]

Posted in Statistics on September 5, 2019 by xi'an

O’Bayes 19/2

Posted in Books, pictures, Running, Travel, University life on July 1, 2019 by xi'an

One talk on Day 2 of O’Bayes 2019 was by Ryan Martin on data dependent priors (or “priors”). Which I have already discussed on this blog. Including the notion of a Gibbs posterior about quantities that “are not always defined through a model” [which is debatable if one sees it as part of a semi-parametric model]. Gibbs posterior that is built through a pseudo-likelihood constructed from the empirical risk, which reminds me of Bissiri, Holmes and Walker. Although requiring a prior on this quantity that is not part of a model. And is not necessarily a true posterior and not necessarily with the same concentration rate as a true posterior. Constructing a data-dependent distribution on the parameter does not necessarily mean an interesting inference and, to keep up with the theme of the conference, has no automated claim to [more] “objectivity”.

And after calling a prior both Beauty and The Beast!, Erlis Ruli argued about a “bias-reduction” prior where the prior is the solution to a differential equation related to some cumulants, connected with an earlier work of David Firth (Warwick). An interesting conundrum is how to create an MCMC algorithm when the prior is that intractable, with possible help from PDMP techniques like the Zig-Zag sampler.

While Peter Orbanz’ talk was centred on a central limit theorem under group invariance, further penalised by being the last of the (sun) day, Peter did a magnificent job of presenting the result and motivating each term. It reminded me of the work Jim Bondar was doing in Ottawa in the 1980’s on Haar measures for Bayesian inference. Including the notion of amenability [a term due to von Neumann] I had not met since then. (Neither have I met Jim since the last summer I spent in Carleton.) The CLT and associated LLN are remarkable in that the average is not over observations but over shifts of the same observation under elements of a sub-group of transformations. I wondered as well at the potential connection with the Read Paper of Kong et al. in 2003 on the use of group averaging for Monte Carlo integration [connection apart from the fact that both discussants, Michael Evans and myself, are present at this conference].

A precursor of ABC-Gibbs

Posted in Books, R, Statistics on June 7, 2019 by xi'an

Following our arXival of ABC-Gibbs, Dennis Prangle pointed out to us a 2016 paper by Athanasios Kousathanas, Christoph Leuenberger, Jonas Helfer, Mathieu Quinodoz, Matthieu Foll, and Daniel Wegmann, Likelihood-Free Inference in High-Dimensional Models, published in Genetics, Vol. 203, 893–904 in June 2016. This paper contains a version of ABC-Gibbs where parameters are sequentially simulated from conditionals that depend on the data only through low-dimensional conditionally sufficient statistics. I had actually blogged about this paper in 2015 but since then completely forgotten about it. (The comments I had made at the time still hold, already pertaining to the coherence or lack thereof of the sampler. I had also forgotten I had run an experiment of an exact Gibbs sampler with incoherent conditionals, which then seemed to converge to something, if not the exact posterior.)

“All ABC algorithms, including ABC-PaSS introduced here, require that statistics are sufficient for estimating the parameters of a given model. As mentioned above, parameter-wise sufficient statistics as required by ABC-PaSS are trivial to find for distributions of the exponential family. Since many population genetics models do not follow such distributions, sufficient statistics are known for the most simple models only. For more realistic models involving multiple populations or population size changes, only approximately-sufficient statistics can be found.”

While Gibbs sampling is not mentioned in the paper, this is indeed a form of ABC-Gibbs, with the advantage of not facing convergence issues thanks to the sufficiency. The drawback being that this setting is restricted to exponential families and hence difficult to extrapolate to non-exponential distributions, as using almost-sufficient (or not) summary statistics leads to incompatible conditionals and thus jeopardises the convergence of the sampler. When thinking a wee bit more about the case treated by Kousathanas et al., I am actually uncertain about the validity of the sampler. When the tolerance is equal to zero, this is not an issue as it reproduces the regular Gibbs sampler. Otherwise, each conditional ABC step amounts to introducing an auxiliary variable represented by the simulated summary statistic. Since the distribution of this summary statistic depends on more than the parameter for which it is sufficient, in general, it should also appear in the conditional distribution of other parameters. At least from this Gibbs perspective, it thus relies on incompatible conditionals, which makes the conditions proposed in our own paper all the more relevant.
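The component-wise mechanism discussed above can be sketched as follows. This is a toy illustration and not the algorithm of either paper: the model N(μ,σ²), the priors μ ~ N(0,3²) and σ ~ U(0.2,4), the statistics (sample mean for μ, sample standard deviation for σ), and the "keep the closest of n_prop prior draws" rule standing in for a tolerance are all assumptions made for the example.

```python
import random
import statistics


def simulate(mu, sigma, n, rng):
    """Draw n observations from the toy model N(mu, sigma^2)."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


def abc_conditional(draw_prior, simulate_stat, s_obs, n_prop, rng):
    """Approximate one conditional: among n_prop prior draws, keep the one
    whose simulated statistic lands closest to the observed statistic."""
    best, best_dist = None, float("inf")
    for _ in range(n_prop):
        cand = draw_prior(rng)
        dist = abs(simulate_stat(cand, rng) - s_obs)
        if dist < best_dist:
            best, best_dist = cand, dist
    return best


def abc_gibbs(data, n_iter=200, n_prop=30, seed=0):
    """Alternate ABC updates of mu and sigma, each using its own statistic."""
    rng = random.Random(seed)
    n = len(data)
    s_mu, s_sigma = statistics.fmean(data), statistics.stdev(data)
    mu, sigma = 0.0, 1.0
    chain = []
    for _ in range(n_iter):
        # Update mu given the current sigma, matching the sample mean only.
        mu = abc_conditional(
            lambda r: r.gauss(0.0, 3.0),
            lambda m, r: statistics.fmean(simulate(m, sigma, n, r)),
            s_mu, n_prop, rng)
        # Update sigma given the current mu, matching the sample sd only.
        sigma = abc_conditional(
            lambda r: r.uniform(0.2, 4.0),
            lambda s, r: statistics.stdev(simulate(mu, s, n, r)),
            s_sigma, n_prop, rng)
        chain.append((mu, sigma))
    return chain
```

Even in this toy case, the simulated sample mean used in the μ step has a distribution that depends on the current σ, which is the auxiliary-variable issue raised above: with a non-zero tolerance the two approximate conditionals need not be compatible with a single joint distribution.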

holistic framework for ABC

Posted in Books, Statistics, University life on April 19, 2019 by xi'an

An AISTATS 2019 paper was recently arXived by Kelvin Hsu and Fabio Ramos. Proposing an ABC method

“…consisting of (1) a consistent surrogate likelihood model that modularizes queries from simulation calls, (2) a Bayesian learning objective for hyperparameters that improves inference accuracy, and (3) a posterior surrogate density and a super-sampling inference algorithm using its closed-form posterior mean embedding.”

While this sales line sounds rather obscure to me, the authors further defend their approach against ABC-MCMC or synthetic likelihood by the points

“that (1) only one new simulation is required at each new parameter θ and (2) likelihood queries do not need to be at parameters where simulations are available.”

using an RKHS approach to approximate the likelihood or the distribution of the summary (statistic) given the parameter (value) θ. Based on the choice of a certain positive definite kernel. (As usual, I do not understand why RKHS would do better than another non-parametric approach, especially since the approach approximates the full likelihood, but I am not a non-parametrician…)

“The main advantage of using an approximate surrogate likelihood surrogate model is that it readily provides a marginal surrogate likelihood quantity that lends itself to a hyper-parameter learning algorithm”

The tolerance ε (and other cyberparameters) are estimated by maximising the approximated marginal likelihood, which happens to be available in the convenient case the prior is an anisotropic Gaussian distribution. For the simulated data in the reference table? But then missing the need for localising the simulations near the posterior? Inference is then conducted by simulating from this approximation. With the common (to RKHS) drawback that the approximation is “bounded and normalized but potentially non-positive”.
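For readers unfamiliar with this type of construction, here is a generic kernel conditional mean embedding sketch, offered as an illustration of the idea rather than as the authors' algorithm. The toy simulator (summary = θ + noise), the Gaussian kernels, and the values of `scale`, `lam` and `eps` are all assumptions. The surrogate is a weighted sum of kernels centred at the simulated summaries, with weights w(θ) = (K + mλI)⁻¹k(θ) that can be queried at any θ and that are not constrained to be positive.

```python
import math
import random


def gauss_kernel(a, b, scale):
    """Gaussian (positive definite) kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / scale) ** 2)


def solve(a, b):
    """Solve a linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def reference_table(m=60, seed=3):
    """One simulation per parameter value (toy simulator: theta + noise)."""
    rng = random.Random(seed)
    thetas = [rng.uniform(-3.0, 3.0) for _ in range(m)]
    stats = [t + rng.gauss(0.0, 0.3) for t in thetas]
    return thetas, stats


def surrogate_weights(theta, thetas, scale=0.5, lam=1e-2):
    """Weights w(theta) = (K + m*lam*I)^{-1} k(theta); may be negative."""
    m = len(thetas)
    gram = [[gauss_kernel(ti, tj, scale) + (m * lam if i == j else 0.0)
             for j, tj in enumerate(thetas)] for i, ti in enumerate(thetas)]
    k_vec = [gauss_kernel(theta, tj, scale) for tj in thetas]
    return solve(gram, k_vec)


def surrogate_likelihood(y, theta, thetas, stats, eps=0.3):
    """Weighted sum of Gaussian kernels of width eps centred at the summaries."""
    w = surrogate_weights(theta, thetas)
    dens = [gauss_kernel(y, s, eps) / (eps * math.sqrt(2.0 * math.pi))
            for s in stats]
    return sum(wj * dj for wj, dj in zip(w, dens))
```

A call such as `surrogate_likelihood(1.0, 0.7, *reference_table())` evaluates the surrogate at a parameter value that is not in the table. Since the weights come out of a linear solve, nothing prevents the returned value from being slightly negative far from the data, in line with the “potentially non-positive” caveat.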

prepaid ABC

Posted in Books, pictures, Statistics, University life on January 16, 2019 by xi'an

Merijn Mestdagh, Stijn Verdonck, Kristof Meers, Tim Loossens, and Francis Tuerlinckx from the KU Leuven, some of whom I met during a visit to its Walloon counterpart Louvain-la-Neuve, proposed and arXived a new likelihood-free approach based on saving simulations on a large scale for future users. Future users interested in the same model. The very same model. This makes the proposal quite puzzling as I have no idea as to when situations with exactly the same experimental conditions, up to the sample size, repeat over and over again. Or even just repeat once. (Some particular settings may accommodate different sample sizes and the same prepaid database, but others as in genetics clearly do not.) I am sufficiently puzzled to suspect I have missed the message of the paper.

“In various fields, statistical models of interest are analytically intractable. As a result, statistical inference is greatly hampered by computational constraints. However, given a model, different users with different data are likely to perform similar computations. Computations done by one user are potentially useful for other users with different data sets. We propose a pooling of resources across researchers to capitalize on this. More specifically, we preemptively chart out the entire space of possible model outcomes in a prepaid database. Using advanced interpolation techniques, any individual estimation problem can now be solved on the spot. The prepaid method can easily accommodate different priors as well as constraints on the parameters. We created prepaid databases for three challenging models and demonstrate how they can be distributed through an online parameter estimation service. Our method outperforms state-of-the-art estimation techniques in both speed (with a 23,000 to 100,000-fold speed up) and accuracy, and is able to handle previously quasi inestimable models.”

I foresee potential difficulties with this proposal, like compelling all future users to rely on the same summary statistics, on the same prior distributions (the “representative amount of parameter values”), and requiring a massive storage capacity. Plus relying at its early stage on the most rudimentary form of an ABC algorithm (although not acknowledged as such), namely the rejection one. When reading the description in the paper, the proposed method indeed selects the parameters (simulated from a prior or a grid) that are producing pseudo-observations that are closest to the actual observations (or their summaries s). The subsample thus constructed is used to derive a (local) non-parametric or machine-learning predictor s=f(θ). From which a point estimator is deduced by minimising in θ a deviance d(s⁰,f(θ)).
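The three steps just described (nearest-neighbour selection, local predictor, deviance minimisation) can be sketched as below. This is a toy version under assumptions that are not from the paper: an exponential model with rate θ, the sample mean as the single summary, a local linear fit for f, a squared-error deviance, and a grid search over the range of the selected parameters.

```python
import random
import statistics


def build_prepaid_db(grid, n_obs, seed=0):
    """Simulate once per grid value and store (theta, summary) pairs."""
    rng = random.Random(seed)
    db = []
    for theta in grid:
        sample = [rng.expovariate(theta) for _ in range(n_obs)]
        db.append((theta, statistics.fmean(sample)))
    return db


def prepaid_estimate(s_obs, db, k=100, n_grid=200):
    """Point estimate of theta from the stored simulations only."""
    # Step 1 (rejection-ABC flavour): keep the k entries with closest summaries.
    near = sorted(db, key=lambda e: abs(e[1] - s_obs))[:k]
    thetas = [t for t, _ in near]
    stats = [s for _, s in near]
    # Step 2: local linear predictor s = f(theta) = a + b * theta (least squares).
    t_bar, s_bar = statistics.fmean(thetas), statistics.fmean(stats)
    b = (sum((t - t_bar) * (s - s_bar) for t, s in near)
         / sum((t - t_bar) ** 2 for t in thetas))
    a = s_bar - b * t_bar
    # Step 3: minimise d(s_obs, f(theta)) over the range covered by the neighbours.
    lo, hi = min(thetas), max(thetas)
    cands = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    return min(cands, key=lambda t: (s_obs - (a + b * t)) ** 2)


if __name__ == "__main__":
    grid = [0.5 + 4.5 * i / 999 for i in range(1000)]
    db = build_prepaid_db(grid, n_obs=400)   # the costly, "prepaid" part
    print(prepaid_estimate(0.5, db))         # the cheap, per-user part
```

The sketch makes the dependence on the stored design explicit: the database fixes the model, the summary, the sample size `n_obs` and the range of the grid, so a user with a different sample size or a parameter outside the grid cannot reuse it as is.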

The paper does not expand much on the theoretical justifications of the approach (including the appendix that covers a formal situation where the prepaid grid conveniently covers the observed statistics). And thus does not explain on which basis confidence intervals should offer nominal coverage for the prepaid method. Instead, the paper runs comparisons with Simon Wood’s (2010) synthetic likelihood maximisation (Ricker model with three parameters), the rejection ABC algorithm (species dispersion trait model with four parameters), while the Leaky Competing Accumulator (with four parameters as well) seemingly enjoys no alternative. Which is strange since the first step of the prepaid algorithm is an ABC step, but I am unfamiliar with this model. Unsurprisingly, in all these cases, given that the simulation has been done prior to the computing time for the prepaid method and not for either synthetic likelihood or ABC, the former enjoys a massive advantage from the start.

“The prepaid method can be used for a very large number of observations, contrary to the synthetic likelihood or ABC methods. The use of very large simulated data sets allows investigation of large-sample properties of the estimator”

To return to the general proposal and my major reservation or misunderstanding, for different experiments, the (true or pseudo-true) value of the parameter will not be the same, I presume, and hence the region of interest [or grid] will differ. While, again, the computational gain is de facto obvious [since the costly production of the reference table is not repeated], and, to repeat myself, makes the comparison with methods that do require a massive number of simulations from scratch massively in favour of the prepaid option, I do not see a convenient way of recycling these prepaid simulations for another setting, that is, when some experimental factors, sample size or collection, or even just the priors, do differ. Again, I may be missing the point, especially in a specific context like repeated psychological experiments.

While this may have some applications in reproducibility (but maybe not, if the goal is in fact to detect cherry-picking), I see very little use in repeating the same statistical model on different datasets. Even repeating observations will require additional nuisance parameters and possibly perturb the likelihood and/or posterior to large extents.

a book and three chapters on ABC

Posted in Statistics on January 9, 2019 by xi'an

In connection with our handbook on mixtures being published, here are three chapters I contributed to from the Handbook of ABC, edited by Scott Sisson, Yanan Fan, and Mark Beaumont:

6. Likelihood-free Model Choice, by J.-M. Marin, P. Pudlo, A. Estoup and C.P. Robert

12. Approximating the Likelihood in ABC, by C. C. Drovandi, C. Grazian, K. Mengersen and C.P. Robert

17. Application of ABC to Infer about the Genetic History of Pygmy Hunter-Gatherers Populations from Western Central Africa, by A. Estoup, P. Verdu, J.-M. Marin, C. Robert, A. Dehne-Garcia, J.-M. Cornuet and P. Pudlo