Archive for likelihood-free methods

holistic framework for ABC

Posted in Books, Statistics, University life on April 19, 2019 by xi'an

An AISTATS 2019 paper was recently arXived by Kelvin Hsu and Fabio Ramos. Proposing an ABC method

“…consisting of (1) a consistent surrogate likelihood model that modularizes queries from simulation calls, (2) a Bayesian learning objective for hyperparameters that improves inference accuracy, and (3) a posterior surrogate density and a super-sampling inference algorithm using its closed-form posterior mean embedding.”

While this sales line sounds rather obscure to me, the authors further defend their approach against ABC-MCMC or synthetic likelihood by the points

“that (1) only one new simulation is required at each new parameter θ and (2) likelihood queries do not need to be at parameters where simulations are available.”

using an RKHS approach to approximate the likelihood or the distribution of the summary (statistic) given the parameter (value) θ. Based on the choice of a certain positive definite kernel. (As usual, I do not understand why RKHS would do better than another non-parametric approach, especially since the approach approximates the full likelihood, but I am not a non-parametrician…)

“The main advantage of using an approximate surrogate likelihood surrogate model is that it readily provides a marginal surrogate likelihood quantity that lends itself to a hyper-parameter learning algorithm”

The tolerance ε (and other cyberparameters) are estimated by maximising the approximated marginal likelihood, which happens to be available in the convenient case the prior is an anisotropic Gaussian distribution. For the simulated data in the reference table? But then missing the need for localising the simulations near the posterior? Inference is then conducted by simulating from this approximation. With the common (to RKHS) drawback that the approximation is “bounded and normalized but potentially non-positive”.
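To make the "potentially non-positive" caveat concrete, here is a minimal sketch of a generic conditional mean embedding surrogate, which is not necessarily the exact estimator of Hsu and Ramos. The toy simulator, the Gaussian kernels, and the names `gauss_kernel`, `surrogate_likelihood`, `beta` and `lam` are all assumptions made for illustration. The surrogate is a kernel ridge regression of the closeness k_ε(s, y) on θ, so a query at a new θ needs no simulation at that θ, and nothing forces the resulting values to stay non-negative.

```python
import numpy as np

def gauss_kernel(a, b, scale):
    """Gaussian kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / scale**2)

def surrogate_likelihood(theta_query, y_obs, thetas, stats, eps, beta, lam):
    """Kernel surrogate of the eps-likelihood of y_obs (up to a constant).

    thetas: (n, p) simulated parameters; stats: (n, d) simulated summaries.
    eps is the tolerance (kernel scale on summaries), beta the kernel scale
    on theta, lam the ridge regulariser: all three are hyperparameters.
    """
    n = thetas.shape[0]
    gram = gauss_kernel(thetas, thetas, beta)            # (n, n)
    k_y = gauss_kernel(stats, y_obs[None, :], eps)       # (n, 1) closeness to y_obs
    alpha = np.linalg.solve(gram + n * lam * np.eye(n), k_y)
    return (gauss_kernel(theta_query, thetas, beta) @ alpha).ravel()

rng = np.random.default_rng(0)
thetas = rng.normal(0.0, 2.0, size=(200, 1))             # draws from a Gaussian prior
stats = thetas + rng.normal(0.0, 0.5, size=(200, 1))     # toy simulator (assumption)
y_obs = np.array([1.0])
grid = np.linspace(-4.0, 4.0, 81)[:, None]
q = surrogate_likelihood(grid, y_obs, thetas, stats, eps=0.3, beta=0.5, lam=1e-3)
print("argmax of surrogate:", grid[np.argmax(q), 0], "minimum value:", q.min())
```

Here the three hyperparameters are fixed by hand, whereas the paper learns them by maximising the marginal surrogate likelihood.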

prepaid ABC

Posted in Books, pictures, Statistics, University life on January 16, 2019 by xi'an

Merijn Mestdagh, Stijn Verdonck, Kristof Meers, Tim Loossens, and Francis Tuerlinckx from the KU Leuven, some of whom I met during a visit to its Walloon counterpart Louvain-la-Neuve, proposed and arXived a new likelihood-free approach based on saving simulations on a large scale for future users. Future users interested in the same model. The very same model. This makes the proposal quite puzzling as I have no idea as to when situations with exactly the same experimental conditions, up to the sample size, repeat over and over again. Or even just repeat once. (Some particular settings may accommodate different sample sizes and the same prepaid database, but others as in genetics clearly do not.) I am sufficiently puzzled to suspect I have missed the message of the paper.

“In various fields, statistical models of interest are analytically intractable. As a result, statistical inference is greatly hampered by computational constraints. However, given a model, different users with different data are likely to perform similar computations. Computations done by one user are potentially useful for other users with different data sets. We propose a pooling of resources across researchers to capitalize on this. More specifically, we preemptively chart out the entire space of possible model outcomes in a prepaid database. Using advanced interpolation techniques, any individual estimation problem can now be solved on the spot. The prepaid method can easily accommodate different priors as well as constraints on the parameters. We created prepaid databases for three challenging models and demonstrate how they can be distributed through an online parameter estimation service. Our method outperforms state-of-the-art estimation techniques in both speed (with a 23,000 to 100,000-fold speed up) and accuracy, and is able to handle previously quasi inestimable models.”

I foresee potential difficulties with this proposal, like compelling all future users to rely on the same summary statistics, on the same prior distributions (the “representative amount of parameter values”), and requiring a massive storage capacity. Plus relying at its early stage on the most rudimentary form of an ABC algorithm (although not acknowledged as such), namely the rejection one. When reading the description in the paper, the proposed method indeed selects the parameters (simulated from a prior or a grid) that are producing pseudo-observations that are closest to the actual observations (or their summaries s). The subsample thus constructed is used to derive a (local) non-parametric or machine-learning predictor s=f(θ). From which a point estimator is deduced by minimising in θ a deviance d(s⁰,f(θ)).
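As I read it, the three steps just described can be sketched as follows. This is only a toy version under my own assumptions, not the authors' implementation: the Gaussian simulator, the uniform prior box, the local linear predictor and the squared Euclidean deviance are all stand-ins, and `simulate_summary` and `prepaid_estimate` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_summary(theta, n=50):
    """Toy simulator (assumption): summaries of n draws from N(theta[0], exp(theta[1])^2)."""
    x = rng.normal(theta[0], np.exp(theta[1]), size=n)
    return np.array([x.mean(), np.log(x.std())])

# Prepaid database: parameters drawn once from a prior, stored with their summaries.
thetas = rng.uniform([-3.0, -1.0], [3.0, 1.0], size=(5000, 2))
stats = np.array([simulate_summary(t) for t in thetas])

def prepaid_estimate(s_obs, k=200):
    # Rejection step: keep the k parameters whose summaries are closest to s_obs.
    dist = np.linalg.norm(stats - s_obs, axis=1)
    keep = np.argsort(dist)[:k]
    # Local predictor s = f(theta): a linear least-squares fit on the subsample.
    design = np.column_stack([np.ones(k), thetas[keep]])
    coef, *_ = np.linalg.lstsq(design, stats[keep], rcond=None)   # shape (3, 2)
    intercept, slope = coef[0], coef[1:]
    # Point estimate: minimise ||s_obs - f(theta)||^2, in closed form for a linear f.
    theta_hat, *_ = np.linalg.lstsq(slope.T, s_obs - intercept, rcond=None)
    return theta_hat

s_obs = simulate_summary(np.array([1.0, 0.2]))    # pretend these are the actual data
est = prepaid_estimate(s_obs)
print("estimate:", est)
```

The only costly part is the construction of `thetas` and `stats`, which is exactly what the prepaid method takes out of the clock.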

The paper does not expand much on the theoretical justifications of the approach (including the appendix that covers a formal situation where the prepaid grid conveniently covers the observed statistics). And thus does not explain on which basis confidence intervals should offer nominal coverage for the prepaid method. Instead, the paper runs comparisons with Simon Wood’s (2010) synthetic likelihood maximisation (Ricker model with three parameters), the rejection ABC algorithm (species dispersion trait model with four parameters), while the Leaky Competing Accumulator (with four parameters as well) seemingly enjoys no alternative. Which is strange since the first step of the prepaid algorithm is an ABC step, but I am unfamiliar with this model. Unsurprisingly, in all these cases, given that the simulation has been done prior to the computing time for the prepaid method and not for either synthetic likelihood or ABC, the former enjoys a massive advantage from the start.

“The prepaid method can be used for a very large number of observations, contrary to the synthetic likelihood or ABC methods. The use of very large simulated data sets allows investigation of large-sample properties of the estimator”

To return to the general proposal and my major reservation or misunderstanding, for different experiments, the (true or pseudo-true) value of the parameter will not be the same, I presume, and hence the region of interest [or grid] will differ. While, again, the computational gain is de facto obvious [since the costly production of the reference table is not repeated], and, to repeat myself, makes the comparison with methods that do require a massive number of simulations from scratch massively in favour of the prepaid option, I do not see a convenient way of recycling these prepaid simulations for another setting, that is, when some experimental factors, sample size or collection, or even just the priors, do differ. Again, I may be missing the point, especially in a specific context like repeated psychological experiments.

While this may have some applications in reproducibility (but maybe not, if the goal is in fact to detect cherry-picking), I see very little use in repeating the same statistical model on different datasets. Even repeating observations will require additional nuisance parameters and possibly perturb the likelihood and/or posterior to large extents.

a book and three chapters on ABC

Posted in Statistics on January 9, 2019 by xi'an

In connection with our handbook on mixtures being published, here are three chapters I contributed to from the Handbook of ABC, edited by Scott Sisson, Yanan Fan, and Mark Beaumont:

6. Likelihood-free Model Choice, by J.-M. Marin, P. Pudlo, A. Estoup and C.P. Robert

12. Approximating the Likelihood in ABC, by C. C. Drovandi, C. Grazian, K. Mengersen and C.P. Robert

17. Application of ABC to Infer about the Genetic History of Pygmy Hunter-Gatherers Populations from Western Central Africa, by A. Estoup, P. Verdu, J.-M. Marin, C. Robert, A. Dehne-Garcia, J.-M. Cornuet and P. Pudlo

Implicit maximum likelihood estimates

Posted in Statistics on October 9, 2018 by xi'an

An ‘Og’s reader pointed me to this paper by Li and Malik, which made it to arXiv after not making it to NIPS. While the NIPS reviews were not particularly informative and strongly discordant, the authors point out in the comments that they are available for the sake of promoting discussion. (As made clear in earlier posts, I am quite supportive of this attitude! Disclaimer: I was not involved in an evaluation of this paper, neither for NIPS nor for another conference or journal!!) Although the paper does not seem to mention ABC in the setting of implicit likelihoods and generative models, there is a reference to the early (1984) paper by Peter Diggle and Richard Gratton that is often seen as the ancestor of ABC methods. The authors point out numerous issues with solutions proposed for parameter estimation in such implicit models. For instance, for GANs, they signal that “minimizing the Jensen-Shannon divergence or the Wasserstein distance between the empirical data distribution and the model distribution does not necessarily minimize the same between the true data distribution and the model distribution.” (Not mentioning the particular difficulty with Bayesian GANs.) Their own solution is the implicit maximum likelihood estimator, which picks the value of the parameter θ bringing a simulated sample the closest to the observed sample. Closest in the sense of the Euclidean distance between both samples. Or between the minimum of several simulated samples and the observed sample. (The modelling seems to imply the availability of n>1 observed samples.) They advocate using a stochastic gradient descent approach for finding the optimal parameter θ which presupposes that the dependence between θ and the simulated samples is somewhat differentiable. (And this does not account for using a min, which would make differentiation close to impossible.) 
The paper then meanders in a lengthy discussion as to whether maximising the likelihood makes sense, with a rather naïve view on why using the empirical distribution in a Kullback-Leibler divergence does not make sense! What does not make sense is considering the finite sample approximation to the Kullback-Leibler divergence with the true distribution in my opinion.
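To fix ideas on the estimator itself, here is a minimal sketch in the spirit of the implicit maximum likelihood estimator, not Li and Malik's actual algorithm. It assumes a toy location model x = θ + z with Gaussian white noise, a pool of 50 simulated points per step, and a stochastic gradient step that holds the nearest-neighbour matching fixed, which is one way around differentiating the min. The `generator` function and all tuning constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def generator(z, theta):
    """Toy implicit model (assumption): x = G(z, theta) = theta + z."""
    return theta + z

x_obs = rng.normal(2.0, 1.0, size=200)     # observed sample, true theta equal to 2

theta, lr = -1.0, 0.05
for step in range(500):
    z = rng.normal(size=50)                # fresh white noise at each step
    x_sim = generator(z, theta)            # pool of simulated points
    # Nearest simulated point to each observation (the min over simulated samples).
    nearest = np.abs(x_obs[:, None] - x_sim[None, :]).argmin(axis=1)
    # Gradient of the mean squared distance with the matching held fixed;
    # dG/dtheta = 1 for this particular generator.
    grad = 2.0 * np.mean(x_sim[nearest] - x_obs)
    theta -= lr * grad

print("implicit MLE sketch:", theta, "sample mean:", x_obs.mean())
```

In this symmetric toy case the iterates settle around the sample mean, which is also the genuine maximum likelihood estimate, so the example says nothing about harder models.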

ABC in print

Posted in Books, pictures, Statistics, University life on September 5, 2018 by xi'an

The CRC Press Handbook of ABC is now out, after a rather long delay [the first version of our model choice chapter was written in 2015!] due to some late contributors. Which is why I did not spot it at JSM 2018. As announced a few weeks ago, our Handbook of Mixture Analysis is soon to be published as well. (Not that I necessarily advocate the individual purchase of these costly volumes!, especially given most chapters are available on-line.)

ABC for vampires

Posted in Books, pictures, Statistics, University life on September 4, 2018 by xi'an

Ritabrata Dutta (Warwick), along with coauthors including Anto Mira, published last week a paper in Frontiers in Physiology about using ABC for deriving the posterior distribution of the parameters of a dynamic blood (platelets) deposition model constructed by Bastien Chopard, the second author. While based on only five parameters, the model does not enjoy a closed form likelihood and even the simulation of a new platelet deposit takes about 10 minutes. The paper uses the simulated annealing ABC version, due to Albert, Künsch, and Scheidegger (2014), which relies on a sequence of Metropolis kernels, associated with a decreasing sequence of tolerances, and claims better efficiency at reaching a stable solution. It also relies on the package abcpy, written by Ritabrata Dutta, in Python, for various aspects of ABC analysis. One feature of interest is the use of 24 summary statistics to conduct the inference on the 5 model parameters, a ratio of 24 to 5 that possibly gets improved by a variable selection tool such as random forests. Which would also avoid the choice of a specific loss function called the Bhattacharyya distance (which sounds like entropy distance for the normal case).

Bayesian gan [gan style]

Posted in Books, pictures, Statistics, University life on June 26, 2018 by xi'an

In their paper Bayesian GANS, arXived a year ago, Saatchi and Wilson consider a Bayesian version of generative adversarial networks, putting priors on both the model and the discriminator parameters. While the prospect seems somewhat remote from genuine statistical inference, if the following statement is representative

“GANs transform white noise through a deep neural network to generate candidate samples from a data distribution. A discriminator learns, in a supervised manner, how to tune its parameters so as to correctly classify whether a given sample has come from the generator or the true data distribution. Meanwhile, the generator updates its parameters so as to fool the discriminator. As long as the generator has sufficient capacity, it can approximate the cdf inverse-cdf composition required to sample from a data distribution of interest.”

I figure the concept can also apply to a standard statistical model, where x=G(z,θ) rephrases the distributional assumption x~F(x;θ) via a white noise z. This makes resorting to a prior distribution on θ more relevant in the sense of using potential prior information on θ (although the successes of probabilistic numerics show formal priors can be used on purely numerical ground).
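As a small illustration of this rephrasing, here is a sketch where the "generator" is nothing but the cdf inverse-cdf composition: Gaussian white noise z is pushed through Φ and then through the quantile function of an Exponential distribution with rate θ. The choice of the exponential model and of θ = 2 is mine, purely for illustration.

```python
import math
import numpy as np

def G(z, theta):
    """Map Gaussian white noise z to Exponential(rate=theta) draws."""
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))   # Phi(z), in (0, 1)
    return -np.log1p(-u) / theta                                    # F^{-1}(u; theta)

rng = np.random.default_rng(3)
x = G(rng.normal(size=100_000), theta=2.0)
print("empirical mean:", x.mean(), "versus 1/theta:", 0.5)
```

A GAN generator replaces this explicit composition with a neural network, at which point θ loses its statistical meaning and the prior becomes a formal one.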

The “posterior distribution” that is central to the notion of Bayesian GANs is however unorthodox in that the distribution is associated with the following conditional posteriors
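For reference, as they appear in Saatchi and Wilson's paper (to the best of my reading, and in their notation rather than mine: θ_g and θ_d are the generator and discriminator parameters, α_g and α_d the prior hyperparameters, n_g and n_d the numbers of generated and observed points in a mini-batch, so D(x,θ) below is written D(x;θ_d)), the two conditionals are

```latex
p(\theta_g \mid z, \theta_d) \propto
  \Big( \prod_{i=1}^{n_g} D\big(G(z^{(i)};\theta_g);\theta_d\big) \Big)\,
  p(\theta_g \mid \alpha_g)
  \tag{1}

p(\theta_d \mid z, X, \theta_g) \propto
  \prod_{i=1}^{n_d} D\big(x^{(i)};\theta_d\big)
  \times \prod_{i=1}^{n_g} \Big( 1 - D\big(G(z^{(i)};\theta_g);\theta_d\big) \Big)
  \times p(\theta_d \mid \alpha_d)
  \tag{2}
```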

where D(x,θ) is the “discriminator”, that is, in GAN lingo, the probability to be allocated to the “true” data generating mechanism rather than to the one associated with G(·,θ). The generative conditional posterior (1) then aims at fooling the discriminator, i.e. favours generative parameter values that raise the probability of wrong allocation of the pseudo-data. The discriminative conditional posterior (2) is a standard Bayesian posterior based on the original sample and the generated sample. The authors then iteratively sample from these posteriors, effectively implementing a two-stage Gibbs sampler.

“By iteratively sampling from (1) and (2) at every step of an epoch one can, in the limit, obtain samples from the approximate posteriors over [both sets of parameters].”

What worries me about this approach is that it just cannot work, in the sense that (1) and (2) cannot be compatible conditional (posterior) distributions. There is no joint distribution for which (1) and (2) would be the conditionals, since the pseudo-data appears in D for (1) and (1-D) in (2). This means that the convergence of a Gibbs sampler is at best to a stationary σ-finite measure. And hence that the meaning of the chain is delicate to ascertain… Am I missing any fundamental point?! [I checked the reviews on NIPS webpage and could not spot this issue being raised.]