Archive for synthetic likelihood

can we trust computer simulations? [day #2]

Posted in Books, pictures, Statistics, Travel, University life on July 13, 2015 by xi'an

“Sometimes the models are better than the data.” G. Krinner

Second day at the conference on building trust in computer simulations. Starting with a highly debated issue, climate change projections, since so many criticisms are addressed at climate models as being not only wrong but also unverifiable. And uncheckable. As explained by Gerhard Krinner, the IPCC has developed methodologies to compare models and evaluate predictions. However, from what I understood, this validation does not say anything about the future, which is the part of the predictions that matters. And which is attacked by critics and feeds climate skeptics, because it is so easy to argue against the homogeneity of the climate evolution and for “what you’ve seen is not what you’ll get”! (Even though climate skeptics are the least likely to use this time-heterogeneity argument, being convinced as they are of the lack of human impact on the climate.) The second talk was by Viktoria Radchuk, about validation in ecology. Defined here as a test of predictions against independent data (and designs). And mentioning Simon Wood’s synthetic likelihood as the Bayesian reference for conducting model choice (as a synthetic likelihood ratio). I had never thought of this use (found in Wood’s original paper) for synthetic likelihood, and I feel a bit queasy about using a synthetic likelihood ratio as a genuine likelihood ratio. Which led to a lively discussion at the end of her talk. The next talk was about validation in economics by Matteo Richiardi, who discussed state-space models where the hidden state is observed through a summary statistic, a perfect playground for ABC! But Matteo opted instead for a non-parametric approach that seems to increase imprecision and that I have never seen used in state-space models. The last part of the talk was about non-ergodic models, for which checking for validity becomes much more problematic, in my opinion. Unless one manages to gather multiple observations of the non-ergodic path. Nicole Saam concluded this “Validation in…” morning with Validation in Sociology. With a more pessimistic take on the possibility of finding a falsifying strategy, because of the vague nature of sociological models. For which data can never be fully informative. She illustrated the issue with an EU negotiation analysis, where most hypotheses could hardly be tested.

“Bayesians persist with poor examples of randomness.” L. Smith

“Bayesians can be extremely reasonable.” L. Smith

The afternoon session was dedicated to methodology, mostly statistics! Andrew Robinson started with a talk on (frequentist) model validation, called splitters and lumpers. Illustrated by a forest growth model. He went through traditional hypothesis tests like Neyman-Pearson’s that try to split between samples. And (bio)equivalence tests that take the difference as the null. Using his equivalence R package. Then Leonard Smith took over [in a literal way!] from a sort-of-Bayesian perspective, in joint work with Jim Berger and Gary Rosner on pragmatic Bayes, which was mostly negative about Bayesian modelling. Introducing (to me) the compelling notion of structural model error as a representation of the inadequacy of the model. With illustrations from weather and climate models. His criticism of the Bayesian approach is that it cannot be holistic while pretending to be [my wording]. And that it is inadequate for measuring model inadequacy, to the point of making prior choice meaningless. Funny enough, he went back to the ball-dropping experiment David Higdon discussed at one JSM I attended a while ago, with the unexpected outcome that one ball did not make it to the bottom of the shaft. A more positive side was that posteriors are useful models but should not be interpreted from a probabilistic perspective. “Move beyond probability” was his final message. (For most of the talk, I mistook P(BS), the probability of a big surprise, for something else…) This was certainly the most provocative talk of the conference and the discussion could have gone on for the rest of the day! To some extent, Lenny was deliberately provocative in piling responsibility upon the Bayesian’s head for being overconfident and for not accounting for the physicist’s limitations in modelling the phenomenon of interest. The next talk was by Edward Dougherty on methods used in biology. He separated within-model uncertainty from outside-model inadequacy. The within-model part is mostly easy to agree upon. Even though difficulties in estimating parameters create uncertainty classes of models. Especially in what is a small-data discipline. He analysed the impact of machine-learning techniques like classification, arguing they are useless without prior knowledge. And argued in favour of the Bayesian minimum mean square error estimator. Which can also lead to a classifier. And to experimental design. (Using MSE seems rather reductive when facing high-dimensional parameters.) The last talk of the day was by Nicolas Becu, a geographer, with a surprising approach to validation via stakeholders. A priori not too enticing a name! The discussion was of a more philosophical nature, going back to (re)define validation against reality and imperfect models. And including social aspects of validation, e.g., reality being socially constructed. This led to the stakeholders, because a model is then a shared representation. Nicolas illustrated the construction by simulation “games” of a collective model in a community of Thai farmers and in a group of water users.

In a rather unique fashion, we also had an evening discussion on points we shared and points we disagreed upon. After dinner (and wine), which did not help, I fear! Bill Oberkampf mentioned the use of manufactured solutions to check code, which seemed very much related to physics. But then we got mired in the necessity of distinguishing between verification and validation. Which sounded far too engineering-like to me. Maybe because I do not usually integrate coding errors and algorithmic errors into my reasoning (verification)… Although sharing code and making it available makes a big difference. Or maybe because considering that all models are wrong is not really part of my methodology either (validation). This part ended up in a fairly pessimistic conclusion on the lack of trust in most published articles. At least in the biological sciences.

Hamiltonian ABC

Posted in Books, pictures, Statistics, University life on March 13, 2015 by xi'an

On Monday, Ed Meeds, Robert Leenders, and Max Welling (from Amsterdam) arXived a paper entitled Hamiltonian ABC. Before looking at the paper in any detail, I got puzzled by this association of antagonistic terms, since ABC is intended for complex and mostly intractable likelihoods, while Hamiltonian Monte Carlo requires a lot from the target, in order to compute gradients and Hessians… [Warning: some graphs on pages 13-14 may be harmful to your printer!]

Somewhat obviously (ex-post!), the paper suggests using Hamiltonian dynamics on ABC approximations of the likelihood. They compare a Gaussian kernel version

\frac{1}{S}\sum_{s=1}^S \varphi(y^\text{obs}-x_s(\theta);\epsilon^2)

with the synthetic Gaussian likelihood version of Wood (2010)

\varphi(y^\text{obs}-\mu(\theta);\sigma(\theta)^2+\epsilon^2)

where both mean and variance are estimated from the simulated data. If ε is taken as an external quantity and driven to zero, the second approach is much more stable. But… ε is never driven to zero in ABC, nor fixed at ε=0.37: it is instead considered as a kernel bandwidth and hence estimated from the simulated data. Hence ε is commensurate with σ(θ). And this makes me wonder at the relevance of the conclusion that synthetic is better than kernel for Hamiltonian ABC. More globally, I wonder at the relevance of better simulating from a still approximate target when the true goal is to better approximate the genuine posterior.
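
For concreteness, here is a minimal Python sketch of the two estimators being compared, with a toy Normal simulator standing in for an intractable model; the simulator, the sample sizes, and the value of ε are illustrative assumptions rather than anything taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulator(theta, size, rng):
        # toy scalar simulator standing in for an intractable model (an assumption)
        return rng.normal(theta, 1.0, size).mean()

    def gaussian(u, var):
        # Gaussian density with mean zero and variance var, evaluated at u
        return np.exp(-0.5 * u**2 / var) / np.sqrt(2 * np.pi * var)

    def kernel_abc_lik(theta, y_obs, S, eps, rng):
        # (1/S) sum_s phi(y_obs - x_s(theta); eps^2): average of Gaussian kernels
        xs = np.array([simulator(theta, 20, rng) for _ in range(S)])
        return gaussian(y_obs - xs, eps**2).mean()

    def synthetic_lik(theta, y_obs, S, eps, rng):
        # phi(y_obs - mu(theta); sigma(theta)^2 + eps^2): Gaussian fit to the simulations
        xs = np.array([simulator(theta, 20, rng) for _ in range(S)])
        return gaussian(y_obs - xs.mean(), xs.var(ddof=1) + eps**2)

    y_obs = 0.3
    for theta in (0.0, 0.3, 0.6):
        print(theta,
              kernel_abc_lik(theta, y_obs, S=100, eps=0.1, rng=rng),
              synthetic_lik(theta, y_obs, S=100, eps=0.1, rng=rng))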

Some of the paper covers separate issues like handling gradients by finite differences à la Spall [if you can afford it!] and incorporating the random generator as part of the Markov chain. And using S common random numbers in computing the gradients for all values of θ. (Although I am not certain all random generators can be represented as a deterministic transform of a parameter θ and of a fixed number of random uniforms. But the authors may consider a random number of random uniforms when they represent their random generators as a deterministic transform of a parameter θ and of the random seed. I am also uncertain about the distinction between common, sticky, and persistent random numbers!)
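
As a minimal sketch of how common random numbers interact with finite-difference gradients, the code below freezes the random seed on both sides of a central difference of a kernel-ABC log-likelihood; the toy simulator, tolerance and step size are assumptions, and this is only meant to illustrate the variance-reduction idea, not the scheme of the paper.

    import numpy as np

    def simulator(theta, size, rng):
        # toy simulator; the seed is what gets held fixed across theta values
        return rng.normal(theta, 1.0, size).mean()

    def log_abc_lik(theta, y_obs, S, eps, seed):
        # kernel ABC log-likelihood computed from a frozen seed (common random numbers)
        rng = np.random.default_rng(seed)
        xs = np.array([simulator(theta, 20, rng) for _ in range(S)])
        k = np.exp(-0.5 * (y_obs - xs) ** 2 / eps**2) / np.sqrt(2 * np.pi * eps**2)
        return np.log(k.mean())

    def fd_gradient(theta, y_obs, S=200, eps=0.1, seed=42, h=1e-3):
        # central finite difference; using the same seed on both sides keeps the
        # simulation noise common, so the difference is not swamped by Monte Carlo error
        up = log_abc_lik(theta + h, y_obs, S, eps, seed)
        down = log_abc_lik(theta - h, y_obs, S, eps, seed)
        return (up - down) / (2 * h)

    print(fd_gradient(theta=0.0, y_obs=0.3))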

Inference for stochastic simulation models by ABC

Posted in Books, Statistics, University life on February 13, 2015 by xi'an

Hartig et al. published a paper in Ecology Letters a while ago (2011), entitled “Statistical inference for stochastic simulation models – theory and application”, which is mostly about ABC. (Florian Hartig pointed out the paper to me in a recent blog comment about my discussion of the early parts of Gutmann and Corander’s paper.) The paper is largely a tutorial and it reminds the reader of related methods like indirect inference and the method of moments. The authors also insist on presenting ABC as a particular case of likelihood approximation, whether non-parametric or parametric. Making connections with pseudo-likelihood and pseudo-marginal approaches. And including a discussion of the possible misfit of the assumed model, handled by an external error model. And also introducing the notion of an informal likelihood (which could have been nicely linked with empirical likelihood). A last class of approximations presented therein is called rejection filters and reminds me very much of Ollie Ratmann’s papers.

“Our general aim is to find sufficient statistics that are as close to minimal sufficiency as possible.” (p.819)

As in other ABC papers, and as often reported on this blog, I find the stress on sufficiency a wee bit too heavy as those models calling for approximation almost invariably do not allow for any form of useful sufficiency. Hence the mathematical statistics notion of sufficiency is mostly useless in such settings.

“A basic requirement is that the expectation value of the point-wise approximation of p(Sobs|φ) must be unbiased” (p.823)

As stated above, the paper is mostly in tutorial mode, for instance explaining what MCMC and SMC methods are. There is however a final and interesting discussion section on the impact of estimating the likelihood function at different values of the parameter. The authors seem to focus solely on pseudo-marginal results to validate this approximation, hence on unbiasedness, which does not work for most ABC approaches that I know of. Nor for the approximations listed in the survey. Actually, it would be quite beneficial to devise a cheap tool to assess the bias or extra variation due to the use of approximate techniques like ABC… A sort of 21st Century bootstrap?!
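
To make the unbiasedness requirement concrete, here is a minimal Python sketch of a pseudo-marginal Metropolis-Hastings scheme where the likelihood is replaced by an unbiased kernel estimate of the ε-smoothed summary likelihood; the simulator, summary, tolerance and flat prior are all illustrative assumptions, and the chain targets the corresponding ABC posterior rather than the true posterior.

    import numpy as np

    rng = np.random.default_rng(1)

    def summaries(data):
        # summary statistic: the sample mean (an illustrative choice)
        return data.mean()

    def simulate(phi, n, rng):
        # stand-in stochastic simulator with parameter phi
        return rng.normal(phi, 1.0, n)

    def lik_hat(phi, s_obs, S, eps, rng):
        # unbiased pointwise estimate of the eps-smoothed likelihood p_eps(s_obs | phi):
        # an average of Gaussian kernels over S fresh simulations
        sims = np.array([summaries(simulate(phi, 50, rng)) for _ in range(S)])
        return np.mean(np.exp(-0.5 * (s_obs - sims) ** 2 / eps**2))

    def pseudo_marginal_mh(s_obs, n_iter=2000, S=30, eps=0.05, step=0.3):
        phi, L = 0.0, lik_hat(0.0, s_obs, S, eps, rng)
        chain = []
        for _ in range(n_iter):
            prop = phi + step * rng.normal()
            L_prop = lik_hat(prop, s_obs, S, eps, rng)  # estimate recomputed at the proposal only
            # flat prior and symmetric proposal: accept with the ratio of likelihood *estimates*
            if rng.random() < L_prop / max(L, 1e-300):
                phi, L = prop, L_prop                   # keep the estimate attached to the current state
            chain.append(phi)
        return np.array(chain)

    chain = pseudo_marginal_mh(s_obs=0.2)
    print(chain.mean(), chain.std())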

Bayesian optimization for likelihood-free inference of simulator-based statistical models

Posted in Books, Statistics, University life on January 29, 2015 by xi'an

Michael Gutmann and Jukka Corander arXived this paper two weeks ago. I read part of it (mostly the extended introduction) on the flight from Edinburgh to Birmingham this morning. I find the reflection it contains on the nature of the ABC approximation quite deep and thought-provoking. Indeed, the major theme of the paper is to visualise ABC (which is admittedly shorter than “likelihood-free inference of simulator-based statistical models”!) as a regular computational method based on an approximation of the likelihood function at the observed value, yobs. This includes for example Simon Wood’s synthetic likelihood (Wood incidentally gave a talk on his method while I was in Oxford). As well as non-parametric versions. In both cases, the approximations are based on repeated simulations of pseudo-datasets for a given value of the parameter θ, either to produce an estimate of the mean and covariance of the sampling model as a function of θ or to construct genuine estimates of the likelihood function. As assumed by the authors, this calls for a low-dimensional θ. This approach actually allows for the inclusion of the synthetic approach as a lower bound on a non-parametric version.
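
As a reminder of what the synthetic approach entails, here is a minimal Python sketch of a Wood-style synthetic log-likelihood built from repeated simulations of a two-dimensional summary; the simulator, summaries and observed values are illustrative assumptions, not the examples of the paper.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(2)

    def simulate_summaries(theta, n_rep, rng):
        # stand-in simulator returning a 2-dimensional summary per replicate (an assumption)
        data = rng.normal(theta, 1.0, (n_rep, 100))
        return np.column_stack([data.mean(axis=1), data.var(axis=1)])

    def synthetic_loglik(theta, s_obs, n_rep=200):
        # Wood-style synthetic log-likelihood: fit a Gaussian to the simulated summaries
        # and evaluate its log-density at the observed summaries
        S = simulate_summaries(theta, n_rep, rng)
        mu = S.mean(axis=0)
        Sigma = np.cov(S, rowvar=False)
        return multivariate_normal(mean=mu, cov=Sigma).logpdf(s_obs)

    s_obs = np.array([0.1, 1.05])   # observed summaries (illustrative values)
    for theta in (-0.5, 0.0, 0.1, 0.5):
        print(theta, synthetic_loglik(theta, s_obs))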

In the case of Wood’s synthetic likelihood, two questions came to me:

  • the estimation of the mean and covariance functions is usually not smooth because new simulations are required for each new value of θ. I wonder how frequent the case is where the same basic random variates can be used for all values of θ, because it would then give a smooth version of the above. In the other cases, provided the dimension is manageable, a Gaussian process could first be fitted before using the approximation. Or any other form of regularization (see the sketch after this list).
  • no mention is made [in the current paper] of the impact of the parametrization of the summary statistics. Once again, a Box-Cox transform could be applied to each component of the summary for a closer proximity to the normal distribution.
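
On the first point, one possible form of regularization is to smooth noisy evaluations of the synthetic log-likelihood over a grid of θ values with a Gaussian process; the sketch below uses scikit-learn and a purely artificial noisy curve standing in for re-simulated synthetic log-likelihoods, so everything in it is an assumption made for illustration.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(3)

    def noisy_synthetic_loglik(theta):
        # stand-in for a synthetic log-likelihood recomputed from fresh simulations:
        # a smooth curve plus Monte Carlo noise (purely illustrative)
        return -0.5 * (theta - 0.3) ** 2 / 0.2 + 0.15 * rng.normal()

    # evaluate the noisy surface on a coarse grid of theta values
    thetas = np.linspace(-1.5, 2.0, 30).reshape(-1, 1)
    loglik = np.array([noisy_synthetic_loglik(t[0]) for t in thetas])

    # fit a GP with an RBF kernel plus a white-noise term to absorb the simulation noise
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5) + WhiteKernel(0.05),
                                  normalize_y=True)
    gp.fit(thetas, loglik)

    # the posterior mean of the GP gives a smooth surrogate of the log-likelihood surface
    grid = np.linspace(-1.5, 2.0, 200).reshape(-1, 1)
    smooth, sd = gp.predict(grid, return_std=True)
    print(grid[np.argmax(smooth), 0])   # argmax of the smoothed surface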

When reading about a non-parametric approximation to the likelihood (based on the summaries), the questions I scribbled on the paper were:

  • estimating a complete density when the estimate is only used at the single point yobs could possibly be superseded by a more efficient approach (see the sketch after this list).
  • the authors study a kernel that is a function of the difference or distance between the summaries and which is maximal at zero. This is indeed rather frequent in the ABC literature, but does it impact the convergence properties of the kernel estimator?
  • the estimation of the tolerance, which happens to be a bandwidth in that case, does not appear to be addressed in this paper, which could explain the very low acceptance probabilities mentioned therein.
  • I am lost as to why lower bounds on likelihoods are relevant here. Unless this is intended for ABC maximum likelihood estimation.
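
To illustrate the first and third points, here is a minimal Python sketch of a non-parametric likelihood estimate obtained by fitting a kernel density estimate to simulated summaries (with a default, data-driven bandwidth in lieu of a tuned tolerance) and evaluating it at the observed summaries only; the simulator and summaries are again illustrative assumptions.

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(4)

    def simulate_summaries(theta, n_rep, rng):
        # stand-in simulator returning 2-dimensional summaries (illustrative)
        data = rng.normal(theta, 1.0, (n_rep, 100))
        return np.column_stack([data.mean(axis=1), data.var(axis=1)])

    def kde_loglik(theta, s_obs, n_rep=500):
        # fit a kernel density estimate to the simulated summaries (Scott's-rule
        # bandwidth, i.e. a data-driven tolerance) and read it off at s_obs only
        sims = simulate_summaries(theta, n_rep, rng)
        kde = gaussian_kde(sims.T)           # gaussian_kde expects (dim, n) input
        return np.log(kde(s_obs)[0])

    s_obs = np.array([0.1, 1.05])
    for theta in (-0.5, 0.0, 0.1, 0.5):
        print(theta, kde_loglik(theta, s_obs))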

Gutmann and Corander also comment on the first point, through the cost of producing a likelihood estimator. They therefore suggest resorting to regression and avoiding regions of low estimated likelihood. And relying on Bayesian optimisation. (Hopefully to be commented upon later.)

Bayesian indirect inference [a response]

Posted in Books, Statistics, Travel, University life on February 18, 2014 by xi'an

This Bayesian indirect inference paper by Chris Drovandi and Tony Pettitt was discussed on the ‘Og two weeks ago and Chris sent me the following comments.

Unsurprisingly, the performance of ABC when comparing true data of size n with synthetic data of size m>n is not great. However, there exists another way of reducing the variance in the synthetic data, namely by repeating simulations of samples of size n and averaging the indicators of proximity, resulting in a frequency rather than a 0-1 estimator. See e.g. Del Moral et al. (2009). In this sense, increasing the computing power reduces the variability of the ABC approximation. (And I thus fail to see the full relevance of Result 1.)

Taking the average of the indicators from multiple simulations will reduce the variability of the estimated ABC likelihood but, because it is still only an unbiased estimate, it will not alter the target and will not improve the ABC approximation (Andrieu and Roberts 2009).  It will only have the effect of improving the mixing of MCMC ABC.  Result 1 is used to contrast ABC II and BIL, as they behave quite differently as n is increased.
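
The contrast between the 0-1 indicator and its averaged, frequency version can be written down in a few lines; the sketch below is a toy illustration (Normal simulator, mean summary, arbitrary tolerance), not the setting of the paper, and only shows that averaging keeps the same expectation while shrinking the variance.

    import numpy as np

    rng = np.random.default_rng(5)

    def summary(data):
        return data.mean()

    def simulate(theta, n, rng):
        # stand-in generative model for a dataset of the same size n as the observations
        return rng.normal(theta, 1.0, n)

    def abc_indicator(theta, s_obs, n, eps, rng):
        # standard 0-1 ABC kernel: accept if the simulated summary is within eps of s_obs
        return float(abs(summary(simulate(theta, n, rng)) - s_obs) < eps)

    def abc_frequency(theta, s_obs, n, eps, m, rng):
        # average of m independent indicators: same expectation (same ABC target),
        # variance divided by m, hence better mixing inside an MCMC-ABC scheme
        return np.mean([abc_indicator(theta, s_obs, n, eps, rng) for _ in range(m)])

    s_obs, n, eps = 0.2, 50, 0.05
    print(abc_indicator(0.2, s_obs, n, eps, rng))             # 0.0 or 1.0
    print(abc_frequency(0.2, s_obs, n, eps, m=100, rng=rng))  # a frequency in [0, 1]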

The authors make several uniqueness assumptions that I find somewhat unclear. While assuming that the MLE for the auxiliary model is unique could make sense (Assumption 2), I do not understand the corresponding indexing of this estimator (of the auxiliary parameter) on the generating (model) parameter θ. It should only depend on the generated/simulated data x. The notion of a noisy mapping is just confusing to me.

The dependence on θ is a little confusing, I agree (especially in the context of ABC II methods).  It becomes clearer in the context of BIL.  As n goes to infinity, the effect of the simulated data is removed and we obtain the function φ(θ) (so we need to remember which θ simulated the data), which is referred to as the mapping or binding function in the II literature.  If we somehow knew the binding function, BIL would proceed straightforwardly.  But of course we don’t in practice, so we try to estimate it via simulated data (which, for computational reasons, needs to be a finite sample) from the true model based on θ.  Thus we obtain a noisy estimate of the mapping.  One way forward might be to fit some (non-parametric?) regression model to smooth out the noise and try to recover the true mapping (without ever taking n to infinity) and run a second BIL with this estimated mapping.  I plan to investigate this in future work.
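
A possible rendering of this suggestion, on a toy example: simulate data at a grid of θ values, compute the (here analytic) Gaussian auxiliary MLE to get noisy evaluations of the binding function, and smooth them with a low-order polynomial regression; the gamma generative model, the auxiliary choice and the polynomial smoother are all assumptions made for illustration, not the models of the paper.

    import numpy as np

    rng = np.random.default_rng(6)

    def simulate(theta, n, rng):
        # stand-in generative model; only simulation is assumed available
        return rng.gamma(shape=np.exp(theta), scale=1.0, size=n)

    def aux_mle(x):
        # Gaussian auxiliary model: its MLE is analytic (sample mean and log sd)
        return np.array([x.mean(), np.log(x.std())])

    # noisy evaluations of the binding function phi(theta) on a grid
    thetas = np.linspace(-1.0, 1.0, 40)
    phis = np.array([aux_mle(simulate(t, n=200, rng=rng)) for t in thetas])

    # smooth each component of the noisy mapping with a low-order polynomial regression
    # (one simple choice among many; a non-parametric smoother would work as well)
    coefs = [np.polyfit(thetas, phis[:, j], deg=3) for j in range(phis.shape[1])]

    def binding(t):
        # smoothed binding function: one polynomial per auxiliary component
        return np.array([np.polyval(c, t) for c in coefs])

    print(binding(0.0))   # smoothed estimate of phi(0), to be plugged into a second BIL pass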

The assumption that the auxiliary score function at the auxiliary MLE for the observed data and for a simulated dataset is unique (Assumption 3) proceeds from the same spirit. I however fail to see why it matters so much. If the auxiliary MLE is the result of a numerical optimisation algorithm, the numerical algorithm may return local modes. This only adds to the approximate effect of the ABC II schemes.

The optimiser failing to find the MLE (returning a local mode) is certainly an issue shared by all BII methods, apart from ABC IS (which only requires one optimisation, so more effort to find the MLE can be applied there).  Assuming the optimiser can obtain the MLE, I think the uniqueness assumptions make sense.  They basically say that, for a particular simulated dataset, we would like a unique value of the ABC discrepancy function.

Given that the paper does not produce convergence results for those schemes (unless the auxiliary model contains the genuine model), such theoretical assumptions do not feel that necessary.

Actually, the ABC II methods will never converge to the true posterior (in general) due to lack of sufficiency.  This is even the case if the true model is a special case of the auxiliary model! (in which case BIL can converge to the true posterior)

The paper uses normal mixtures as an auxiliary model: the multimodality of this model should not be such a hindrance (and reordering is transparent, i.e., it does not “reduce the flexibility of the auxiliary model” nor “increase the difficulty of implementation”, as stated on p.16).

Thanks for your comment.  I need to think about this more as I am not an expert on mixture modelling.  The standard EM algorithm in Matlab does not apply any ordering to the parameters of the components and uses a random start.  Thus it can return any of the multiple MLEs on offer, so the ABC IP will not work here.  So from my point of view, any alternative will increase the difficulty of implementation as it means I cannot use the standard software.  Especially considering I can apply any other BII method without worrying about the non-unique MLE.

The paper concludes from a numerical study that the Bayesian indirect inference of Gallant and McCulloch (2009) is superior. Which simply replaces the true likelihood with the auxiliary model likelihood evaluated at the auxiliary MLE derived from a simulated dataset. (This is somehow similar to our use of the empirical likelihood in the PNAS paper.) It is however moderated by the cautionary provision that “the auxiliary model [should] describe the data well”. As for empirical likelihood, I would suggest resorting to this Bayesian indirect inference as a benchmark, providing a quick if possibly dirty reference against which to test more elaborate ABC schemes. Or other approximations, like empirical likelihood or Wood’s synthetic likelihood.

Unfortunately the methods are not quick (apart from ABC IS when the scores are analytic), but good approximations can be obtained.  The majority of Bayesian methods that deal with intractable likelihoods do not target the true posterior (there are a couple of exceptions in special cases) and thus also suffer from some dirtiness, and BII does not escape from that.  But, if a reasonable auxiliary model can be found, then I would suggest that (at least one of the) BII methods will be competitive.

On reflection, for BIL it is not necessary for the auxiliary model to fit the data, since the generative model being proposed may be mis-specified and also not fit the data well.  BIL needs an auxiliary model that mimics the likelihood of the generative model well for values of θ in non-negligible posterior regions.  For ABC II, we are simply looking for a good summarisation of the data.  Therefore it would seem useful if the auxiliary model did fit the data well.  Note this process is independent of the generative model being proposed.  Therefore the auxiliary model would be the same regardless of the chosen generative model.  Very different considerations indeed.

Inspired by a discussion with Anthony Lee, it appears that the (Bayesian version of) synthetic likelihood you mentioned is actually also a BIL method, but one where the auxiliary model is applied to the summary-statistic likelihood rather than the full-data likelihood.  The synthetic likelihood is nice from a numerical/computational point of view as the MLE of the auxiliary model is analytic.

MCMSki IV [day 2.5]

Posted in Mountains, pictures, Statistics, University life on January 8, 2014 by xi'an

Despite a good rest during the ski break, my cold did not go away (no magic left in this world!) and I thus had a low attention span during the Bayesian statistics and population genetics session: while Jukka Corander mentioned the improvement brought by our AMIS algorithm, I had difficulties getting the nature of the model, if only because he used a blackboard-like font that made math symbols too tiny to read. (Nice font, otherwise!) Daniel Lawson (of vomiting Warhammer fame!) talked about the alluring notion of a statistical emulator, and Barbara Engelhardt talked about variable selection in a SNP setting. I did not get a feel for how handling ten million SNPs towards a variable selection goal was possible. My final session of the day was actually “my” invited session on ABC methods, where Richard Everitt presented a way of mixing exact approximation with ABC and synthetic likelihood (Wood, Nature) approximations. The resulting MAVIS algorithm is not out yet. The second speaker was Ollie Ratmann, who spoke on his accurate ABC that I have discussed many times here. And Jean-Michel Marin managed to drive from Montpellier just in time to deliver his talk on our various explorations of the ABC model choice problem.

After a quick raclette at “home”, we headed back to the second poster session, where I had enough of a clear mind and not too much of a headache (!) to have several interesting discussions, incl. a new parallelisation suggested by Ben Calderhead, the sticky Metropolis algorithm of Luca Martino, the airport management video of Jegar Pitchforth, and the mixture of Dirichlet distributions for extremes by Anne Sabourin, not to mention posters from Warwick or Paris. At the end of the evening I walked back to my apartment with the Blossom skis we had brought in the morning to attract registrations for the ski race: not enough to make up for the amount charged by the ski school. Too bad, especially given Anto’s efforts to get this amazing sponsorship!