This Bayesian indirect inference paper by Chris Drovandi and Tony Pettitt was discussed on the ‘Og two weeks ago and Chris sent me the following comments.
…unsurprisingly, the performances of ABC comparing true data of size n with synthetic data of size m>n are not great. However, there exists another way of reducing the variance in the synthetic data, namely by repeating simulations of samples of size n and averaging the indicators for proximity, resulting in a frequency rather than a 0-1 estimator. See e.g. Del Moral et al. (2009). In this sense, increasing the computing power reduces the variability of the ABC approximation. (And I thus fail to see the full relevance of Result 1.)
Taking the average of the indicators from multiple simulations will reduce the variability of the estimated ABC likelihood but because it is only still an unbiased estimate it will not alter the target and will not improve the ABC approximation (Andrieu and Roberts 2009). It will only have the effect of improving the mixing of MCMC ABC. Result 1 is used to contrast ABC II and BIL as they behave quite differently as n is increased.
The authors make several assumptions of unicity that I somewhat find unclear. While assuming that the MLE for the auxiliary model is unique could make sense (Assumption 2), I do not understand the corresponding indexing of this estimator (of the auxiliary parameter) on the generating (model) parameter θ. It should only depend on the generated/simulated data x. The notion of a noisy mapping is just confusing to me.
The dependence on θ is a little confusing I agree (especially in the context of ABC II methods). It starts to become more clear in the context of BIL. As n goes to infinity, the effect of the simulated data is removed and then we obtain the function φ(θ) (so we need to remember which θ simulated the data), which is referred to as the mapping or binding function in the II literature. If we somehow knew the binding function, BIL would proceed straightforwardly. But of course we don’t in practice, so we try to estimate it via simulated data (which, for computational reasons, needs to be a finite sample) from the true model based on theta. Thus we obtain a noisy estimate of the mapping. One way forward might be to fit some (non-parametric?) regression model to smooth out the noise and try to recover the true mapping (without ever taking n to infinity) and run a second BIL with this estimated mapping. I plan to investigate this in future work.
The assumption that the auxiliary score function at the auxiliary MLE for the observed data and for a simulated dataset (Assumption 3) is unique proceeds from the same spirit. I however fail to see why it matters so much. If the auxiliary MLE is the result of a numerical optimisation algorithm, the numerical algorithm may return local modes. This only adds to the approximative effect of the ABC-I schemes.
The optimiser failing to find the MLE (local mode) is certainly an issue shared by all BII methods, apart from ABC IS (which only requires 1 optimisation, so more effort to find the MLE can be applied here). Assuming the optimiser can obtain the MLE, I think the uniqueness assumptions makes sense. It basically says that, for a particular simulated dataset we would like a unique value for the ABC discrepancy function.
Given that the paper does not produce convergence results for those schemes, unless the auxiliary model contains the genuine model, such theoretical assumptions do not feel that necessary.
Actually, the ABC II methods will never converge to the true posterior (in general) due to lack of sufficiency. This is even the case if the true model is a special case of the auxiliary model! (in which case BIL can converge to the true posterior)
The paper uses normal mixtures as an auxiliary model: the multimodality of this model should not be such an hindrance (and reordering is transparent, i.e. does not “reduce the flexibility of the auxiliary model”, and does not “increase the difficulty of implementation”, as stated p.16).
Thanks for your comment. I need to think about this more as I am not an expert on mixture modelling. The standard EM algorithm in Matlab does not apply any ordering to the parameters of the components and uses a random start. Thus it can return any of the multiple MLEs on offer, so the ABC IP will not work here. So from my point of view, any alternative will increase the difficulty of implementation as it means I cannot use the standard software. Especially considering I can apply any other BII method without worrying about the non-unique MLE.
The paper concludes from a numerical study to the superiority of the Bayesian indirect inference of Gallant and McCulloch (2009). Which simply replaces the true likelihood with the maximal auxiliary model likelihood estimated from a simulated dataset. (This is somehow similar to our use of the empirical likelihood in the PNAS paper.) It is however moderated by the cautionary provision that “the auxiliary model [should] describe the data well”. As for empirical likelihood, I would suggest resorting to this Bayesian indirect inference as a benchmark, providing a quick if possibly dirty reference against which to test more elaborate ABC schemes. Or other approximations, like empirical likelihood or Wood’s synthetic likelihood.
Unfortunately the methods are not quick (apart from ABC IS when the scores are analytic), but good approximations can be obtained. The majority of Bayesian methods that deal with intractable likelihoods do not target the true posterior (there are a couple of exceptions in special cases) and thus also suffer from some dirtiness, and BII does not escape from that. But, if a reasonable auxiliary model can be found, then I would suggest that (at least one of the) BII methods will be competitive.
On reflection for BIL it is not necessary for the auxiliary model to fit the data, since the generative model being proposed may be mis-specified and also not fit the data well. BIL needs an auxiliary model that mimics well the likelihood of the generative model for values of theta in non-negligible posterior regions. For ABC II, we are simply looking for a good summarisation of the data. Therefore it would seem useful if the auxiliary model did fit the data well. Note this process is independent of the generative model being proposed. Therefore the auxiliary model would be the same regardless of the chosen generative model. Very different considerations indeed.
Inspired by a discussion with Anthony Lee, it appears that the (Bayesian version) of synthetic likelihood you mentioned is actually also a BIL method but where the auxiliary model is applied to the summary statistic likelihood rather than the full data likelihood. The synthetic likelihood is nice from a numerical/computational point of view as the MLE of the auxiliary model is analytic.