**A**n ‘Og’s reader pointed me to this paper by Li and Malik, which made it to arXiv after not making it to NIPS. While the NIPS reviews were not particularly informative and strongly discordant, the authors point out in the comments that the reviews are available for the sake of promoting discussion. (As made clear in earlier posts, I am quite supportive of this attitude! *Disclaimer: I was not involved in the evaluation of this paper, neither for NIPS nor for another conference or journal!!*) Although the paper does not seem to mention ABC in the setting of implicit likelihoods and generative models, there is a reference to the early (1984) paper by Peter Diggle and Richard Gratton that is often seen as the ancestor of ABC methods. The authors point out numerous issues with solutions proposed for parameter estimation in such implicit models. For instance, for GANs, they signal that “minimizing the Jensen-Shannon divergence or the Wasserstein distance between the empirical data distribution and the model distribution does not necessarily minimize the same between the true data distribution and the model distribution.” (Not mentioning the particular difficulty with Bayesian GANs.)

Their own solution is the implicit maximum likelihood estimator, which picks the value of the parameter θ bringing a simulated sample closest to the observed sample. Closest in the sense of the Euclidean distance between both samples. Or between the minimum of several simulated samples and the observed sample. (The modelling seems to imply the availability of n>1 observed samples.) They advocate a stochastic gradient descent approach for finding the optimal parameter θ, which presupposes that the dependence between θ and the simulated samples is somewhat differentiable. (And this does not account for using a min, which would make differentiation close to impossible.)
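A back-of-the-envelope version of that estimator, on a toy Normal location model: the grid search, the sorting of the samples, and every numerical choice below are mine, standing in for the stochastic gradient machinery of the paper rather than reproducing it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy implicit model: given theta, simulate a sample of size n
# (here N(theta, 1), standing in for an intractable-likelihood model).
def simulate(theta, n):
    return theta + rng.standard_normal(n)

# Observed sample, drawn with a "true" theta of 2.0 purely for illustration.
x_obs = simulate(2.0, 100)

# Implicit maximum likelihood idea: for each candidate theta, simulate
# several samples and keep the smallest Euclidean distance to the observed
# sample (sorting both samples makes the distance invariant to the arbitrary
# ordering of i.i.d. draws -- my choice, not necessarily the authors').
def imle_loss(theta, m=20):
    return min(
        np.linalg.norm(np.sort(simulate(theta, x_obs.size)) - np.sort(x_obs))
        for _ in range(m)
    )

# Grid search over theta stands in for stochastic gradient descent.
grid = np.linspace(0.0, 4.0, 81)
theta_hat = grid[np.argmin([imle_loss(t) for t in grid])]
```

With the min over simulated samples, the loss is indeed non-smooth in θ, which is the differentiability worry raised above.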
The paper then meanders through a lengthy discussion of whether maximising the likelihood makes sense, with a rather naïve view of why using the empirical distribution in a Kullback-Leibler divergence does not make sense! What does not make sense, in my opinion, is considering the finite-sample approximation to the Kullback-Leibler divergence with the true distribution.

## Archive for Peter Diggle

## Implicit maximum likelihood estimates

Posted in Statistics with tags ABC, Approximate Bayesian computation, GANs, Hyvärinen score, Kullback-Leibler divergence, likelihood-free methods, maximum likelihood estimation, NIPS 2018, Peter Diggle, untractable normalizing constant, Wasserstein distance on October 9, 2018 by xi'an

## statistics: a data science for the 21st century

Posted in Statistics with tags data science, Lancaster University, Peter Diggle, Royal Statistical Society, University of Warwick, Warwick Public Lecture on May 15, 2018 by xi'an

## Approximate Maximum Likelihood Estimation

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags ABC, Austria, Don Rubin, James Spall, Kiefer-Wolfowitz algorithm, Linz, optimisation, Peter Diggle, stochastic approximation, stochastic gradient on September 21, 2015 by xi'an

**B**ertl *et al.* arXived last July a paper on a maximum likelihood estimator based on an alternative to ABC techniques. And to indirect inference. (One of the authors in *et al.* is Andreas Futschik, whom I visited last year in Linz.) A paper that I only spotted when gathering references for a reading list on ABC… The method is related to the “original ABC paper” of Diggle and Gratton (1984) which, parallel to Rubin (1984), contains in retrospect the idea of ABC methods. The starting point is stochastic approximation, namely the optimisation of a function of a parameter θ written as an expectation of a random variable Y, **E**[Y|θ], as in the Kiefer-Wolfowitz algorithm. However, in the case of the likelihood function, there is rarely an unbiased estimator and the authors propose instead to use a kernel density estimator of the density of the summary statistic. This means that, at each iteration of the Kiefer-Wolfowitz algorithm, two sets of observations and hence of summary statistics are simulated and two kernel density estimates derived, both to be applied to the observed summary. The sequences underlying the Kiefer-Wolfowitz algorithm are taken from (the excellent optimisation book of) Spall (2003), along with on-the-go adaptation and a convergence test.
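A stripped-down sketch of the scheme, under toy assumptions of my own (scalar θ, a Normal model with the sample mean as one-dimensional summary, plain gain sequences rather than Spall's adaptive versions, and no convergence test):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy implicit model: the summary statistic is the mean of an N(theta, 1)
# sample of size 4; pretend its density can only be reached by simulation.
def summary(theta, n=4):
    return theta + rng.standard_normal(n).mean()

s_obs = 1.5  # observed summary statistic, fixed here for illustration

# Kernel density estimate of the summary's density at s_obs, built from m
# fresh simulations at theta (Gaussian kernel, Silverman rule-of-thumb
# bandwidth) -- the biased likelihood estimate discussed below.
def log_kde(theta, m=200):
    sims = np.array([summary(theta) for _ in range(m)])
    h = 1.06 * sims.std() * m ** (-0.2)
    dens = np.exp(-0.5 * ((s_obs - sims) / h) ** 2).sum() / (m * h * np.sqrt(2 * np.pi))
    return np.log(dens + 1e-300)  # guard against log(0) far from the data

# Kiefer-Wolfowitz iteration: two sets of simulations and two kernel density
# estimates per step, both applied to the observed summary, giving a
# finite-difference ascent direction on the estimated log-likelihood.
theta = 0.0
for k in range(1, 401):
    a_k = 0.2 / k          # step-size sequence (toy choice)
    c_k = 0.5 / k ** 0.25  # finite-difference half-width
    grad = (log_kde(theta + c_k) - log_kde(theta - c_k)) / (2 * c_k)
    theta += a_k * grad
```

Even this one-dimensional toy makes the cost visible: each of the 400 iterations calls for 400 fresh simulations of the model.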

The theoretical difficulty in this extension is however that the kernel density estimator is not unbiased and thus that, rigorously speaking, the validation of the Kiefer-Wolfowitz algorithm does not apply here. On the practical side, the need for multiple starting points and multiple simulations of pseudo-samples may induce a considerable computing overhead. Especially if the bootstrap is used to evaluate the precision of the MLE approximation. Besides normal and M/G/1 queue examples, the authors illustrate the approach on a population genetic dataset of Borneo and Sumatra orang-utans. With 5 parameters and 28 summary statistics. Which thus means using a kernel density estimator in dimension 28, a rather perilous adventure..!

## When Buffon meets Bertrand

Posted in R, Statistics, Travel with tags Bertrand's paradox, Buffon's needle, Durham university, Julian Besag, Peter Diggle, Von Mises on April 7, 2011 by xi'an

**W**hen Peter Diggle gave his “short history” of spatial statistics this morning (I typed this in the taxi from Charles de Gaulle airport, after waiting one hour for my bag!), he started with a nice slide about Buffon’s needle (and Buffon’s portrait), since Julian Besag was often prone to give this problem as a final exam to Durham students (one of whom is responsible for the candidate’s formula). This started me thinking about how this was open to a Bertrand’s paradox of its own. Indeed, randomness for the needle throw can be represented in many ways:

- needle centre uniformly distributed over the room (or the perpendicular to the boards) with a random orientation (with a provision to have the needle fit);
- needle endpoint uniformly distributed over the room (again a uniform over the perpendicular is enough) with a random orientation (again with a constraint);
- random orientation from one corner of the room and a uniform location of the centre on the resulting line (with constraints on both ends for the needle to fit);
- random orientation from one corner of the room and a uniform location of one endpoint on the resulting line, plus a Bernoulli generation to decide on the orientation (with constraints on both ends for the needle to fit);
- &tc.

**I** did not have time to implement those different generation mechanisms in R, but have little doubt they should lead to different probabilities of intersection between the needle and one of the board separations. I actually found a web-page at the University of Alabama in Huntsville addressing this problem through exercises (plus 20,000 related entries! Including von Mises’ *Probability, Statistics and Truth* itself, a book I should read one of those days, following Andrew.). Note that each version corresponds to a physical mechanism, so that there is no way to distinguish between them. Had I time, I would also like to consider the limiting case when the room gets infinite as, presumably, some of those proposals would end up being identical.
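Since the R experiment is left for another day, here is a quick stand-in (in Python, with a room of unit width, boards every d = 0.25 and a needle of length 0.2, all toy choices of mine) contrasting the boundary-free classic throw with the first mechanism of the list; with these choices, the provision that the needle fit inside the room already pulls the crossing probability below the textbook 2ℓ/(πd).

```python
import numpy as np

rng = np.random.default_rng(3)

d, ell, N = 0.25, 0.2, 200_000  # board spacing, needle length, throws

def crossed(y1, y2, d):
    # A board (a multiple of d) lies between the two endpoints'
    # perpendicular coordinates iff their floors differ.
    return np.floor(y1 / d) != np.floor(y2 / d)

# Classic randomisation: centre's perpendicular coordinate uniform modulo
# the spacing, orientation uniform, no room boundary at all.
phi = rng.uniform(0, np.pi, N)
y_c = rng.uniform(0, d, N)
off = 0.5 * ell * np.sin(phi)
p_classic = crossed(y_c - off, y_c + off, d).mean()

# First mechanism of the list: centre uniform over the perpendicular of a
# unit-width room (boards at 0.25, 0.5, 0.75), uniform orientation, and the
# throw resampled (here: rejected) whenever the needle sticks out.
phi = rng.uniform(0, np.pi, N)
c = rng.uniform(0, 1, N)
off = 0.5 * ell * np.sin(phi)
keep = (c - off >= 0) & (c + off <= 1)
p_room = crossed((c - off)[keep], (c + off)[keep], d).mean()

print(p_classic, 2 * ell / (np.pi * d))  # close to the classic 2l/(pi d)
print(p_room)                            # noticeably smaller
```

Only the perpendicular coordinate is simulated, as noted in the list above; the condition along the boards does not affect crossings and so can be dropped.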