approximate likelihood perspective on ABC

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , on December 20, 2018 by xi'an

George Karabatsos and Fabrizio Leisen have recently published in Statistics Surveys a fairly complete survey on ABC methods [which earlier arXival I had missed]. Listing within an extensive bibliography of 20 pages some twenty-plus earlier reviews on ABC (with further ones in applied domains)!

“(…) any ABC method (algorithm) can be categorized as either (1) rejection-, (2) kernel-, and (3) coupled ABC; and (4) synthetic-, (5) empirical- and (6) bootstrap-likelihood methods; and can be combined with classical MC or VI algorithms [and] all 22 reviews of ABC methods have covered rejection and kernel ABC methods, but only three covered synthetic likelihood, one reviewed the empirical likelihood, and none have reviewed coupled ABC and bootstrap likelihood methods.”

The motivation for using approximate likelihood methods is provided by the examples of g-and-k distributions, although the likelihood can be efficiently derived by numerical means, as shown by Pierre Jacob‘s winference package, of mixed effect linear models, although a completion by the mixed effects themselves is available for Gibbs sampling as in Zeger and Karim (1991), and of the hidden Potts model, which we covered by pre-processing in our 2015 paper with Matt Moores, Chris Drovandi, Kerrie Mengersen. The paper produces a general representation of the approximate likelihood that covers the algorithms listed above as through the table below (where t(.) denotes the summary statistic):

The table looks a wee bit challenging simply because the review includes the synthetic likelihood approach of Wood (2010), which figured preeminently in the 2012 Read Paper discussion but opens the door to all kinds of approximations of the likelihood function, including variational Bayes and non-parametric versions. After a description of the above versions (including a rather ignored coupled version) and the special issue of ABC model choice,  the authors expand on the difficulties with running ABC, from multiple tuning issues, to the genuine curse of dimensionality in the parameter (with unnecessary remarks on low-dimension sufficient statistics since they are almost surely inexistent in most realistic settings), to the mis-specified case (on which we are currently working with David Frazier and Judith Rousseau). To conclude, an worthwhile update on ABC and on the side a funny typo from the reference list!

Li, W. and Fearnhead, P. (2018, in press). On the asymptotic efficiency
of approximate Bayesian computation estimators. Biometrika na na-na.

coordinate sampler as a non-reversible Gibbs-like MCMC sampler

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , on September 12, 2018 by xi'an

In connection with the talk I gave last July in Rennes for MCqMC 2018, I posted yesterday a preprint on arXiv of the work that my [soon to defend!] Dauphine PhD student Changye Wu and I did on an alternative PDMP. In this novel avatar of the zig-zag sampler,  a  non-reversible, continuous-time MCMC sampler, that we called the Coordinate Sampler, based on a piecewise deterministic Markov process. In addition to establishing the theoretical validity of this new sampling algorithm, we show in the same line as Deligiannidis et al.  (2018) that the Markov chain it induces exhibits geometrical ergodicity for distributions which tails decay at least as fast as an exponential distribution and at most as fast as a Gaussian distribution. A few numerical examples (a 2D banana shaped distribution à la Haario et al., 1999, strongly correlated high-dimensional normals, a log-Gaussian Cox process) highlight that our coordinate sampler is more efficient than the zig-zag sampler, in terms of effective sample size.Actually, we had sent this paper before the summer as a NIPS [2018] submission, but it did not make it through [the 4900 submissions this year and] the final review process, being eventually rated above the acceptance bar but not that above!

IMS workshop [day 3]

Posted in pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , on August 30, 2018 by xi'an

I made the “capital” mistake of walking across the entire NUS campus this morning, which is quite green and pretty, but which almost enjoys an additional dimension brought by such an intense humidity that one feels having to get around this humidity!, a feature I have managed to completely erase from my memory of my previous visit there. Anyway, nothing of any relevance. oNE talk in the morning was by Markus Eisenbach on tools used by physicists to speed up Monte Carlo methods, like the Wang-Landau flat histogram, towards computing the partition function, or the distribution of the energy levels, definitely addressing issues close to my interest, but somewhat beyond my reach for using a different language and stress, as often in physics. (I mean, as often in physics talks I attend.) An idea that came out clear to me was to bypass a (flat) histogram target and aim directly at a constant slope cdf for the energy levels. (But got scared away by the Fourier transforms!)

Lawrence Murray then discussed some features of the Birch probabilistic programming language he is currently developing, especially a fairly fascinating concept of delayed sampling, which connects with locally-optimal proposals and Rao Blackwellisation. Which I plan to get back to later [and hopefully sooner than later!].

In the afternoon, Maria de Iorio gave a talk about the construction of nonparametric priors that create dependence between a sequence of functions, a notion I had not thought of before, with an array of possibilities when using the stick breaking construction of Dirichlet processes.

And Christophe Andrieu gave a very smooth and helpful entry to partly deterministic Markov processes (PDMP) in preparation for talks he is giving next week for the continuation of the workshop at IMS. Starting with the guided random walk of Gustafson (1998), which extended a bit later into the non-reversible paper of Diaconis, Holmes, and Neal (2000). Although I had a vague idea of the contents of these papers, the role of the velocity ν became much clearer. And premonitory of the advances made by the more recent PDMP proposals. There is obviously a continuation with the equally pedagogical talk Christophe gave at MCqMC in Rennes two months [and half the globe] ago,  but the focus being somewhat different, it really felt like a new talk [my short term memory may also play some role in this feeling!, as I now remember the discussion of Hilderbrand (2002) for non-reversible processes]. An introduction to the topic I would recommend to anyone interested in this new branch of Monte Carlo simulation! To be followed by the most recently arXived hypocoercivity paper by Christophe and co-authors.

indecent exposure

Posted in Statistics with tags , , , , , , , , , , , , on July 27, 2018 by xi'an

While attending my last session at MCqMC 2018, in Rennes, before taking a train back to Paris, I was confronted by this radical opinion upon our previous work with Matt Moores (Warwick) and other coauthors from QUT, where the speaker, Maksym Byshkin from Lugano, defended a new approach for maximum likelihood estimation using novel MCMC methods. Based on the point fixe equation characterising maximum likelihood estimators for exponential families, when theoretical and empirical moments of the natural statistic are equal. Using a Markov chain with stationary distribution the said exponential family, the fixed point equation can be turned into a zero divergence equation, requiring simulation of pseudo-data from the model, which depends on the unknown parameter. Breaking this circular argument, the authors note that simulating pseudo-data that reproduce the observed value of the sufficient statistic is enough. Which is related with Geyer and Thomson (1992) famous paper about Monte Carlo maximum likelihood estimation. From there I was and remain lost as I cannot see why a derivative of the expected divergence with respect to the parameter θ can be computed when this divergence is found by Monte Carlo rather than exhaustive enumeration. And later used in a stochastic gradient move on the parameter θ… Especially when the null divergence is imposed on the parameter. In any case, the final slide shows an application to a large image and an Ising model, solving the problem (?) in 140 seconds and suggesting indecency, when our much slower approach is intended to produce a complete posterior simulation in this context.

Rennes street art [#3]

Posted in pictures, Running, Travel with tags , , , , on July 24, 2018 by xi'an

unusual clouds [jatp]

Posted in pictures, Travel, Wines with tags , , , , , , , , , , on July 19, 2018 by xi'an

subset sampling

Posted in Statistics with tags , , , , , , , , on July 13, 2018 by xi'an

A paper by Au and Beck (2001) was mentioned during a talk at MCqMC 2018 in Rennes and I checked Probabilistic Engineering Mechanics for details. There is no clear indication that the subset simulation advocated therein is particularly effective. The core idea is to obtain the probability to belong to a small set A by a cascading formula, namely the product of the probability to belong to A¹, then the conditional probability to belong to A² given A¹, &tc. When the subsets A¹, A², …, A constitute a decreasing embedded sequence. The simulation conditional on being in one of the subsets $A^i$ is operated by a random-walk Metropolis-within-Gibbs scheme, with an additional rejection when the value is not in the said subset. (Surprisingly, the authors re-establish the validity of this scheme.) Hence the proposal faces similar issues as nested sampling, except that the nested subsets here are defined quite differently as they are essentially free, provided they can be easily evaluated. Each of the random walks need be scaled, the harder a task because this depends on the corresponding subset volume. The subsets $A^i$ themselves are rarely defined in a natural manner, except when being tail events. And need to be calibrated so that the conditional probability of falling into each remains large enough, the cost of free choice. The Markov chain on the previous subset $A^i$ can prove useful to build the next subset $A^{i+1}$, but there is no general principle behind this remark. (If any, this is connected with X entropy.) But else, the past chains are very much wasted, compared with, say, an SMC treatment of the problem. The paper also notices that starting a Markov chain in the set $A^{i+1}$ means there is no burnin time and hence that the probability estimators are thus unbiased. (This creates a correlation between successive Markov chains, but I think it could be ignored if the starting point was chosen at random or after a random number of extra steps.) The authors further point out that the chain may fail to be ergodic, if the proposal distribution lacks energy to link connected regions of the current subset $A^i$. They suggest using multiple chains with multiple starting points, which alleviates the issue only to some extent, as it ultimately depends on the spread of the starting points. As acknowledged in the paper.