Archive for arXiv

MCMC importance samplers for intractable likelihoods

Posted in Books, pictures, Statistics with tags , , , , , , , , , , , on May 3, 2019 by xi'an

Jordan Franks just posted on arXiv his PhD dissertation at the University of Jyväskylä, where he discuses several of his works:

  1. M. Vihola, J. Helske, and J. Franks. Importance sampling type estimators based on approximate marginal MCMC. Preprint arXiv:1609.02541v5, 2016.
  2. J. Franks and M. Vihola. Importance sampling correction versus standard averages of reversible MCMCs in terms of the asymptotic variance. Preprint arXiv:1706.09873v4, 2017.
  3. J. Franks, A. Jasra, K. J. H. Law and M. Vihola.Unbiased inference for discretely observed hidden Markov model diffusions. Preprint arXiv:1807.10259v4, 2018.
  4. M. Vihola and J. Franks. On the use of ABC-MCMC with inflated tolerance and post-correction. Preprint arXiv:1902.00412, 2019

focusing on accelerated approximate MCMC (in the sense of pseudo-marginal MCMC) and delayed acceptance (as in our recently accepted paper). Comparing delayed acceptance with MCMC importance sampling to the advantage of the later. And discussing the choice of the tolerance sequence for ABC-MCMC. (Although I did not get from the thesis itself the target of the improvement discussed.)

deep and embarrassingly parallel MCMC

Posted in Books, pictures, Statistics with tags , , , , , , , on April 9, 2019 by xi'an

Diego Mesquita, Paul Blomstedt, and Samuel Kaski (from Helsinki, like the above picture) just arXived a paper on embarrassingly parallel MCMC. Following a series of papers discussed on this ‘og in the past. They use a deep learning approach of Dinh et al. (2017) to the computation of the probability density of a convoluted and non-volume-preserving transform of a given random variable to turn multiple samples from sub-posteriors [corresponding to the k k-th roots of the true posterior] into a sample from the true posterior. If I understand correctly the argument [on page 4], the deep neural network provides a density estimate that apparently does better than traditional non-parametric density estimates. Maybe by being more efficient than a Parzen-Rosenblat estimator which is of order the number of simulations… For any value of θ, the estimate of the true target is the product of these estimates and for a value of θ simulated from one of the subposteriors an importance weight naturally ensues. However, for a one-dimensional transform of θ, h(θ), I would prefer estimating first the density of h(θ) for each sample and then construct an importance weight. If only to avoid the curse of dimension.

On various benchmarks, like the banana-shaped 2D target above, the proposed method (NAP) does better. Even in relatively high dimensions. Given that the overall computing times are not produced, with only the calibration that the same number of subsamples were produced for all methods, it would be interesting to test the same performances on even higher dimensions and larger population sizes.

automatic adaptation of MCMC algorithms

Posted in pictures, Statistics with tags , , , , , , , on March 4, 2019 by xi'an

“A typical adaptive MCMC sampler will approximately optimize performance given the kind of sampler chosen in the first place, but it will not optimize among the variety of samplers that could have been chosen.”

Last February (2018), Dao Nguyen and five co-authors arXived a paper that I missed. On a new version of adaptive MCMC that aims at selecting a wider range of proposal kernels. Still requiring a by-hand selection of this collection of kernels… Among the points addressed, beyond the theoretical guarantees that the adaptive scheme does not jeopardize convergence to the proper target, are a meta-exploration of the set of combinations of samplers and integration of the computational speed in the assessment of each sampler. Including the very difficulty of assessing mixing. One could deem the index of the proposal as an extra (cyber-)parameter to its generic parameter (like the scale in the random walk), but the discreteness of this index makes the extension more delicate than expected. And justifies the distinction between internal and external parameters. The notion of a worst-mixing dimension is quite appealing and connects with the long-term hope that one could spend the maximum fraction of the sampler runtime over the directions that are poorly mixing, while still keeping the target as should be. The adaptive scheme is illustrated on several realistic models with rather convincing gains in efficiency and time.

The convergence tools are inspired from Roberts and Rosenthal (2007), with an assumption of uniform ergodicity over all kernels considered therein which is both strong and delicate to assess in practical settings. Efficiency is rather unfortunately defined in terms of effective sample size, which is a measure of correlation or lack thereof, but which does not relate to the speed of escape from the basin of attraction of the starting point. I also wonder at the pertinence of the estimation of the effective sample size when the chain is based on different successive kernels, since the lack of correlation may be due to another kernel. Another calibration issue is the internal clock that relates to the average number of iterations required to tune properly a specific kernel, which again may be difficult to assess in a realistic situation. A last query is whether or not this scheme could be compared with an asynchronous (and valid) MCMC approach that exploits parallel capacities of the computer.

rethinking the ESS

Posted in Statistics with tags , , , , , , , , , on September 14, 2018 by xi'an

Following Victor Elvira‘s visit to Dauphine, one and a half year ago, where we discussed the many defects of ESS as a default measure of efficiency for importance sampling estimators, and then some more efforts (mostly from Victor!) to formalise these criticisms, Victor, Luca Martino and I wrote a paper on this notion, now arXived. (Victor most kindly attributes the origin of the paper to a 2010 ‘Og post on the topic!) The starting thread of the (re?)analysis of this tool introduced by Kong (1992) is that the ESS used in the literature is an approximation to the “true” ESS, generally unavailable. Approximation that is pretty crude and hence impacts the relevance of using it as the assessment tool for comparing importance sampling methods. In the paper, we re-derive (with the uttermost precision) the resulting approximation and list the many assumptions that [would] validate this approximation. The resulting drawbacks are many, from the absurd property of always being worse than direct sampling, to being independent from the target function and from the sample per se. Since only importance weights matter. This list of issues is not exactly brand new, but we think it is worth signaling given the fact that this approximation has been widely used in the last 25 years, due to its simplicity, as a practical rule of thumb [!] in a wide variety of importance sampling methods. In continuation of the directions drafted in Martino et al. (2017), we also indicate some alternative notions of importance efficiency. Note that this paper does not cover the use of ESS for MCMC algorithms, where it is somewhat more legit, if still too rudimentary to really catch convergence or lack thereof! [Note: I refrained from the post title resinking the ESS…]

coordinate sampler as a non-reversible Gibbs-like MCMC sampler

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , on September 12, 2018 by xi'an

In connection with the talk I gave last July in Rennes for MCqMC 2018, I posted yesterday a preprint on arXiv of the work that my [soon to defend!] Dauphine PhD student Changye Wu and I did on an alternative PDMP. In this novel avatar of the zig-zag sampler,  a  non-reversible, continuous-time MCMC sampler, that we called the Coordinate Sampler, based on a piecewise deterministic Markov process. In addition to establishing the theoretical validity of this new sampling algorithm, we show in the same line as Deligiannidis et al.  (2018) that the Markov chain it induces exhibits geometrical ergodicity for distributions which tails decay at least as fast as an exponential distribution and at most as fast as a Gaussian distribution. A few numerical examples (a 2D banana shaped distribution à la Haario et al., 1999, strongly correlated high-dimensional normals, a log-Gaussian Cox process) highlight that our coordinate sampler is more efficient than the zig-zag sampler, in terms of effective sample size.Actually, we had sent this paper before the summer as a NIPS [2018] submission, but it did not make it through [the 4900 submissions this year and] the final review process, being eventually rated above the acceptance bar but not that above!

troubling trends in machine learning

Posted in Books, pictures, Running, Statistics, University life with tags , , , , , , , , , , , , , on July 25, 2018 by xi'an

This morning, in Coventry, while having an n-th cup of tea after a very early morning run (light comes early at this time of the year!), I spotted an intriguing title in the arXivals of the day, by Zachary Lipton and Jacob Steinhard. Addressing the academic shortcomings of machine learning papers. While I first thought little of the attempt to address poor scholarship in the machine learning literature, I read it with growing interest and, although I am pessimistic at the chances of inverting the trend, considering the relentless pace and massive production of the community, I consider the exercise worth conducting, if only to launch a debate on the excesses found in the literature.

“…desirable characteristics:  (i) provide intuition to aid the reader’s understanding, but clearly distinguish it from stronger conclusions supported by evidence; (ii) describe empirical investigations that consider and rule out alternative hypotheses; (iii) make clear the relationship between theoretical analysis and intuitive or empirical claims; and (iv) use language to empower the reader, choosing terminology to avoid misleading or unproven connotations, collisions with other definitions, or conflation with other related but distinct concepts”

The points made by the authors are (p.1)

  1. Failure to distinguish between explanation and speculation
  2. Failure to identify the sources of empirical gains
  3. Mathiness
  4. Misuse of language

Again, I had misgiving about point 3., but this is not an anti-maths argument, rather about the recourse to vaguely connected or oversold mathematical results as a way to support a method.

Most interestingly (and living dangerously!), the authors select specific papers to illustrate their point, picking from well-established authors and from their own papers, rather than from junior authors. And also include counter-examples of papers going the(ir) right way. Among the recommendations for emerging from the morass of poor scholarship papers, they suggest favouring critical writing and retrospective surveys (provided authors can be found for these!). And mention open reviews before I can mention these myself. One would think that published anonymous reviews are a step in the right direction, I would actually say that this should be the norm (plus or minus anonymity) for all journals or successors of journals (PCis coming strongly to mind). But requiring more work from the referees implies rewards for said referees, as done in some biology and hydrology journals I refereed for (and PCIs of course).

accelerating MCMC

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on April 11, 2018 by xi'an

As forecasted a rather long while ago (!), I wrote a short and incomplete survey on some approaches to accelerating MCMC. With the massive help of Victor Elvira (Lille), Nick Tawn (Warwick) and Changye Wu (Dauphine). Survey which current version just got arXived and which has now been accepted by WIREs Computational Statistics. The typology (and even the range of methods) adopted here is certainly mostly arbitrary, with suggestions for different divisions made by a very involved and helpful reviewer. While we achieved a quick conclusion to the review process, suggestions and comments are most welcome! Even if we cannot include every possible suggestion, just like those already made on X validated. (WIREs stands for Wiley Interdisciplinary Reviews and its dozen topics cover several fields, from computational stats to biology, to medicine, to engineering.)