## Archive for Berlin

## Wow!

Posted in pictures, Running with tags 2:01:39, Berlin, Berlin marathon, Brandenburg Gate, Eliud Kipchoge, marathon, World record on September 16, 2018 by xi'an

## Markov chain importance sampling

Posted in Books, pictures, Running, Statistics, Travel, University life with tags Berlin, Euler discretisation, Freie Universität Berlin, importance sampling, Ingmar Schuster, Langevin MCMC algorithm, marginal, MCMC algorithms, Metropolis-Hastings algorithm, Rao-Blackwellisation, Université Paris Dauphine, variance reduction on May 31, 2018 by xi'an

**I**ngmar Schuster (formerly a postdoc at Dauphine and now at Freie Universität Berlin) and Ilja Klebanov (from Berlin) have recently arXived a paper on recycling proposed values in [a rather large class of] Metropolis-Hastings and unadjusted Langevin algorithms. This means using the proposed variates of one of these algorithms as an importance sample, with an importance weight going from the target over the (fully conditional) proposal to the target over the marginal proposal distribution. In the Metropolis-Hastings case, since the latter is not available in most setups, the authors suggest using a Rao-Blackwellised nonparametric estimate based on the entire MCMC chain. Or a subset.
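The recycling scheme can be sketched in a few lines. The following is a minimal Python illustration under assumptions of my own (a standard normal target, a Gaussian random walk proposal, and a kernel-average Rao-Blackwellisation of the marginal proposal density over the chain states), not the authors' actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # toy unnormalised target: standard normal (an assumption for illustration)
    return -0.5 * x * x

def mh_importance_estimate(n=2000, sigma=1.0):
    # random-walk Metropolis, but keeping every *proposed* value
    x = 0.0
    states = np.empty(n)
    props = np.empty(n)
    for t in range(n):
        y = x + sigma * rng.standard_normal()
        props[t] = y
        if np.log(rng.random()) < log_target(y) - log_target(x):
            x = y
        states[t] = x
    # Rao-Blackwellised estimate of the marginal proposal density rho at each
    # proposed point: average of the N(x_t, sigma^2) kernels over chain states
    diff = props[:, None] - states[None, :]
    rho = np.exp(-0.5 * (diff / sigma) ** 2).mean(axis=1) / (sigma * np.sqrt(2 * np.pi))
    # self-normalised importance weights pi / rho (pi known up to a constant)
    w = np.exp(log_target(props)) / rho
    w /= w.sum()
    return float(np.sum(w * props ** 2))  # estimates E[X^2] under the target
```

For a standard normal target, `mh_importance_estimate()` should return a value close to E[X²] = 1, using *all* proposals rather than only the accepted ones.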

“Our estimator refutes the folk theorem that it is hard to estimate [the normalising constant] with mainstream Monte Carlo methods such as Metropolis-Hastings.”

The paper thus brings an interesting focus on the proposed values, rather than on the original Markov chain, which naturally brings back to mind the derivation of the joint distribution of these proposed values we made in our (1996) Rao-Blackwellisation paper with George Casella. Where we considered a parametric and non-asymptotic version of this distribution, which yields a guaranteed improvement to MCMC (Metropolis-Hastings) estimates of integrals. In subsequent papers with George, we tried to quantify this improvement and to compare different importance samplers based on some importance sampling corrections, but as far as I remember, we only got partial results in that direction, and did not cover the special case of the normalising constant… Normalising constants did not seem such a pressing issue at that time, I figure. (A *Monte Carlo 101* question: how can we be certain the importance sampler offers a finite variance?)

Ingmar’s views about this:

I think this is interesting future work. My intuition is that for Metropolis-Hastings importance sampling with random walk proposals, the variance is guaranteed to be finite because the importance distribution ρ_θ is a convolution of your target ρ with the random walk kernel q. This guarantees that the tails of ρ_θ are no lighter than those of ρ. What other forms of q mean for the tails of ρ_θ I have less intuition about.

When considering the Langevin alternative with transition (4), I was first confused and thought it was incorrect for moving from one value of Y (proposal) to the next. But that’s what unadjusted means in “unadjusted Langevin”! As pointed out in the early Langevin literature, e.g., by Gareth Roberts and Richard Tweedie, using a discretised Langevin diffusion in an MCMC framework means there is a risk of non-stationarity & non-ergodicity. Obviously, the corrected (MALA) version is more delicate to approximate (?) but at the very least it ensures the Markov chain does not diverge. Even when the unadjusted Langevin has a stationary regime, its joint distribution is likely quite far from the joint distribution of a proper discretisation. Now this also made me think about a parameterised version in the 1996 paper spirit, but there is nothing specific about MALA that would prevent the implementation of the general principle. As for the unadjusted version, the joint distribution is directly available. (But not necessarily the marginals.)
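To make the "unadjusted" point concrete, here is a minimal ULA sketch, with an assumed standard normal target so the step-size bias is checkable in closed form (for step size h, the chain is an AR(1) process with stationary variance 1/(1 − h/4) rather than 1):

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_log_target(x):
    # toy target: standard normal, so grad log pi(x) = -x (an assumption)
    return -x

def ula(x0=0.0, step=0.1, n=20000):
    # Euler discretisation of the Langevin diffusion *without* a Metropolis
    # correction: every move is accepted, so the chain targets a step-size
    # biased approximation of pi (variance 1/(1 - step/4) here, not 1)
    xs = np.empty(n)
    x = x0
    for t in range(n):
        x += 0.5 * step * grad_log_target(x) + np.sqrt(step) * rng.standard_normal()
        xs[t] = x
    return xs
```

Adding the accept/reject step on top of this kernel turns it into MALA and removes the discretisation bias, at the cost of the marginal proposal distribution no longer being directly available.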

Here is an answer from Ingmar about that point:

Personally, I think the most interesting part is the practical performance gain in terms of estimation accuracy for fixed CPU time, combined with the convergence guarantee from the CLT. ULA was particularly important to us because of the papers of Arnak Dalalyan, Alain Durmus & Eric Moulines and recently from Mike Jordan’s group, which all look at an unadjusted Langevin diffusion (and unimodal target distributions). But MALA admits a Metropolis-Hastings importance sampling estimator, just as Random Walk Metropolis does – we didn’t include MALA in the experiments to not get people confused with MALA and ULA. But there is no delicacy involved whatsoever in approximating the marginal MALA proposal distribution. The beauty of our approach is that it works for almost all Metropolis-Hastings algorithms where you can evaluate the proposal density q, there is no constraint to use random walks at all (we will emphasize this more in the paper).

## Berlin [and Vienna] noir [book review]

Posted in Statistics with tags Alone in Berlin, Berlin, Berlin noir, book reviews, Dachau, Graham Greene, Nazi State, Raymond Chandler, Reinhart Heydrich, Wien, WW II on August 17, 2017 by xi'an

**W**hile in Cambridge last month, I picked a few books from a local bookstore as fodder for my incoming vacations. Including this omnibus volume made of the first three books by Philip Kerr featuring Bernie Gunther, a private and Reich detective in Nazi Germany, namely, *March Violets* (1989), *The Pale Criminal* (1990), and *A German Requiem* (1991). (Books that I actually read before the vacations!) The stories take place before the war, in 1938, and right after, in 1946, in Berlin and Vienna. The books centre on a German version of Philip Marlowe, wisecracks included, with various degrees of success. (There actually is a silly comparison with Chandler on the back of the book! And I found somewhere else a similarly inappropriate comparison with Graham Greene's The Third Man…) Although I read all three books in a single week, which clearly shows some undeniable addictive quality in the plots, I find those plots somewhat shallow and contrived, especially the second one, revolving around a serial killer of young girls who aims at blaming Jews for those crimes and at justifying further Nazi persecutions. Or the time spent in Dachau by Bernie Gunther as an undercover agent for Heydrich. If anything, the third volume, taking place in post-war Berlin and Wien, is much better at recreating the murky atmosphere of those cities under Allied occupation. But overall there are far too many info-dump passages in those novels to make them a good read. The author has clearly done his documentation job correctly, from the early homosexual persecutions to Kristallnacht, to the fights for control between the occupying forces, but the information about the historical context is not always delivered in the most fluent way.
And having the main character working under Heydrich, then joining the SS, does make relating to him rather unlikely, to say the least. It is hence unclear to me why those books are so popular, apart from the easy marketing line that stories involving Nazis are more likely to sell… Nothing to be compared with the fantastic Alone in Berlin, depicting the somewhat senseless resistance of a Berliner during the Nazi years, dropping hand-written messages against the regime under strangers’ doors.

## seeking the error in nested sampling

Posted in pictures, Statistics, Travel with tags Berlin, curse of dimensionality, error assessment, John Skilling, Monte Carlo error, nested sampling, Nicolas Chopin on April 13, 2017 by xi'an

**A** newly arXived paper on the error in nested sampling, written by Higson and co-authors, and read in Berlin, looks at the difficult task of evaluating the sampling error of nested sampling. The conclusion is essentially negative in that the authors recommend multiple runs of the method to assess the magnitude of the variability of the output by bootstrap, i.e. to call for the most empirical approach…

The core of this difficulty lies in the half-plug-in, half-quadrature, half-Monte Carlo (!) feature of the nested sampling algorithm, in that (i) the truncation of the unit interval is based on an expectation of the mass of each shell (i.e., the zone between two consecutive isoclines of the likelihood), (ii) the evidence estimator is a quadrature formula, and (iii) the level of the likelihood at the truncation is replaced with a simulated value that is not even unbiased (and is correlated with the previous value in the case of an MCMC implementation). As discussed in our paper with Nicolas, the error in the evidence approximation is of the same order as that of other Monte Carlo methods, in that it decreases like the inverse square root of the number of terms used at each iteration. Contrary to earlier intuitions that focussed on the error due to the quadrature.
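These three ingredients can be seen in a bare-bones sketch of the algorithm. The toy choices below (a uniform prior, a Gaussian likelihood, expected shrinkage exp(−i/n) for the shell masses, and naive rejection sampling for the likelihood-constrained simulation) are mine, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)

def nested_sampling_log_evidence(log_lik, sample_prior, n_live=100, n_iter=600):
    # (i) the prior mass of each shell is replaced by its *expected*
    # shrinkage X_i = exp(-i/n_live); (ii) the evidence is a quadrature sum
    # of likelihood times shell width; (iii) the constrained resampling is
    # done by naive rejection from the prior (toy problems only)
    live = sample_prior(n_live)
    live_ll = np.array([log_lik(x) for x in live])
    log_z, X_prev = -np.inf, 1.0
    for i in range(n_iter):
        worst = int(np.argmin(live_ll))
        X = np.exp(-(i + 1) / n_live)
        log_z = np.logaddexp(log_z, live_ll[worst] + np.log(X_prev - X))
        X_prev = X
        while True:  # replace the dead point above the likelihood threshold
            x = sample_prior(1)[0]
            if log_lik(x) > live_ll[worst]:
                break
        live[worst], live_ll[worst] = x, log_lik(x)
    return float(log_z)
```

With a Uniform(−5, 5) prior and a standard normal likelihood, the true evidence is very close to 0.1, so the returned log evidence should sit near log(0.1) ≈ −2.30, up to the plug-in and Monte Carlo errors discussed above.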

But the situation is much less understood when the resulting sample is used for the estimation of quantities related with the posterior distribution. With no clear approach to assess, and even less to correct, the resulting error, since it is not solely a Monte Carlo error. As noted by the authors, the quadrature approximation to the univariate integral replaces the unknown prior weight of a shell with its Beta order statistic expectation *and* the average of the likelihood over the shell with a single (uniform???) realisation. Or the mean value of a transform of the parameter with a single (biased) realisation. Since most posterior expectations can be represented as integrals over likelihood levels of the average value over an iso-likelihood contour. The approach advocated in the paper involves multiple threads of an “unwoven nested sampling run”, which means launching n nested sampling runs, each with a single live point taken from the n current live points of the original nested sample. (Those threads may then later be recombined into a single nested sample.) This is the starting point for a nested flavour of bootstrapping, where threads are sampled with replacement, from which confidence intervals and error estimates can be constructed. (The original notion appears in Skilling’s 2006 paper, but I had missed it.)
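A toy version of the resample-and-recombine step might look as follows, under the simplifying assumption that each thread is a single-live-point run recorded as its ordered dead-point log-likelihoods (the actual recombination in the paper is more careful than this):

```python
import numpy as np

rng = np.random.default_rng(3)

def log_evidence_from_threads(thread_lls):
    # merge single-live-point threads into one run: pool the dead points,
    # sort them by likelihood, and apply the expected shrinkage exp(-i/n)
    n = len(thread_lls)
    lls = np.sort(np.concatenate(thread_lls))
    X = np.exp(-np.arange(1, lls.size + 1) / n)
    widths = np.concatenate(([1.0 - X[0]], X[:-1] - X[1:]))
    return float(np.logaddexp.reduce(lls + np.log(widths)))

def bootstrap_log_evidence_sd(thread_lls, n_boot=200):
    # resample whole threads with replacement, recompute the evidence each
    # time, and report the spread as an estimate of the sampling error
    n = len(thread_lls)
    reps = [log_evidence_from_threads([thread_lls[j]
                                       for j in rng.integers(0, n, n)])
            for _ in range(n_boot)]
    return float(np.std(reps))
```

The key design choice is that the bootstrap unit is the whole thread rather than the individual dead point, which preserves the within-thread dependence that an i.i.d. bootstrap would destroy.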

The above graphic is an attempt within the paper at representing the (marginal) posterior of a transform f(θ). That I do not fully understand… The notations are rather horrendous, as X is not the data but the prior probability for the likelihood to be above a given bound, which is actually the corresponding quantile. (There is no symbol for the data, and ℒ is used for the likelihood function as well as for realisations of the likelihood function…) A vertical slice on the central panel gives the posterior distribution of f(θ) given the event that the likelihood is in the corresponding upper tail. Or given the corresponding shell (?).

## oxwasp@amazon.de

Posted in Books, Kids, pictures, Running, Statistics, Travel, University life with tags Amazon, Berlin, bier, Brauhaus Lemke, doubly intractable problems, Germany, Google, Ising model, machine learning, normalising constant, optimisation, OxWaSP, quantum computers, Spree, Stadtmitte, University of Oxford, University of Warwick, workshop on April 12, 2017 by xi'an

**T**he reason for my short visit to Berlin last week was an OxWaSP (Oxford and Warwick Statistics Program) workshop hosted by Amazon Berlin, with talks between statistics and machine learning, plus posters from our second year students. While the workshop was quite intense, I very much enjoyed the atmosphere and the variety of talks there. (Just sorry that I left too early to enjoy the social programme at a local brewery, Brauhaus Lemke, and the natural history museum. But I still managed nice runs east and west!) One thing I found most interesting (if obvious in retrospect) was the different focus of academic and production talks, in that the latter do not aim at full generality or at a guaranteed improvement over existing approaches, as long as the new methodology provides a gain in efficiency.

This connected nicely with my reading several Nature articles on quantum computing during that trip, where researchers from Google predict commercial products appearing within the coming five years, even though the technology is far from perfect and the qubit outcomes error-prone. Among the examples they provided were quantum simulation (not meaning what I consider to be *simulation*!), quantum optimisation (as a way to overcome multimodality), and quantum sampling (targeting given probability distributions). I find the inclusion of the latter puzzling, in that simulation (in that sense) shows very little tolerance for errors, especially systematic bias. It may be that specific quantum architectures can be designed for specific probability distributions, just like some are already conceived for optimisation. (It may even be the case that quantum solutions are (just next to) available for intractable constants as in Ising or Potts models!)

## Statlearn17, Lyon

Posted in Kids, pictures, R, Statistics, Travel, University life with tags Berlin, conference, France, French Alps, Lyon, machine learning, R, SFDS, Statlearn 2017, train, Université Lumière Lyon 2 on April 6, 2017 by xi'an

**T**oday and tomorrow, I am attending the Statlearn17 conference in Lyon, France, a workshop with one-hour talks on statistics and machine learning. And which makes for the second workshop on machine learning in two weeks! Yesterday there were two tutorials in R, but I only took the train to Lyon this morning: it will be a pleasant opportunity to run tomorrow through a city I have never truly visited, if X'ed so many times when driving to the Alps. Interestingly, the trip started in Paris with me sitting in the train next to another speaker at the conference, despite having switched seat and carriage with another passenger! A speaker whom I did not know beforehand and could only identify by the R code he was running at 300km/h.