**A**n ICLR 2019 paper by Neklyudov, Egorov and Vetrov on an optimal choice of the proposal in an independent Metropolis algorithm I discovered via an X validated question. Namely whether or not the expected Metropolis-Hastings acceptance ratio is always one (which it is not when the support of the proposal is restricted). The paper mentions the domination of the Accept-Reject algorithm by the associated independent Metropolis-Hastings algorithm, which has actually been stated in our Monte Carlo Statistical Methods (1999, Lemma 6.3.2) and may prove even older. The authors also note that the expected acceptance probability is equal to one minus the total variation distance between the joint defined as target x Metropolis-Hastings proposal distribution and its time-reversed version. Which seems to suffer from the same difficulty as the one mentioned in the X validated question. Namely that it only holds when the support of the Metropolis-Hastings proposal is at least the support of the target (or else when the support of the joint defined as target x Metropolis-Hastings proposal distribution is somewhat symmetric. Replacing total variation with Kullback-Leibler then leads to a manageable optimisation target if the proposal is a parameterised independent distribution. With a GAN version when the proposal is not explicitly available. I find it rather strange that one still seeks independent proposals for running Metropolis-Hastings algorithms as the result will depend on the family of proposals considered and as performances will deteriorate with dimension (the authors mention a 10% acceptance rate, which sounds quite low). [As an aside, ICLR 2020 will take part in Addis Abeba next April.]

## Archive for total variation

## an independent sampler that maximizes the acceptance rate of the MH algorithm

Posted in Books, Kids, Statistics, University life with tags accept-reject algorithm, adaptive Monte Carlo algorithm, Addis Abeba, Bayesian GANs, Ethiopia, ICLR 2019, importance sampling, Kullback-Leibler divergence, Monte Carlo Statistical Methods, optimal acceptance rate, optimisation, reversibility, simulation, total variation on September 3, 2019 by xi'an## certified randomness, 187m away…

Posted in Statistics with tags Bell inequality, Nature, quantum computers, random number generation, randomness, RNG, total variation, uniformity test on May 3, 2018 by xi'an**A**s it rarely happens with Nature, I just read an article that directly relates to my research interests, about a secure physical random number generator (RNG). By Peter Bierhost and co-authors, mostly physicists apparently. Security here means that the outcome of the RNG is unpredictable. This very peculiar RNG is based on two correlated photons sent to two measuring stations, separated by at least 187m, which have to display unpredictable outcomes in order to respect the impossibility of faster-than-light communications, otherwise known as Bell inequalities. This is hardly practical though, especially when mentioning that the authors managed to produce 2¹⁰ random bits over 10 minutes, post processing “the measurement of 55 million photon pairs”. (I however fail to see why the two-arm apparatus would be needed for regular random generation as it seems relevant solely for the demonstration of randomness.) I also checked the associated supplementary material, which is mostly about proving some total variation bound, and constructing a Bell function. What is most puzzling in this paper (and the associated supplementary material) is the (apparent) lack of guarantee of uniformity of the RNG. For instance, a sentence (Supplementary Material, p.11) about a distribution being “within TV distance of uniform” hints at the method being not provably uniform, which makes the whole exercise incomprehensible…

## approximations of Markov Chains [another garden of forking paths]

Posted in Books, Mountains, pictures, Statistics, University life with tags approximate MCMC, computational budget, Doeblin's condition, Markov chain Monte Carlo, MCMskv, minimaxity, Monte Carlo Statistical Methods, noisy MCMC, total variation, uniform ergodicity, uniform geometric ergodicity on March 15, 2016 by xi'an**J**ames Johndrow and co-authors from Duke wrote a paper on approximate MCMC that was arXived last August and that I missed. David Dunson‘s talk at MCMski made me aware of it. The paper studies the impact of replacing a valid kernel with a close approximation. Which is a central issue for many usages of MCMC in complex models, as exemplified by the large number of talks on that topic at MCMski.

“All of our bounds improve with the MCMC sample path length at the expected rate in t.”

A major constraint in the paper is Doeblin’s condition, which implies uniform geometric ergodicity. Not only it is a constraint on the Markov kernel but it is also one for the Markov operator in that it may prove impossible to… prove. The second constraint is that the approximate Markov kernel is close enough to the original, which sounds reasonable. Even though one can always worry that the total variation norm is too weak a norm to mean much. For instance, I presume with some confidence that this does not prevent the approximate Markov kernel from not being ergodic, e.g., not irreducible, not absolutely continuous wrt the target, null recurrent or transient. Actually, the assumption is stronger in that there exists a *collection* of approximations for all small enough values ε of the total variation distance. (*Small enough* meaning ε is much smaller than the complement α to 1 of the one step distance between the Markov kernel and the target. With poor kernels, the approximation must thus be *very* good.) This is less realistic than assuming the availability of one single approximation associated with an existing but undetermined distance ε. (For instance, the three examples of Section 3 in the paper show the existence of approximations achieving a certain distance ε, without providing a constructive determination of such approximations.) Under those assumptions, the average of the sequence of Markov moves according to the approximate kernel converges to the target in total variation (and in expectation for bounded functions). With sharp bounds on those distances. I am still a bit worried at the absence of conditions for the approximation to be ergodic.

“…for relatively short path lengths, there should exist a range of values for which aMCMC offers better performance in the compminimax sense.”

The paper also includes computational cost into the picture. Introducing the notion of compminimax error, which is the smallest (total variation) distance among all approximations at a given computational budget. Quite an interesting, innovative, and relevant notion that may however end up being too formal for practical use. And that does not include the time required to construct and calibrate the approximations.

## optimal simulation on a convex set

Posted in R, Statistics with tags convexity, Henri Poincaré, high dimensions, optimal transport, random walk, total variation, Université Paris Dauphine on February 4, 2016 by xi'an**T**his morning, we had a jam session at the maths department of Paris-Dauphine where a few researchers & colleagues of mine presented their field of research to the whole department. Very interesting despite or thanks to the variety of topics, with forays into the three-body problem(s) [and Poincaré‘s mistake], mean fields for Nash equilibrium (or how to exit a movie theatre), approximate losses in machine learning and so on. Somehow, there was some unity as well through randomness, convexity and optimal transport. One talk close to my own interests was obviously the study of simulation within convex sets by Joseph Lehec from Paris-Dauphine [and Sébastien Bubeck & Ronen Eldan] as they had established a total variation convergence result at a speed only increasing polynomially with the dimension. The underlying simulation algorithm is rather theoretical in that it involves random walk (or Langevin corrected) moves where any excursion outside the convex support is replaced with its projection on the set. Projection that may prove pretty expensive to compute if the convex set is defined for instance as the intersection of many hyperplanes. So I do not readily see how the scheme can be recycled into a competitor to a Metropolis-Hastings solution in that the resulting chain hits the boundary from time to time. With the same frequency over iterations. A solution is to instead use Metropolis-Hastings of course, while another one is to bounce on the boundary and then correct by Metropolis-Hastings… The optimal scales in the three different cases are quite different, from √d in the Metropolis-Hastings cases to d√d in the projection case. (I did not follow the bouncing option to the end, as it lacks a normalising constant.) Here is a quick and not particularly helpful comparison of the exploration patterns of both approaches in dimension 50 for the unit sphere and respective scales of 10/d√d [blue] and 1/√d [gold].

## How quickly does randomness appear?

Posted in Statistics, University life with tags convergence, geometric ergodicity, La Recherche, Metropolis-Hastings algorithms, Nice, randomness, spectral gap, total variation, uniform ergodicity on November 10, 2011 by xi'an**T**his was the [slightly off-key] title of the math column in the November issue of *La Recherche*, in any case intriguing enough for me to buy this general public science magazine on the metro platform and to read it immediately while waiting for an uncertain train, thanks to the nth strike of the year on my train line… But this was the occasion for an exposition of the Metropolis algorithm in a general public journal! The column actually originated from a recently published paper by Persi Diaconis, Gilles Lebeaux, and Laurent Michel, *Geometric analysis for the Metropolis algorithm on Lipschitz domain*, in ** Inventiones Mathematicae** [one of the top pure math journals]. The column in

*La Recherche*described the Metropolis algorithm (labelled there a

*random walk on Markov chains*!), alluded to the use of MCMC methods in statistics, told the genesis of the paper [namely the long-term invitation of Persi Diaconis in Nice a few years ago] and briefly explained the convergence result, namely the convergence of the Metropolis algorithm to the stationary measure at a geometric rate, with an application to the non-overlapping balls problem.

**I**f you take a look at the paper, you will see it is a beautiful piece of mathematics, establishing a spectral gap on the Markov operator associated with the Metropolis algorithm and deducing a uniformly geometric convergence [in total variation] for most regular-and-bounded-support distributions. A far from trivial and fairly general result. *La Recherche* however fails to mention the whole corpus of MCMC convergence results obtained in the 1990’s and 2000’s, by many authors, incl. Richard Tweedie, Gareth Roberts, Jeff Rosenthal, Eric Moulines, Gersende Fort, Randal Douc, Kerrie Mengersen, and others…

## An alternative proof of DA convergence

Posted in Statistics with tags Data augmentation, Gibbs, MCMC, simulation, Tanner, total variation, Wong on September 17, 2009 by xi'an**I** came across a curio today while looking at recent postings on arXiv, namely a different proof of the convergence of the Data Augmentation algorithm, more than twenty years after it was proposed by Martin Tanner and Wing Hung Wong in a 1987 JASA paper… The convergence under the positivity condition is of course a direct consequence of the ergodic theorem, as shown for instance by Tierney (1994), but the note by Yaming Yu uses instead a Kullback divergence

and shows as Liu, Wong and Kong do for the variance (Biometrika, 1994) that this divergence is monotonically decreasing in *t*. The proof is interesting in that only functional (i.e., non-ergodic) arguments are used, even though I am a wee surprised at ** IEEE Transactions on Information Theory** publishing this type of arcane mathematics… Note that the above divergence is the “wrong” one in that it measures the divergence from , not from . The convergence thus involves a sequence of divergences rather than a single one. (Of course, this has no consequence on the corollary that the total variation distance goes to zero.)