## séminaire parisien de statistique [09/01/23]

Posted in Books, pictures, Statistics, University life on January 22, 2023 by xi'an

I had missed the séminaire parisien de statistique for most of the Fall semester, hence was determined to attend the first session of the year 2023, all the more because the talks were close to my interests. To wit, Chiara Amorino spoke about particle systems for McKean-Vlasov SDEs, when those are parameterised by several parameters and observed repeatedly in discretised versions, hereby establishing the consistency of a contrast estimator of these parameters. I was initially confused by the mention of interacting particles, since the work is not at all related with simulation. Just wondering whether this contrast could prove useful for a likelihood-free approach in building a Gibbs distribution?
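Out of curiosity, here is a minimal sketch of such a setting, with a toy mean-field drift and a crude least-squares contrast of my own devising (not the model or estimator of the talk): the drift pulls each particle towards the empirical mean of the system, standing in for the McKean-Vlasov interaction.

```python
import numpy as np

def mckean_vlasov_particles(theta, n_particles=200, n_steps=500, dt=0.01,
                            sigma=1.0, rng=None):
    """Euler scheme for a toy mean-field (McKean-Vlasov) particle system
    dX_t = -theta (X_t - E[X_t]) dt + sigma dW_t,
    with E[X_t] replaced by the empirical mean of the N particles.
    (Toy drift chosen for illustration; not the model of the talk.)"""
    rng = np.random.default_rng(rng)
    x = rng.standard_normal(n_particles)
    path = [x.copy()]
    for _ in range(n_steps):
        drift = -theta * (x - x.mean())  # interaction through the empirical mean
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_particles)
        path.append(x.copy())
    return np.array(path)                # (n_steps+1, n_particles) discretised observations

def contrast(theta, path, dt=0.01):
    """Least-squares-type contrast on the discretised increments, to be
    minimised in theta (a crude stand-in for the estimator discussed)."""
    drift = -theta * (path[:-1] - path[:-1].mean(axis=1, keepdims=True))
    increments = path[1:] - path[:-1]
    return ((increments - drift * dt) ** 2).sum()
```

Minimising this contrast over θ on simulated data does recover a value near the truth, at least in this toy case.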

Valentin de Bortoli then spoke on diffusion Schrödinger bridges for generative models, which allowed me to better my understanding of this idea presented by Arnaud at the Flatiron workshop last November. The presentation here was quite different, using a forward versus backward explanation via a sequence of transforms that end up approximately Gaussian, once more reminiscent of sequential Monte Carlo. The transforms are themselves approximate Gaussian versions relying on a discretised Ornstein-Uhlenbeck process, with a missing score term since said score involves a marginal density at each step of the sequence. It can be represented [as below] as an expectation conditional on the (observed) variate at time zero (with a connection with Hyvärinen's NCE / score matching!). Practical implementation is done via neural networks.
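To make the score representation concrete, here is a small numerical check of the identity in a case solvable by hand (my own toy illustration, not from the talk): for an Ornstein-Uhlenbeck forward process started from a Gaussian, the marginal score at time t equals the conditional expectation of the transition score given the noised value, estimated below by self-normalised importance weighting.

```python
import numpy as np

rng = np.random.default_rng(0)
t, m, s = 0.5, 1.0, 0.7                       # diffusion time, data mean / sd
a, v2 = np.exp(-t), 1 - np.exp(-2 * t)        # OU transition: x_t | x_0 ~ N(a x_0, v2)

# marginal of x_t when x_0 ~ N(m, s^2) is Gaussian with these moments
mt, vt = a * m, a**2 * s**2 + v2
true_score = lambda x: -(x - mt) / vt         # closed-form marginal score

def mc_score(x, n=200_000):
    """Estimate the score as E[ d/dx log p(x|x_0) | x_t = x ], a posterior
    expectation of the transition score, via self-normalised weighting
    of draws from the data distribution."""
    x0 = rng.normal(m, s, n)                   # draws from the data distribution
    w = np.exp(-(x - a * x0) ** 2 / (2 * v2))  # proportional to p(x | x_0)
    cond_score = -(x - a * x0) / v2            # score of the transition kernel
    return (w * cond_score).sum() / w.sum()

print(true_score(0.3), mc_score(0.3))          # the two values should agree
```

In the generative-model setting this conditional expectation is of course intractable and is fitted by a neural network instead.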

Last but not least!, my friend Randal talked about his Kick-Kac formula, which connects with the one we considered in our 2004 paper with Jim Hobert. While I had heard earlier versions, this talk was mostly on probability aspects and highly enjoyable, as he included some short proofs. The formula expresses the stationary probability measure π of the original Markov chain in terms of explorations between two visits to an accessible set C, more general than a small set, with at first an annoying remainder term, due to the set not being Harris recurrent, which eventually cancels out. Memoryless transportation can be implemented because C is free for the picking, for instance the set where the target is bounded by a manageable density, allowing for an accept-reject step. The resulting chain is non-reversible. However, due to the difficulty of simulating from the target restricted to C, a second and parallel Markov chain is instead created. Performance, unsurprisingly, depends on the choice of C, but it can be adapted to the target on the go.
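For intuition, a Kac-type representation is easy to check on a toy finite chain: the stationary probability of a set A is the expected time spent in A during an excursion between two visits to C, divided by the expected return time to C. A minimal sketch of my own (with a singleton C, rather than the more general sets of the talk):

```python
import numpy as np

# Toy finite chain: pi(A) equals the expected time spent in A between two
# visits to C, normalised by the expected return time to C (Kac's formula).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
C = {0}                                       # the accessible set used to cut the path

rng = np.random.default_rng(1)
x, path = 0, [0]
for _ in range(200_000):
    x = rng.choice(3, p=P[x])
    path.append(x)
path = np.array(path)

visits = np.flatnonzero(np.isin(path, list(C)))       # times of visits to C
lengths = np.diff(visits)                              # excursion lengths
time_in_A = [(path[s:e] == 1).sum() for s, e in zip(visits[:-1], visits[1:])]
kac_estimate = np.mean(time_in_A) / np.mean(lengths)   # estimate of pi({1})

# exact stationary distribution for comparison, via the Perron eigenvector
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()
print(kac_estimate, pi[1])
```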

## of first importance

Posted in Books, Kids, Statistics, University life on June 14, 2022 by xi'an

My PhD student Charly Andral came to me with the question of the birthdate of importance sampling. I was under the impression that it had been created at the same time as the plain Monte Carlo method, being essentially the same thing since

$\int_{\mathfrak X} h(x)f(x)\,\text dx = \int_{\mathfrak X} h(x)\frac{f(x)}{g(x)}g(x)\,\text dx$

hence due to von Neumann or Ulam, but he could not find a reference earlier than a 1949 proceedings publication by Herman Kahn, in a seminar on scientific computation run by IBM. Despite writing a series of Monte Carlo papers in the late 1940's and 1950's, Kahn is not well-known in these circles (although mentioned in Fishman's book), while being popular to some extent for his theorisation of nuclear war escalation and deterrence. (I wonder if the concept is developed in some of his earlier 1948 papers. In a 1951 paper with Goertzel, a footnote signals that the approach was called quota sampling in their earlier papers. Charly has actually traced the earliest proposal as being Kahn's, in a 14 June 1949 RAND preprint, beating Goertzel's Oak Ridge National Laboratory preprint on quota sampling and importance functions by five days.)
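The identity is straightforward to check numerically; a minimal sketch with a standard normal target f, a wider normal instrumental g, and h(x)=x², so that the integral equals one:

```python
import numpy as np

rng = np.random.default_rng(0)
h = lambda x: x**2                             # integrand, E_f[h(X)] = 1 for f = N(0,1)
n = 100_000
x = rng.normal(0, 2, n)                        # draws from the instrumental g = N(0, 2^2)
w = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)     # f(x)
w /= np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)    # divided by g(x): importance weights
is_estimate = np.mean(h(x) * w)                # unbiased estimate of the integral
print(is_estimate)
```

The wider instrumental keeps the weights bounded, hence the estimator well-behaved; a narrower g would be another story.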

(As a further marginalia, Kahn wrote with T.E. Harris an earlier preprint on Monte Carlo methods in April 1949, the same Harris as in Harris recurrence.)

## slice sampling revisited

Posted in Books, pictures, Statistics on April 15, 2016 by xi'an

Thanks to an X validated question, I re-read Radford Neal’s 2003 Slice sampling paper. Which is an Annals of Statistics discussion paper, and rightly so. While I was involved in the editorial processing of this massive paper (!), I had only vague memories left about it. Slice sampling has this appealing feature of being the equivalent of random walk Metropolis-Hastings for Gibbs sampling, without the drawback of setting a scale for the moves.

“These slice sampling methods can adaptively change the scale of changes made, which makes them easier to tune than Metropolis methods and also avoids problems that arise when the appropriate scale of changes varies over the distribution (…) Slice sampling methods that improve sampling by suppressing random walks can also be constructed.” (p.706)

One major theme in the paper is fighting random walk behaviour, of which Radford is a strong proponent. Even at the present time, I am a bit surprised by this feature as component-wise slice sampling is exhibiting clear features of a random walk, exploring the subgraph of the target by random vertical and horizontal moves. Hence facing the potential drawback of backtracking to previously visited places.
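As a reminder of these vertical and horizontal moves, here is the exact single-variable slice sampler in a toy case where the slice is available in closed form (standard normal target):

```python
import numpy as np

def slice_sampler(n, x0=0.0, rng=None):
    """Exact single-variable slice sampler for the (unnormalised) standard
    normal target f(x) = exp(-x^2/2), whose slice {x : u < f(x)} is the
    explicit interval (-b, b) with b = sqrt(-2 log u)."""
    rng = np.random.default_rng(rng)
    x, out = x0, np.empty(n)
    for i in range(n):
        u = rng.uniform(0, np.exp(-x**2 / 2))  # vertical move: height under the density
        b = np.sqrt(-2 * np.log(u))            # horizontal slice is (-b, b)
        x = rng.uniform(-b, b)                 # horizontal move: uniform on the slice
        out[i] = x
    return out

draws = slice_sampler(50_000, rng=0)
print(draws.mean(), draws.var())               # close to 0 and 1
```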

“A Markov chain consisting solely of overrelaxed updates might not be ergodic.” (p.729)

Overrelaxation is presented as a means to avoid the random walk behaviour by removing rejections. The proposal is actually deterministic, projecting the current value to the “other side” of the approximate slice, and accepted if it stays within the slice. This “reflection principle” [in that it takes the symmetric point with respect to the centre of the slice] is also connected with antithetic sampling in that it induces rather negative correlation between the successive simulations. The last methodological section covers reflective slice sampling, which appears as a slice version of Hamiltonian Monte Carlo (HMC). Given the difficulty in implementing exact HMC (reflected in the later literature), it is no wonder that Radford proposes an approximation scheme that is valid if somewhat involved.
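The reflection is easy to illustrate in a toy case where the slice endpoints are exact, namely the Exp(1) target, for which the slice is the interval (0, −log u): the reflected point then always falls within the slice and is always accepted. (Amusingly, in this special case the reflected value −log u − x reduces to an independent Exp(1) draw, so the illustration is degenerate but checkable.)

```python
import numpy as np

def overrelaxed_slice(n, x0=1.0, rng=None):
    """Overrelaxed slice updates for the Exp(1) target f(x) = exp(-x), x > 0.
    The slice {x : u < f(x)} is the interval (0, -log u), known exactly, so
    the reflection L + R - x through the slice centre is always accepted."""
    rng = np.random.default_rng(rng)
    x, out = x0, np.empty(n)
    for i in range(n):
        u = rng.uniform(0, np.exp(-x))   # vertical move
        L, R = 0.0, -np.log(u)           # exact slice endpoints
        x = L + R - x                    # deterministic reflection through the centre
        out[i] = x
    return out

draws = overrelaxed_slice(100_000, rng=0)
print(draws.mean())                      # Exp(1) mean is 1
```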

“We can show invariance of this distribution by showing (…) detailed balance, which for a uniform distribution reduces to showing that the probability density for x¹ to be selected as the next state, given that the current state is x⁰, is the same as the probability density for x⁰ to be the next state, given that x¹ is the current state, for any states x⁰ and x¹ within [the slice] S.” (p.718)

In direct connection with the X validated question there is a whole section of the paper on implementing single-variable slice sampling that I had completely forgotten, with a collection of practical implementations when the slice

$S=\{x;\ u < f(x)\}$

cannot be computed in an exact manner. Like the “stepping out” procedure. The resulting set (interval) where the uniform simulation in x takes place may well miss some connected component(s) of the slice. This quote may sound like a strange argument in that the move may well leave a part of the slice off and still satisfy this condition. Not really, since it states that it must hold for any pair of states within S… The very positive side of this section is to allow for slice sampling in cases where the inversion of u < f(x) is intractable, hence with a strong practical implication. The multivariate extension of the approximation procedure is (potentially) more fraught with danger in that it may fall victim to a curse of dimension, in that the box for the uniform simulation of x may be much too large when compared with the true slice (or slice of the slice). I had more of a memory of the “trail of crumbs” idea, mostly because of the name I am afraid!, which links with delayed rejection, as indicated in the paper, but seems awfully delicate to calibrate.
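For the record, here is a sketch of one such update combining Radford's stepping-out procedure with the shrinkage step, following my reading of the paper (the width w and the cap m below are arbitrary choices of mine):

```python
import numpy as np

def slice_step(x0, logf, w=1.0, m=50, rng=None):
    """One single-variable slice sampling update using Neal's (2003)
    stepping-out procedure plus shrinkage, for when the slice
    {x : u < f(x)} cannot be inverted analytically."""
    rng = np.random.default_rng(rng)
    logu = logf(x0) + np.log(rng.uniform())   # log-height defining the slice
    # stepping out: expand an interval of width w until both ends leave the slice
    L = x0 - w * rng.uniform()
    R = L + w
    j = rng.integers(m)                       # randomly split the cap on expansions
    k = m - 1 - j
    while j > 0 and logf(L) > logu:
        L -= w; j -= 1
    while k > 0 and logf(R) > logu:
        R += w; k -= 1
    # shrinkage: sample uniformly in [L, R], shrinking towards x0 upon rejection
    while True:
        x1 = rng.uniform(L, R)
        if logf(x1) > logu:
            return x1
        if x1 < x0:
            L = x1
        else:
            R = x1

logf = lambda x: -x**2 / 2                    # standard normal target (unnormalised)
rng = np.random.default_rng(0)
x, draws = 0.0, []
for _ in range(20_000):
    x = slice_step(x, logf, rng=rng)
    draws.append(x)
```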

## lemma 7.3

Posted in Statistics on November 14, 2012 by xi'an

As Xiao-Li Meng agreed to review—and I am quite grateful he managed to fit this review in an already overflowing deanesque schedule!—our 2004 book Monte Carlo Statistical Methods as part of a special book review issue of CHANCE honouring the memory of George thru his books—thanks to Sam Behseta for suggesting this!—, he sent me the following email about one of our proofs—demonstrating how much effort he had put into this review!—:

I however have a question about the proof of Lemma 7.3
on page 273. After the expression of
E[h(x^(1))|x_0], the proof stated "and substitute
Eh(x) for h(x_1)".  I cannot think of any
justification for this substitution, given the whole
purpose is to show h(x) is a constant.

I put it on hold for a while and only looked at it in the (long) flight to Chicago. Lemma 7.3 in Monte Carlo Statistical Methods is the result that the Metropolis-Hastings algorithm is Harris recurrent (and not only recurrent). The proof is based on the characterisation of Harris recurrence as having only constants for harmonic functions, i.e. those satisfying the identity

$h(x) = \mathbb{E}[h(X_t)|X_{t-1}=x]$

The chain being recurrent, the above implies that harmonic functions are almost everywhere constant, and the proof steps from almost everywhere to everywhere. That the substitution above is valid—and I also stumbled upon that very subtlety when re-reading the proof in my plane seat!—is due to the fact that it occurs within an integral: despite sounding like using the result to prove the result, the argument is thus valid! Needless to say, we did not invent this (elegant) proof but took it from one of the early works on the theory of Metropolis-Hastings algorithms, presumably Luke Tierney's foundational Annals paper that we should have quoted…
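For the record, here is my reconstruction of the substitution step (a sketch, to be checked against the actual proof on p.273), writing α for the Metropolis-Hastings acceptance probability, q for the proposal density, and r(x) for the total rejection probability at x:

$h(x) = \mathbb{E}[h(X_1)|X_0=x] = \int h(y)\,\alpha(x,y)\,q(y|x)\,\text dy + r(x)\,h(x)$

hence

$h(x)\,\{1-r(x)\} = \int h(y)\,\alpha(x,y)\,q(y|x)\,\text dy = \mathbb{E}_f[h]\,\{1-r(x)\}$

where the middle equality is the substitution in question: it is legitimate because h only appears within the integral and equals $\mathbb{E}_f[h]$ almost everywhere (modulo the absolute-continuity caveat that the null set is not charged by $\alpha(x,\cdot)\,q(\cdot|x)$). It follows that h(x) equals $\mathbb{E}_f[h]$ for every x with r(x)<1.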

As pointed out by Xiao-Li, the proof is also confusing for the use of two notations for the expectation (one of which is indexed by f and the other corresponding to the Markov transition) and for the change in the meaning of f, now the stationary density, when compared with Theorem 6.80.