## Metropolis in 95 characters

Posted in pictures, R, Statistics, Travel with tags , , , , , , , , on January 2, 2020 by xi'an Here is an R function that produces a Metropolis-Hastings sample for the univariate log-target f when the later is defined outside as another function. And when using a Gaussian random walk with scale one as proposal. (Inspired from a X validated question.)

```m<-function(T,y=rnorm(1))ifelse(rep(T>1,T),
c(y*{f({z<-m(T-1)})-f(y+z)<rexp(1)}+z,z),y)
```

The function is definitely not optimal, crashes for values of T larger than 580 (unless one modifies the stack size), and operates the most basic version of a Metropolis-Hastings algorithm. But as a codegolf challenge (on a late plane ride), this was a fun exercise.

## random wake

Posted in Books, Kids, R, Statistics with tags , , , , , , on December 27, 2017 by xi'an Just too often on X validated, one sees questions displaying a complete ignorance of the basics that makes one purposelessly wonder what is the point of trying to implement advanced methods when missing the necessary background. And just as often, I reacted to the question by wondering out loud about this… In the current case, the question was about debugging an R code for a mixture of two exponential distributions and the random walk Metropolis algorithm that comes with it. Except that the Normal noise was replaced with a Uniform U(0,1) noise, leading to a most obviously transient Markov chain. I eventually corrected the R code, which returned a rather nicely label-switching output. And which did not necessarily help with the comprehension of the fundamentals.

## normal variates in Metropolis step

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , , on November 14, 2017 by xi'an A definitely puzzled participant on X validated, confusing the Normal variate or variable used in the random walk Metropolis-Hastings step with its Normal density… It took some cumulated efforts to point out the distinction. Especially as the originator of the question had a rather strong a priori about his or her background:

“I take issue with your assumption that advice on the Metropolis Algorithm is useless to me because of my ignorance of variates. I am currently taking an experimental course on Bayesian data inference and I’m enjoying it very much, i believe i have a relatively good understanding of the algorithm, but i was unclear about this specific.”

despite pondering the meaning of the call to rnorm(1)… I will keep this question in store to use in class when I teach Metropolis-Hastings in a couple of weeks.

## puzzled by harmony [not!]

Posted in Books, Kids, Mountains, pictures, R, Running, Statistics, Travel with tags , , , , , on December 13, 2016 by xi'an In answering yet another question on X validated about the numerical approximation of the marginal likelihood, I suggested using an harmonic mean estimate as a simple but worthless solution based on an MCMC posterior sample. This was on a toy example with a uniform prior on (0,π) and a “likelihood” equal to sin(θ) [really a toy problem!]. Simulating an MCMC chain by a random walk Metropolis-Hastings algorithm is straightforward, as is returning the harmonic mean of the sin(θ)’s.

``````f <- function(x){
if ((0<x)&(x<pi)){
return(sin(x))}else{
return(0)}}

n = 2000 #number of iterations
sigma = 0.5
x = runif(1,0,pi) #initial x value
chain = fx = f(x)
#generates an array of random x values from norm distribution
rands = rnorm(n,0, sigma)
#Metropolis - Hastings algorithm
for (i in 2:n){
can = x + rands[i]  #candidate for jump
fcan=f(can)
aprob = fcan/fx #acceptance probability
if (runif(1) < aprob){
x = can
fx = fcan}
chain=c(chain,fx)}
I = pi*length(chain)/sum(1/chain) #integral harmonic approximation``````

However, the outcome looks remarkably stable and close to the expected value 2/π, despite 1/sin(θ) having an infinite integral on (0,π). Meaning that the average of the 1/sin(θ)’s has no variance. Hence I wonder why this specific example does not lead to an unreliable output… But re-running the chain with a smaller scale σ starts producing values of sin(θ) regularly closer to zero, which leads to an estimate of I both farther away from 2 and much more variable. No miracle, in the end!

## two correlated pseudo-marginals for the price of one!

Posted in Books, Statistics, University life with tags , , , , , , , , , on November 30, 2015 by xi'an

Within two days, two papers on using correlated auxiliary random variables for pseudo-marginal Metropolis-Hastings, one by George Deligiannidis, Arnaud Doucet, Michael Pitt, and Robert Kohn, and another one by Johan Dahlin, Fredrik Lindsten, Joel Kronander, and Thomas Schön! They both make use of the Crank-Nicholson proposal on the auxiliary variables, which is a shrunk Gaussian random walk or an autoregressive model of sorts, and possibly transform these auxiliary variables by inverse cdf or something else.

“Experimentally, the efficiency of computations is increased relative to the standard pseudo-marginal algorithm by up to 180-fold in our simulations.” Deligiannidis et al.

The first paper by Deligiannidis et al. aims at reducing the variance of the Metropolis-Hastings acceptance ratio by correlating the auxiliary variables. While the auxiliary variable can be of any dimension, all that matters is its transform into a (univariate) estimate of the density, used as pseudo-marginal at each iteration. However, if a Markov kernel is used for proposing new auxiliary variables, the sequence of the pseudo-marginal is no longer a Markov chain. Which implies looking at the auxiliary variables. Nonetheless, they manage to derive a CLT on the log pseudo-marginal estimate, for a latent variable model with sample size growing to infinity. Based on this CLT, they control the correlation in the Crank-Nicholson proposal so that the asymptotic variance  is of order one:  this correlation has to converge to 1 as 1-exp(-χN/T), where N is the number of Monte Carlo samples for T time intervals. Those results extend to the bootstrap particle filter. Upon reflection, it makes much sense aiming for a high correlation since the unbiased estimator of the target hardly changes from one iteration to the next (but needs to move for the method to be validated by Metropolis-Hastings arguments). The two simulation experiments showed massive gains following this scheme, as reported in the above quote.

“[ABC] can be used to approximate the log-likelihood using importance sampling and particle filtering. However, in practice these estimates suffer from a large variance with results in bad mixing for the Markov chain.” Dahlin et al.

In the second paper, which came a day later, presumably induced by the first paper, acknowledged from the start, the authors also make use of the Crank-Nicholson proposal on the auxiliary variables, which is a shrunk Gaussian random walk, and possibly transform these auxiliary variables by inverse cdf or something else. The central part of the paper is about tuning the scale in the Crank-Nicholson proposal, in the spirit of Gelman, Gilks and Roberts (1996). Since all that matters is the (univariate) estimate of the density used as pseudo-marginal, The authors approximate the law of the log-density by a Gaussian distribution, despite the difficulty with the “projected” Markov chain, thus looking for the optimal scaling but only achieving a numerical optimisation rather than the equivalent of the golden number of MCMC acceptance probabilities, 0.234. Although in a sense the above should be the goal for the auxiliary variable acceptance rate, when those are of high enough dimension. One thing I could not find in this paper is how much (generic) improvement is gathered from this modification from an iid version. (Another is linked with the above quote, which I find difficult to understand as ABC is not primarily intended as a pseudo-marginal method. In a sense it is the worst possible pseudo-marginal estimator in that it uses estimators taking values in {0,1}…)

## debunking a (minor and personal) myth

Posted in Books, Kids, R, Statistics, University life with tags , , , , on September 9, 2015 by xi'an For quite a while, I entertained the idea that Beta and Dirichlet proposals  were more adequate than (log-)normal random walks proposals for parameters on (0,1) and simplicia (simplices, simplexes), respectively, when running an MCMC. For instance, for p in (0,1) the value of the Markov chain at time t-1, the proposal at time t could be a Be(εp,ε{1-p}) generator, since its mean is equal to p and its variance is proportional to 1/(1+ε). (Although I cannot find track of this notion in my books.) The parameter ε can be calibrated towards a given acceptance rate, like the golden number 0.234 of Gelman, Gilks and Roberts (1996). However, when using this proposal on a mixture model, Kaniav Kamari and myself realised today that there is a catch, namely that pushing ε down to achieve an acceptance rate near 0.234 may end up in disaster, since the parameters of the Beta or of the Dirichlet may become lower than 1, which implies an infinite explosion on some boundaries of the parameter space. An explosion that gets more and more serious as ε decreases to zero. Hence is more and more likely to decrease the acceptance rate, thus to reduce ε, which in turns concentrates even more the support on the boundary and leads to a vicious circle and no convergence to the target acceptance rate… Continue reading

Posted in Books, Statistics, University life with tags , , , , , , on May 15, 2015 by xi'an

[Here is a reply by Pierre Druihlet to my comments on his paper.]

There are several goals in the paper, the last one being the most important one.

The first one is to insist that considering θ as a parameter is not appropriate. We are in complete agreement on that point, but I prefer considering l(θ) as the parameter rather than N, mainly because it is much simpler. Knowing N, the law of l(θ) is given by the law of a random walk with 0 as reflexive boundary (Jaynes in his book, explores this link). So for a given prior on N, we can derive a prior on l(θ). Since the random process that generate N is completely unknown, except that N is probably large, the true law of l(θ) is completely unknown, so we may consider l(θ).

The second one is to state explicitly that a flat prior on θ implies an exponentially increasing prior on l(θ). As an anecdote, Stone, in 1972, warned against this kind of prior for Gaussian models. Another interesting anecdote is that he cited the novel by Abbot “Flatland : a romance of many dimension” who described a world where the dimension is changed. This is exactly the case in the FP since θ has to be seen in two dimensions rather than in one dimension.

The third one is to make a distinction between randomness of the parameter and prior distribution, each one having its own rule. This point is extensively discussed in Section 2.3.
– In the intuitive reasoning, the probability of no annihilation involves the true joint distribution on (θ, x) and therefore the true unknown distribution of θ,.
– In the Bayesian reasoning, the posterior probability of no annihilation is derived from the prior distribution which is improper. The underlying idea is that a prior distribution does not obey probability rules but belongs to a projective space of measure. This is especially true if the prior does not represent an accurate knowledge. In that case, there is no discontinuity between proper and improper priors and therefore the impropriety of the distribution is not a key point. In that context, the joint and marginal distributions are irrelevant, not because the prior is improper, but because it is a prior and not a true law. If the prior were the true probability law of θ,, then the flat distribution could not be considered as a limit of probability distributions.

For most applications, the distinction between prior and probability law is not necessary and even pedantic, but it may appear essential in some situations. For example, in the Jeffreys-Lindley paradox, we may note that the construction of the prior is not compatible with the projective space structure.