**I**n continuation of my earlier post on Bayesian GANs, which resort to strongly incompatible conditionals, I read a 2015 paper of Chen and Ip that I had missed. (Published in the Journal of Statistical Computation and Simulation which I first confused with JCGS and which I do not know at all. Actually, when looking at its editorial board, I recognised only one name.) But the study therein is quite disappointing and not helping as it considers Markov chains on finite state spaces, meaning that the transition distributions are matrices, meaning also that convergence is ensured if these matrices have no null probability term. And while the paper is motivated by realistic situations where incompatible conditionals can reasonably appear, the paper only produces illustrations on two and three states Markov chains. Not that helpful, in the end… The game is still afoot!

## Archive for Markov chains

## Gibbs for incompatible kids

Posted in Books, Statistics, University life with tags Bayesian GANs, convergence of Gibbs samplers, GANs, Gibbs for Kids, Gibbs sampling, irreducibility, JCGS, Markov chains, MCMC algorithms, Monte Carlo Statistical Methods, stationarity on September 27, 2018 by xi'an## adaptive independent Metropolis-Hastings

Posted in Statistics with tags adaptive MCMC, convergence, Doeblin's condition, independent Metropolis-Hastings algorithm, Markov chains, MCMC algorithms, Wolfgang Doeblin on May 8, 2018 by xi'an**W**hen rereading this paper by Halden et al. (2009), I was reminded of the earlier and somewhat under-appreciated Gåsemyr (2003). But I find the convergence results therein rather counter-intuitive in that they seem to justify adaptive independent proposals with no strong requirement. Besides the massive Doeblin condition:

“The Doeblin condition essentially requires that all the proposal distribution [sic] has uniformly heavier tails than the target distribution.”

Even when the adaptation is based on an history vector made of rejected values and non-replicated accepted values. Actually convergence of this sequence of adaptive proposals kernels is established under a concentration of the Doeblin constants a¹,a²,… towards one, in the sense that

**E**[(1-a¹)(1-a²)…]=0.

The reason may be that, with chains satisfying a Doeblin condition, there is a probability to reach stationarity at each step. Equal to a¹, a², … And hence to ignore adaptivity since each kernel keep the target π invariant. So in the end this is not so astounding. (The paper also reminded me of Wolfgang [or Vincent] Doeblin‘s short and tragic life.)

## Gibbs for kidds

Posted in Books, Kids, Statistics, University life with tags Aarhus, animal breeder, auto-exponential model, Bath, convergence of Gibbs samplers, cross validated, Denmark, George Casella, Gibbs for Kids, Gibbs for pigs, Gibbs sampling, irreducibility, Julian Besag, Markov chains, recurrence, The American Statistician on February 12, 2018 by xi'an

**A** chance (?) question on X validated brought me to re-read Gibbs for Kids, 25 years after it was written (by my close friends George and Ed). The originator of the question had difficulties with the implementation, apparently missing the cyclic pattern of the sampler, as in equations (2.3) and (2.4), and with the convergence, which is only processed for a finite support in the American Statistician paper. The paper [which did not appear in American Statistician under this title!, but inspired an animal bredeer, Dan Gianola, to write a “Gibbs for pigs” presentation in 1993 at the 44th Annual Meeting of the European Association for Animal Production, Aarhus, Denmark!!!] most appropriately only contains toy examples since those can be processed and compared to know stationary measures. This is for instance the case for the auto-exponential model

which is only defined as a probability density for a compact support. (The paper does not identify the model as a special case of auto-exponential model, which apparently made the originator of the model, Julian Besag in 1974, unhappy, as George and I found out when visiting Bath, where Julian was spending the final year of his life, many years later.) I use the limiting case all the time in class to point out that a Gibbs sampler can be devised and operate without a stationary probability distribution. However, being picky!, I would like to point out that, contrary, to a comment made in the paper, the Gibbs sampler does not “fail” but on the contrary still “converges” in this case, in the sense that a conditional ergodic theorem applies, i.e., the ratio of the frequencies of visits to two sets A and B with finite measure do converge to the ratio of these measures. For instance, running the Gibbs sampler 10⁶ steps and ckecking for the relative frequencies of x’s in (1,2) and (1,3) gives 0.685, versus log(2)/log(3)=0.63, since 1/x is the stationary measure. One important and influential feature of the paper is to stress that proper conditionals do not imply proper joints. George would work much further on that topic, in particular with his PhD student at the time, my friend Jim Hobert.

With regard to the convergence issue, Gibbs for Kids points out to Schervish and Carlin (1990), which came quite early when considering Gelfand and Smith published their initial paper the very same year, but which also adopts a functional approach to convergence, along the paper’s fixed point perspective, somehow complicating the matter. Later papers by Tierney (1994), Besag (1995), and Mengersen and Tweedie (1996) considerably simplified the answer, which is that *irreducibility* is a necessary and sufficient condition for convergence. (Incidentally, the reference list includes a technical report of mine’s on latent variable model MCMC implementation that never got published.)

## flea circus

Posted in Books, Kids, pictures, R, Statistics with tags coupling, cross validated, fleas, Leonhard Euler, Markov chains, Markov random field, Monte Carlo integration, occupancy, Poisson approximation, R, random walk, stationarity on December 8, 2016 by xi'an**A**n old riddle found on X validated asking for Monte Carlo resolution but originally given on Project Euler:

A 30×30 grid of squares contains 30² fleas, initially one flea per square. When a bell is rung, each flea jumps to an adjacent square at random. What is the expected number of unoccupied squares after 50 bell rings, up to six decimal places?

The debate on X validated is whether or not a Monte Carlo resolution is feasible. Up to six decimals, certainly not. But with some lower precision, certainly. Here is a rather basic R code where the 50 steps are operated on the 900 squares, rather than the 900 fleas. This saves some time by avoiding empty squares.

xprmt=function(n=10,T=50){ mean=0 for (t in 1:n){ board=rep(1,900) for (v in 1:T){ beard=rep(0,900) if (board[1]>0){ poz=c(0,1,0,30) ne=rmultinom(1,board[1],prob=(poz!=0)) beard[1+poz]=beard[1+poz]+ne} # for (i in (2:899)[board[-1][-899]>0]){ u=(i-1)%%30+1;v=(i-1)%/%30+1 poz=c(-(u>1),(u<30),-30*(v>1),30*(v<30)) ne=rmultinom(1,board[i],prob=(poz!=0)) beard[i+poz]=beard[i+poz]+ne} # if (board[900]>0){ poz=c(-1,0,-30,0) ne=rmultinom(1,board[900],prob=(poz!=0)) beard[900+poz]=beard[900+poz]+ne} board=beard} mean=mean+sum(board==0)} return(mean/n)}

The function returns an empirical average over n replications. With a presumably awkward approach to the borderline squares, since it involves adding zeros to keep the structure the same… Nonetheless, it produces an approximation that is rather close to the approximate expected value, in about 3mn on my laptop.

> exprmt(n=1e3) [1] 331.082 > 900/exp(1) [1] 331.0915

Further gains follow from considering only half of the squares, as there are two independent processes acting in parallel. I looked at an alternative and much faster approach using the stationary distribution, with the stationary being the Multinomial (450,(2/1740,3/1740…,4/1740,…,2/1740)) with probabilities proportional to 2 in the corner, 3 on the sides, and 4 in the inside. (The process, strictly speaking, has *no stationary distribution*, since it is periodic. But one can consider instead the subprocess indexed by even times.) This seems to be the case, though, when looking at the occupancy frequencies, after defining the stationary as:

inva=function(B=30){ return(c(2,rep(3,B-2),2,rep(c(3, rep(4,B-2),3),B-2),2,rep(3,B-2),2))}

namely

> mn=0;n=1e8 #14 clock hours! > proz=rep(c(rep(c(0,1),15),rep(c(1,0),15)),15)*inva(30) > for (t in 1:n) + mn=mn+table(rmultinom(1,450,prob=rep(1,450)))[1:4] > mn=mn/n > mn[1]=mn[1]-450 > mn 0 1 2 3 166.11 164.92 82.56 27.71 > exprmt(n=1e6) #55 clock hours!! [1] 165.36 165.69 82.92 27.57

my original confusion being that the Poisson approximation had not yet taken over… (Of course, computing the first frequency for the stationary distribution does not require any simulation, since it is the sum of the complement probabilities to the power 450, i.e., 166.1069.)

## debunking a (minor and personal) myth

Posted in Books, Kids, R, Statistics, University life with tags adaptive MCMC methods, beta distribution, Dirichlet prior, Gaussian random walk, Markov chains on September 9, 2015 by xi'an**F**or quite a while, I entertained the idea that Beta and Dirichlet proposals were more adequate than (log-)normal random walks proposals for parameters on (0,1) and simplicia (simplices, simplexes), respectively, when running an MCMC. For instance, for p in (0,1) the value of the Markov chain at time t-1, the proposal at time t could be a Be(εp,ε{1-p}) generator, since its mean is equal to p and its variance is proportional to 1/(1+ε). (Although I cannot find track of this notion in my books.) The parameter ε can be calibrated towards a given acceptance rate, like the golden number 0.234 of Gelman, Gilks and Roberts (1996). However, when using this proposal on a mixture model, Kaniav Kamari and myself realised today that there is a catch, namely that pushing ε down to achieve an acceptance rate near 0.234 may end up in disaster, since the parameters of the Beta or of the Dirichlet may become lower than 1, which implies an infinite explosion on some boundaries of the parameter space. An explosion that gets more and more serious as ε decreases to zero. Hence is more and more likely to decrease the acceptance rate, thus to reduce ε, which in turns concentrates even more the support on the boundary and leads to a vicious circle and no convergence to the target acceptance rate… Continue reading

## light and widely applicable MCMC: approximate Bayesian inference for large datasets

Posted in Books, Statistics, University life, Wines with tags ABC, big data, character recognition, delayed acceptance, Dublin, Ireland, Markov chains, MCMC algorithm, reversible jump MCMC, Russian roulette, subsampling on March 24, 2015 by xi'an**F**lorian Maire (whose thesis was discussed in this post), Nial Friel, and Pierre Alquier (all in Dublin at some point) have arXived today a paper with the above title, aimed at quickly analysing large datasets. As reviewed in the early pages of the paper, this proposal follows a growing number of techniques advanced in the past years, like pseudo-marginals, Russian roulette, unbiased likelihood estimators. firefly Monte Carlo, adaptive subsampling, sub-likelihoods, telescoping debiased likelihood version, and even our very own delayed acceptance algorithm. (Which is incorrectly described as restricted to iid data, by the way!)

The lightweight approach is based on an ABC idea of working through a summary statistic that plays the role of a pseudo-sufficient statistic. The main theoretical result in the paper is indeed that, when subsampling in an exponential family, subsamples preserving the sufficient statistics (modulo a rescaling) are optimal in terms of distance to the true posterior. Subsamples are thus weighted in terms of the (transformed) difference between the full data statistic and the subsample statistic, assuming they are both normalised to be comparable. I am quite (positively) intrigued by this idea in that it allows to somewhat compare inference based on two different samples. The weights of the subsets are then used in a pseudo-posterior that treats the subset as an auxiliary variable (and the weight as a substitute to the “missing” likelihood). This may sound a wee bit convoluted (!) but the algorithm description is not yet complete: simulating jointly from this pseudo-target is impossible because of the huge number of possible subsets. The authors thus suggest to run an MCMC scheme targeting this joint distribution, with a proposed move on the set of subsets and a proposed move on the parameter set conditional on whether or not the proposed subset has been accepted.

From an ABC perspective, the difficulty in calibrating the tolerance ε sounds more accute than usual, as the size of the subset comes as an additional computing parameter. Bootstrapping options seem impossible to implement in a large size setting.

An MCMC issue with this proposal is that designing the move across the subset space is both paramount for its convergence properties and lacking in geometric intuition. Indeed, two subsets with similar summary statistics may be very far apart… Funny enough, in the representation of the joint Markov chain, the parameter subchain is secondary if crucial to avoid intractable normalising constants. It is also unclear for me from reading the paper maybe too quickly whether or not the separate moves when switching and when not switching subsets retain the proper balance condition for the pseudo-joint to still be the stationary distribution. The stationarity for the subset Markov chain is straightforward by design, but it is not so for the parameter. In case of switched subset, simulating from the true full conditional given the subset would work, but not simulated by a fixed number L of MCMC steps.

The lightweight technology therein shows its muscles on an handwritten digit recognition example where it beats regular MCMC by a factor of 10 to 20, using only 100 datapoints instead of the 10⁴ original datapoints. While very nice and realistic, this example may be misleading in that 100 digit realisations may be enough to find a tolerable approximation to the true MAP. I was also intrigued by the processing of the probit example, until I realised the authors had integrated the covariate out and inferred about the mean of that covariate, which means it is not a genuine probit model.