Archive for Gibbs sampling

an introduction to MCMC sampling

Posted in Books, Kids, Statistics with tags , , , , , , , , , on August 9, 2022 by xi'an

Following a rather clueless question on X validated, I had a quick read of A simple introduction to Markov Chain Monte–Carlo sampling, by Ravenzwaaij, Cassey, and Brown, published in 2018 in Psychonomic Bulletin & Review, which I had never opened to this day. The setting is very basic and the authors at pain to make their explanations as simple as possible, but I find the effort somehow backfires under the excess of details. And the characteristic avoidance of mathematical symbols and formulae. For instance, in the Normal mean example that is used as introductory illustration and that confused the question originator, there is no explanation for the posterior being a N(100,15) distribution, 100 being the sample average, the notation N(μ|x,σ) is used for the posterior density, and then the Metropolis comparison brings an added layer of confusion:

“Since the target distribution is normal with mean 100 (the value of the single observation) and standard deviation 15,  this means comparing N(100|108, 15) against N(100|110, 15).”

as it most unfortunately exchanges the positions of  μ and x (which is equal to 100). There is no fundamental error there, due to the symmetry of the Normal density, but this switch from posterior to likelihood certainly contributes to the confusion of the QO. Similarly for the Metropolis step description:

“If the new proposal has a lower posterior value than the most recent sample, then randomly choose to accept or
reject the new proposal, with a probability equal to the height of both posterior values. “

And the shortcomings of MCMC may prove equally difficult to ingest: like
“The method will “work” (i.e., the sampling distribution will truly be the target distribution) as long as certain conditions are met.
Firstly, the likelihood values calculated (…) to accept or reject the new proposal must accurately reflect the density of the proposal in the target distribution. When MCMC is applied to Bayesian inference, this means that the values calculated must be posterior likelihoods, or at least be proportional to the posterior likelihood (i.e., the ratio of the likelihoods calculated relative to one another must be correct).”

which leaves me uncertain as to what the authors do mean by the alternative situation, i.e., by the proposed value not reflecting the proposal density. Again, the reluctance in using (more) formulae hurts the intended pedagogical explanations.

Bayesian sampling without tears

Posted in Books, Kids, R, Statistics with tags , , , , , , , , , , , , on May 24, 2022 by xi'an

Following a question on Stack Overflow trying to replicate a figure from the paper written by Alan Gelfand and Adrian Smith (1990) for The American Statistician, Bayesian sampling without tears, which precedes their historical MCMC papers, I looked at the R code produced by the OP and could not spot an issue as to why their simulation did not fit the posterior produced in the paper. Which proposes acceptance-rejection and sampling-importance-resampling as two solutions to approximately simulate from the posterior. The later being illustrated by simulations from the prior being weighted by the likelihood… The illustration is made of 3 observations from the sum of two Binomials with different success probabilities, θ¹ and θ². With a Uniform prior on both.

for (i in 1:N)
  for (k in 1:3){
    llh<-0
    for (j in max(0,n2[k]-y[k]):min(y[k],n1[k]))
      llh<-llh+choose(n1[k],j)*choose(n2[k],y[k]-j)*
         theta[i,1]^j*(1-theta[i,1])^(n1[k]-j)*theta[i,2]^(y[k]-j)*
         (1-theta[i,2])^(n2[k]-y[k]+j)
    l[i]=l[i]*llh}

To double-check, I also wrote a Gibbs version:

theta=matrix(runif(2),nrow=T,ncol=2)
x1=rep(NA,3)
for(t in 1:(T-1)){
   for(j in 1:3){
     a<-max(0,n2[j]-y[j]):min(y[j],n1[j])
     x1[j]=sample(a,1,
        prob=choose(n1[j],a)*choose(n2[j],y[j]-a)*
           theta[t,1]^a*(1-theta[t,1])^(n1[j]-a)*
           theta[t,2]^(y[j]-a)*(1-theta[t,2])^(n2[j]-y[j]+a)
           )}
      theta[t+1,1]=rbeta(1,sum(x1)+1,sum(n1)-sum(x1)+1)
      theta[t+1,2]=rbeta(1,sum(y)-sum(x1)+1,sum(n2)-sum(y)+sum(x1)+1)}

which did not show any difference with the above. Nor with the likelihood surface.

ordered allocation sampler

Posted in Books, Statistics with tags , , , , , , , , , , , on November 29, 2021 by xi'an

Recently, Pierpaolo De Blasi and María Gil-Leyva arXived a proposal for a novel Gibbs sampler for mixture models. In both finite and infinite mixture models. In connection with Pitman (1996) theory of species sampling and with interesting features in terms of removing the vexing label switching features.

The key idea is to work with the mixture components in the random order of appearance in an exchangeable sequence from the mixing distribution (…) In accordance with the order of appearance, we derive a new Gibbs sampling algorithm that we name the ordered allocation sampler. “

This central idea is thus a reinterpretation of the mixture model as the marginal of the component model when its parameter is distributed as a species sampling variate. An ensuing marginal algorithm is to integrate out the weights and the allocation variables to only consider the non-empty component parameters and the partition function, which are label invariant. Which reminded me of the proposal we made in our 2000 JASA paper with Gilles Celeux and Merrilee Hurn (one of my favourite papers!). And of the [first paper in Statistical Methodology] 2004 partitioned importance sampling version with George Casella and Marty Wells. As in the later, the solution seems to require the prior on the component parameters to be conjugate (as I do not see a way to produce an unbiased estimator of the partition allocation probabilities).

The ordered allocation sample considers the posterior distribution of the different object made of the parameters and of the sequence of allocations to the components for the sample written in a given order, ie y¹,y², &tc. Hence y¹ always gets associated with component 1, y² with either component 1 or component 2, and so on. For this distribution, the full conditionals are available, incl. the full posterior on the number m of components, only depending on the data through the partition sizes and the number m⁺ of non-empty components. (Which relates to the debate as to whether or not m is estimable…) This sequential allocation reminded me as well of an earlier 2007 JRSS paper by Nicolas Chopin. Albeit using particles rather than Gibbs and applied to a hidden Markov model. Funny enough, their synthetic dataset univ4 almost resembles the Galaxy dataset (as in the above picture of mine)!

multilevel linear models, Gibbs samplers, and multigrid decompositions

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on October 22, 2021 by xi'an

A paper by Giacommo Zanella (formerly Warwick) and Gareth Roberts (Warwick) is about to appear in Bayesian Analysis and (still) open for discussion. It examines in great details the convergence properties of several Gibbs versions of the same hierarchical posterior for an ANOVA type linear model. Although this may sound like an old-timer opinion, I find it good to have Gibbs sampling back on track! And to have further attention to diagnose convergence! Also, even after all these years (!), it is always a surprise  for me to (re-)realise that different versions of Gibbs samplings may hugely differ in convergence properties.

At first, intuitively, I thought the options (1,0) (c) and (0,1) (d) should be similarly performing. But one is “more” hierarchical than the other. While the results exhibiting a theoretical ordering of these choices are impressive, I would suggest pursuing an random exploration of the various parameterisations in order to handle cases where an analytical ordering proves impossible. It would most likely produce a superior performance, as hinted at by Figure 4. (This alternative happens to be briefly mentioned in the Conclusion section.) The notion of choosing the optimal parameterisation at each step is indeed somewhat unrealistic in that the optimality zones exhibited in Figure 4 are unknown in a more general model than the Gaussian ANOVA model. Especially with a high number of parameters, parameterisations, and recombinations in the model (Section 7).

An idle question is about the extension to a more general hierarchical model where recentring is not feasible because of the non-linear nature of the parameters. Even though Gaussianity may not be such a restriction in that other exponential (if artificial) families keeping the ANOVA structure should work as well.

Theorem 1 is quite impressive and wide ranging. It also reminded (old) me of the interleaving properties and data augmentation versions of the early-day Gibbs. More to the point and to the current era, it offers more possibilities for coupling, parallelism, and increasing convergence. And for fighting dimension curses.

“in this context, imposing identifiability always improves the convergence properties of the Gibbs Sampler”

Another idle thought of mine is to wonder whether or not there is a limited number of reparameterisations. I think that by creating unidentifiable decompositions of (some) parameters, eg, μ=μ¹+μ²+.., one can unrestrictedly multiply the number of parameterisations. Instead of imposing hard identifiability constraints as in Section 4.2, my intuition was that this de-identification would increase the mixing behaviour but this somewhat clashes with the above (rigorous) statement from the authors. So I am proven wrong there!

Unless I missed something, I also wonder at different possible implementations of HMC depending on different parameterisations and whether or not the impact of parameterisation has been studied for HMC. (Which may be linked with Remark 2?)

ensemble Metropolis-Hastings

Posted in Books, Kids, Statistics with tags , , , , , on October 14, 2021 by xi'an

A question on X validated about ensemble MCMC samplers had me try twice to justify the Metropolis-Hasting ratio the authors used. To recap, ensemble sampling moves a cloud of points (just like our bouncy particle sampler) one point X at a time by using another point Z as a pivot or origin and moving randomly X along the line [XZ]. In the paper,  the distribution of the rescaling is symmetric in the sense that f(z)=f(1/z). I indeed started by perceiving the basic step of the sampler as a Metropolis-within-Gibbs step along a random direction. But it did not work as the direction depends on the current X. I then wondered at a possible importance sampling interpretation compensating for the change of scale, but it was leading to the wrong power anyway. Before hitting the fact that this was actually a change of radius in the space with origin Z, leaving the angular coordinates invariant. Which explained for the power (n-1) in the Metropolis ratio, in agreement with a switch to polar coordinates.

%d bloggers like this: