Archive for convergence of Gibbs samplers

Gibbs for kidds

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , , , , , on February 12, 2018 by xi'an

 

A chance (?) question on X validated brought me to re-read Gibbs for Kids, 25 years after it was written (by my close friends George and Ed). The originator of the question had difficulties with the implementation, apparently missing the cyclic pattern of the sampler, as in equations (2.3) and (2.4), and with the convergence, which is only processed for a finite support in the American Statistician paper. The paper [which did not appear in American Statistician under this title!, but inspired an animal bredeer, Dan Gianola, to write a “Gibbs for pigs” presentation in 1993 at the 44th Annual Meeting of the European Association for Animal Production, Aarhus, Denmark!!!] most appropriately only contains toy examples since those can be processed and compared to know stationary measures. This is for instance the case for the auto-exponential model

f(x,y) \propto exp(-xy)

which is only defined as a probability density for a compact support. (The paper does not identify the model as a special case of auto-exponential model, which apparently made the originator of the model, Julian Besag in 1974, unhappy, as George and I found out when visiting Bath, where Julian was spending the final year of his life, many years later.) I use the limiting case all the time in class to point out that a Gibbs sampler can be devised and operate without a stationary probability distribution. However, being picky!, I would like to point out that, contrary, to a comment made in the paper, the Gibbs sampler does not “fail” but on the contrary still “converges” in this case, in the sense that a conditional ergodic theorem applies, i.e., the ratio of the frequencies of visits to two sets A and B with finite measure do converge to the ratio of these measures. For instance, running the Gibbs sampler 10⁶ steps and ckecking for the relative frequencies of x’s in (1,2) and (1,3) gives 0.685, versus log(2)/log(3)=0.63, since 1/x is the stationary measure. One important and influential feature of the paper is to stress that proper conditionals do not imply proper joints. George would work much further on that topic, in particular with his PhD student at the time, my friend Jim Hobert.

With regard to the convergence issue, Gibbs for Kids points out to Schervish and Carlin (1990), which came quite early when considering Gelfand and Smith published their initial paper the very same year, but which also adopts a functional approach to convergence, along the paper’s fixed point perspective, somehow complicating the matter. Later papers by Tierney (1994), Besag (1995), and Mengersen and Tweedie (1996) considerably simplified the answer, which is that irreducibility is a necessary and sufficient condition for convergence. (Incidentally, the reference list includes a technical report of mine’s on latent variable model MCMC implementation that never got published.)

a new paradigm for improper priors

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , on November 6, 2017 by xi'an

Gunnar Taraldsen and co-authors have arXived a short note on using improper priors from a new perspective. Generalising an earlier 2016 paper in JSPI on the same topic. Which both relate to a concept introduced by Rényi (who himself attributes the idea to Kolmogorov). Namely that random variables measures are to be associated with arbitrary measures [not necessarily σ-finite measures, the later defining σ-finite random variables], rather than those with total mass one. Which allows for an alternate notion of conditional probability in the case of σ-finite random variables, with the perk that this conditional probability distribution is itself of mass 1 (a.e.).  Which we know happens when moving from prior to proper posterior.

I remain puzzled by the 2016 paper though as I do not follow the meaning of a random variable associated with an infinite mass probability measure. If the point is limited to construct posterior probability distributions associated with improper priors, there is little value in doing so. The argument in the 2016 paper is however that one can then define a conditional distribution in marginalisation paradoxes à la Stone, Dawid and Zidek (1973) where the marginal does not exist. Solving with this formalism the said marginalisation paradoxes as conditional distributions are only defined for σ-finite random variables. Which gives a fairly different conclusion that either Stone, Dawid and Zidek (1973) [with whom I agree, namely that there is no paradox because there is no “joint” distribution] or Jaynes (1973) [with whom I less agree!, in that the use of an invariant measure to make the discrepancy go away is not a particularly strong argument in favour of this measure]. The 2016 paper also draws an interesting connection with the study by Jim Hobert and George Casella (in Jim’s thesis) of [null recurrent or transient] Gibbs samplers with no joint [proper] distribution. Which in some situations can produce proper subchains, a phenomenon later exhibited by Alan Gelfand and Sujit Sahu (and Xiao-Li Meng as well if I correctly remember!). But I see no advantage in following this formalism, as it does not impact whether the chain is transient or null recurrent, or anything connected with its implementation. Plus a link to the approximation of improper priors by sequences of proper ones by Bioche and Druihlet I discussed a while ago.

pseudo-marginal MCMC with minimal replicas

Posted in Books, Statistics, University life with tags , , , , , , on November 16, 2016 by xi'an

A week ago, Chris Sherlock, Alexandre Thiery and Anthony Lee (Warwick) arXived a short paper where they show that it is most efficient to use only one random substitute to the likelihood when considering a pseudo-marginal algorithm. Thus generalising an earlier paper of Luke Bornn and co-authors I commented earlier. A neat side result in the paper is that the acceptance probability for m replicas [in the pseudo-marginal approximation] is at most m/s the acceptance probability for s replicas when s<m. The main result is as follows:

screenshot_20161114_140345There is a (superficial?) connection with the fact that when running Metropolis-within-Gibbs there is no advantage in doing more than one single Metropolis iteration, as the goal is to converge to the joint posterior, rather than approximating better the full conditional…

inefficiency of data augmentation for large samples

Posted in Books, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , on May 31, 2016 by xi'an

On Monday, James Johndrow, Aaron Smith, Natesh Pillai, and David Dunson arXived a paper on the diminishing benefits of using data augmentation for large and highly imbalanced categorical data. They reconsider the data augmentation scheme of Tanner and Wong (1987), surprisingly not mentioned, used in the first occurrences of the Gibbs sampler like Albert and Chib’s (1993) or our mixture estimation paper with Jean Diebolt (1990). The central difficulty with data augmentation is that the distribution to be simulated operates on a space that is of order O(n), even when the original distribution covers a single parameter. As illustrated by the coalescent in population genetics (and the subsequent intrusion of the ABC methodology), there are well-known cases when the completion is near to impossible and clearly inefficient (as again illustrated by the failure of importance sampling strategies on the coalescent). The paper provides spectral gaps for the logistic and probit regression completions, which are of order a power of log(n) divided by √n, when all observations are equal to one. In a somewhat related paper with Jim Hobert and Vivek Roy, we studied the spectral gap for mixtures with a small number of observations: I wonder at the existence of a similar result in this setting, when all observations stem from one component of the mixture, when all observations are one. The result in this paper is theoretically appealing, the more because the posteriors associated with such models are highly regular and very close to Gaussian (and hence not that challenging as argued by Chopin and Ridgway). And because the data augmentation algorithm is uniformly ergodic in this setting (as we established with Jean Diebolt  and later explored with Richard Tweedie). As demonstrated in the  experiment produced in the paper, when comparing with HMC and Metropolis-Hastings (same computing times?), which produce much higher effective sample sizes.

twilight zone [of statistics]

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , , , on February 26, 2016 by xi'an

mixture with unknown means“I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs.” L. Wasserman

Larry Wasserman once remarked that finite mixtures were like the twilight zone of statistics, thanks to the numerous idiosyncrasies associated with such models. And George Casella had similar strong reservations about mixture estimation. Avi Feller and co-authors [including Natesh Pillai] have just arXived a paper on this topic, exhibiting shocking (!) properties of the MLE! Their core example is a mixture of two normal distributions with known common variance and known weight different from 0.5, which ensures identifiability. This is a favourite example of mine that we used for instance in our book Introducing Monte Carlo methods with R. If only because we can plot the likelihood and posterior surfaces. (Warning: I wrote those notes on an earlier version of the paper, so mileage may vary in terms of accuracy!)

The “shocking” discovery in the paper is that the MLE is wrong as often as not in selecting the sign of the difference Δ between both means, with an additional accumulation point at zero. The global mode may thus be in the wrong place for small enough sample sizes. And even for larger sizes: when the difference between the means is small the likelihood is likely to be unimodal with a mode quite close to zero. (An interesting remark is that the likelihood derivative is always zero at Δ=0 when considering the special case of both means equal to -Δ and to πΔ/(1-π), respectively, which implies that the overall mean of the mixture is equal to zero. A potential connection with our reparameterisation paper, maybe?)

The alternative proposed by Avi and his co-authors is to proceed through moments, i.e., to revert to Pearson (1892). There are however difficulties with this approach, first and foremost the non-uniqueness of the moment equations used to estimate Δ. For instance, the second cumulant equation chosen by the authors is not always defined as opposed to the third cumulant equation (why not using this third cumulant then). Which does not always produce the right sign… But, in a strange twist, the authors turn those deficiencies into signals for both pathologies (wrong sign and “pile-up” at zero).

“…the grid bootstrap yields an exact p-value for any valid test statistic.”

The most importance issue in this framework being in estimating the parameters, the authors opt for an approach based on tests, which is definitely surprising given the well-known deficiencies of standard tests in mixtures. The test chosen here is a Wald test with a statistic equal to the χ² version of the first cumulant differences. I am surprised that the χ² approximation works in such an unfriendly setting. And I do not understand how the grid is used, unless a certain degree of approximation is accepted, which takes us back to the “dark ages” of imposing a minimal distance Δ to achieve consistency, as in Ghosh and Sen (1985).

muminusmu0 muminusmu1

“..our concern about sign error is trivial in the Bayesian setting: the global mode is simply a poor summary of a multi-modal posterior. More broadly, the weak identification issues we highlight in this paper are not necessarily relevant to a strict Bayesian.”

A priori, I do not think pathologies of the MLE always transfer to Bayes estimators, unless one uses the MAP as an [poor] estimator. But using the MAP is not necessary since posterior means are meaningful in this identified setting, where label switching should not occur. However, running the same experiments with a Gaussian prior on both means and using the posterior mean as my estimator, I did obtain the same pathology of Bayes estimates [also produced in the supplementary material] not concentrating on the true value of the difference, but putting weight on the opposite value and at zero. Using a less standard prior inspired by David Rossell’s talk on non-local priors two weeks ago, which avoids a neighbourhood of zero, I did not get a much different picture as illustrated below:

muminusmux0 muminusmux0

Overall, I remain somewhat uncertain as to what to conclude from this pathological behaviour. When both means are close enough, the sign of the difference is often estimated wrongly. But that could simply mean that the means are not significantly different, for that sample size…

scaling the Gibbs posterior credible regions

Posted in Books, Statistics, University life with tags , , , , , , , on September 11, 2015 by xi'an

“The challenge in implementation of the Gibbs posterior is that it depends on an unspecified scale (or inverse temperature) parameter.”

A new paper by Nick Syring and Ryan Martin was arXived today on the same topic as the one I discussed last January. The setting is the same as with empirical likelihood, namely that the distribution of the data is not specified, while parameters of interest are defined via moments or, more generally, a minimising a loss function. A pseudo-likelihood can then be constructed as a substitute to the likelihood, in the spirit of Bissiri et al. (2013). It is called a “Gibbs posterior” distribution in this paper. So the “Gibbs” in the title has no link with the “Gibbs” in Gibbs sampler, since inference is conducted with respect to this pseudo-posterior. Somewhat logically (!), as n grows to infinity, the pseudo- posterior concentrates upon the pseudo-true value of θ minimising the expected loss, hence asymptotically resembles to the M-estimator associated with this criterion. As I pointed out in the discussion of Bissiri et al. (2013), one major hurdle when turning a loss into a log-likelihood is that it is at best defined up to a scale factor ω. The authors choose ω so that the Gibbs posterior

\exp\{-\omega n l_n(\theta,x) \}\pi(\theta)

is well-calibrated. Where ln is the empirical averaged loss. So the Gibbs posterior is part of the matching prior collection. In practice the authors calibrate ω by a stochastic optimisation iterative process, with bootstrap on the side to evaluate coverage. They briefly consider empirical likelihood as an alternative, on a median regression example, where they show that their “Gibbs confidence intervals (…) are clearly the best” (p.12). Apart from the relevance of being “well-calibrated”, and the asymptotic nature of the results. and the dependence on the parameterisation via the loss function, one may also question the possibility of using this approach in large dimensional cases where all of or none of the parameters are of interest.

likelihood-free inference in high-dimensional models

Posted in Books, R, Statistics, University life with tags , , , , , , , , , on September 1, 2015 by xi'an

“…for a general linear model (GLM), a single linear function is a sufficient statistic for each associated parameter…”

Water Tower, Michigan Avenue, Chicago, Oct. 31, 2012The recently arXived paper “Likelihood-free inference in high-dimensional models“, by Kousathanas et al. (July 2015), proposes an ABC resolution of the dimensionality curse [when the dimension of the parameter and of the corresponding summary statistics] by turning Gibbs-like and by using a component-by-component ABC-MCMC update that allows for low dimensional statistics. In the (rare) event there exists a conditional sufficient statistic for each component of the parameter vector, the approach is just as justified as when using a generic ABC-Gibbs method based on the whole data. Otherwise, that is, when using a non-sufficient estimator of the corresponding component (as, e.g., in a generalised [not general!] linear model), the approach is less coherent as there is no joint target associated with the Gibbs moves. One may therefore wonder at the convergence properties of the resulting algorithm. The only safe case [in dimension 2] is when one of the restricted conditionals does not depend on the other parameter. Note also that each Gibbs step a priori requires the simulation of a new pseudo-dataset, which may be a major imposition on computing time. And that setting the tolerance for each parameter is a delicate calibration issue because in principle the tolerance should depend on the other component values. Continue reading