Archive for hierarchical Bayesian modelling

latent nested nonparametric priors

Posted in Books, Statistics on September 23, 2019 by xi'an

A paper on an extended type of non-parametric priors by Camerlenghi et al. [all good friends!] is about to appear in Bayesian Analysis, with a discussion open for contributions (until October 15). While a fairly theoretical piece of work, it validates a Bayesian approach to the non-parametric clustering of separate populations that, broadly speaking, share common clusters. More formally, it constructs a new family of models that allows for partial or complete equality between two probability measures, without forcing full identity as soon as the associated samples share some common observations. The more traditional structures prohibit one or the other: the Dirichlet process (DP) prevents two realised probability measures from being equal or even partly equal (no common atoms); the hierarchical DP (HDP) allows for common atoms across realised measures but prohibits complete identity between two realised distributions; and the nested DP offers one extra level of randomness, yet its infinity of DP realisations rules out any common atomic support short of completely identical supports (and hence distributions).
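For readers less familiar with these constructions, here is a standard schematic contrast (my notation, with a diffuse base measure H, not the paper's):

\begin{align*}
\text{DP:}\quad & G_1, G_2 \stackrel{\text{iid}}{\sim} \mathrm{DP}(\alpha, H) && \Rightarrow\ \text{no common atoms (a.s.) and } \Pr(G_1=G_2)=0,\\
\text{HDP:}\quad & G_0 \sim \mathrm{DP}(\gamma, H),\ \ G_j\mid G_0 \stackrel{\text{iid}}{\sim} \mathrm{DP}(\alpha, G_0) && \Rightarrow\ \text{common atoms, but } \Pr(G_1=G_2)=0,\\
\text{nested DP:}\quad & G_j \stackrel{\text{iid}}{\sim} Q,\ \ Q \sim \mathrm{DP}(\kappa, \mathrm{DP}(\alpha,H)) && \Rightarrow\ \Pr(G_1=G_2)>0,\ \text{no shared atoms otherwise.}
\end{align*}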

The current paper writes each of the two realised random measures as the sum of a common random measure and of one of two separate, almost independent, random measures: (14) is the core formula of the paper, the one that allows for partial or total equality. An extension beyond two samples seems complicated, if only because of the number of common measures one would have to introduce, from the fully shared measure down to measures shared by only a subset of the samples, except in the simplified framework where a single, universally common measure is adopted (with enough justification). The randomness of the model is handled via different completely random measures, involving something like four degrees of hierarchy in the Bayesian model.
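Schematically, and in my own notation rather than the paper's display (14), the two realised distributions are of the form

\[
\tilde p_\ell \ \propto\ \mu_S + \mu_\ell, \qquad \ell = 1, 2,
\]

where μ_S is a completely random measure shared by the two populations and μ_1, μ_2 are population-specific completely random measures: the shared component μ_S produces partial equality (common atoms), while, as I read it, total equality obtains when the idiosyncratic components themselves coincide, which the (latent) nested structure allows with positive probability.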

The example somewhat central to the paper is that of two two-component Normal mixtures sharing a common component (but with different mixture weights), which is handled by the approach, although it seems it was already covered by the HDP. Having exactly the same term (i.e., with the very same weight) is not, but this may be less interesting in real-life applications. Note that alternative, easily constructed, parametric constructs are already available in this specific case, involving limited prior input and a lighter computational burden, although the Gibbs sampler behind the model proves extremely simple in the paper. (One may wonder at the robustness of the sampler once the case of identical distributions is visited.)
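For concreteness, the two-mixture example has the following structure (a minimal simulation sketch in Python, with made-up parameter values of my own; one Normal component is common to both populations while the weights, and the second component, differ):

import numpy as np

rng = np.random.default_rng(0)

# shared Normal component, identical in both populations
mu_common, sd_common = 0.0, 1.0
# population-specific Normal components (hypothetical values)
mu_idio = {1: 3.0, 2: -2.5}
sd_idio = {1: 0.8, 2: 1.2}
# mixture weights of the common component differ across populations
w = {1: 0.6, 2: 0.3}

def sample_population(ell, n):
    """Draw n observations from population ell's two-component mixture."""
    from_common = rng.random(n) < w[ell]
    return np.where(from_common,
                    rng.normal(mu_common, sd_common, n),
                    rng.normal(mu_idio[ell], sd_idio[ell], n))

x1, x2 = sample_population(1, 100), sample_population(2, 100)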

Due to the combinatorial explosion associated with a larger number of observed samples, despite obvious practical situations, one may wonder at any feasible (and possibly sequential) extension that would further keep coherence under marginalisation (in the number of samples), and also whether or not multiple testing could be coherently envisioned in this setting, for instance when handling all hospitals in the UK. Another consistency question covers the Bayes factor used to assess whether or not the two distributions behind the samples are identical. (One may wonder at the importance of the question, hopefully applied to more relevant datasets than the Iris data!)
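In generic terms, the Bayes factor in question follows from the posterior and prior odds of the identity event,

\[
B_{01} = \frac{\Pr(\tilde p_1 = \tilde p_2 \mid x_1, x_2)}{\Pr(\tilde p_1 \ne \tilde p_2 \mid x_1, x_2)} \Big/ \frac{\Pr(\tilde p_1 = \tilde p_2)}{\Pr(\tilde p_1 \ne \tilde p_2)},
\]

so questions about its consistency amount to questions about the behaviour of the posterior probability of the event {p̃_1 = p̃_2} as both sample sizes grow.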

ABC with Gibbs steps

Posted in Statistics on June 3, 2019 by xi'an

With Grégoire Clarté, Robin Ryder and Julien Stoehr, all from Paris-Dauphine, we have just arXived a paper on the specifics of ABC-Gibbs, which is a version of ABC where the generic ABC accept-reject step is replaced by a sequence of n conditional ABC accept-reject steps, each aiming at an ABC version of a conditional distribution extracted from the joint and intractable target. Hence an ABC version of the standard Gibbs sampler. What makes it so special is that each conditional can (and should) condition on a different statistic, in order to decrease the dimension of this statistic, ideally down to the dimension of the corresponding component of the parameter. This successfully bypasses the curse of dimensionality but immediately meets two difficulties. The first one is that the resulting sequence of conditionals is not coherent, since it is not a Gibbs sampler on the ABC target: the conditionals are incompatible and convergence of the associated Markov chain therefore becomes an issue. We produce sufficient conditions for the Gibbs sampler to converge to a stationary distribution despite these incompatible conditionals. The second problem is that, provided it exists, the limiting and also intractable distribution does not enjoy a Bayesian interpretation, hence may fail to be justified from an inferential viewpoint. We however succeed in producing a version of ABC-Gibbs in a hierarchical model where the limiting distribution can be made explicit and, even better, can be weighted towards recovering the original target (at least in the limit of a zero tolerance).
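In schematic form, one sweep of ABC-Gibbs replaces each Gibbs conditional with an ABC accept-reject step driven by its own (low-dimensional) summary statistic and tolerance. Below is a minimal sketch under my own conventions (prior-conditional proposals, one statistic and one tolerance per component), not the authors' implementation:

import numpy as np

def abc_gibbs(y, prior_conditionals, simulate, summaries, eps, n_iter, theta0,
              rng=None):
    """Schematic ABC-Gibbs: one ABC accept-reject step per component.

    prior_conditionals[j](theta, rng): proposal for theta_j given theta_{-j}
        (typically a draw from the prior conditional of theta_j)
    simulate(theta, rng): pseudo-data generated from the model at theta
    summaries[j](data): low-dimensional statistic attached to component j
    eps[j]: tolerance attached to component j
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float).copy()
    chain = []
    for _ in range(n_iter):
        for j in range(theta.size):
            # ABC accept-reject for the j-th conditional: propose theta_j,
            # simulate pseudo-data, accept once the j-th summary falls
            # within eps[j] of the observed one
            while True:
                prop = theta.copy()
                prop[j] = prior_conditionals[j](theta, rng)
                z = simulate(prop, rng)
                dist = np.linalg.norm(np.atleast_1d(summaries[j](z))
                                      - np.atleast_1d(summaries[j](y)))
                if dist <= eps[j]:
                    theta = prop
                    break
        chain.append(theta.copy())
    return np.array(chain)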

same risk, different estimators

Posted in Statistics on November 10, 2017 by xi'an

An interesting question on X validated reminded me of the epiphany I had some twenty years ago when reading an Annals of Statistics paper by Anirban Das Gupta and Bill Strawderman on shrinkage estimators, namely that some estimators share the same risk function, meaning their integrated loss is the same for all values of the parameter. As indicated in this question, Stefan's instructor seems to believe that two estimators having the same risk function must be a.s. identical. Which is not true, as exemplified by the James-Stein (1961) estimator with scale 2(p-2), which has constant risk p, just like the maximum likelihood estimator. I presume the confusion stemmed from the concept of completeness, where having a function with constant expectation under all values of the parameter implies that this function is a.s. constant. But, for loss functions, the concept does not apply since the loss depends both on the observation (which is complete in a Normal model) and on the parameter.
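The computation behind the example is the standard one: for X ∼ N_p(θ, I_p), p ≥ 3, and the shrinkage estimator δ_a(x) = (1 − a/‖x‖²) x, Stein's identity gives

\[
R(\theta,\delta_a) = \mathbb{E}_\theta\big[\|\delta_a(X)-\theta\|^2\big] = p - \big(2a(p-2) - a^2\big)\,\mathbb{E}_\theta\!\left[\frac{1}{\|X\|^2}\right],
\]

so that a = p−2 yields the dominating James-Stein estimator, while a = 2(p−2) cancels the bracket and returns the constant risk p of the MLE δ_0(x) = x, even though δ_{2(p-2)} differs from δ_0 almost everywhere.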

Nonparametric hierarchical Bayesian quantiles

Posted in Books, Statistics, University life on June 9, 2016 by xi'an

Luke Bornn, Neil Shephard and Reza Solgi have recently arXived a research report on non-parametric Bayesian quantiles. This work relates to their earlier paper combining Bayesian inference with moment estimators, in that the quantiles do not entirely define the distribution of the data, which then needs to be completed by Bayesian means. But contrary to this previous paper, it does not require MCMC simulation over distributions defined on a variety such as, e.g., a curve.

Here a quantile is defined as the minimiser of an asymmetric absolute risk, i.e., of an expected loss. It is therefore a deterministic function of the model parameters in a parametric model and a functional of the model otherwise, connected to a moment if not a moment per se. In the case of a model with discrete support, the unconstrained model is parameterised by the probability vector θ and β=t(θ). The authors, however, study the opposite approach, namely setting a prior on β, p(β), and then complementing this prior with a conditional prior on θ, p(θ|β), the joint prior p(β)p(θ|β) inducing the marginal p(θ) because of the deterministic relation. I am getting slightly lost in the motivation for the derivation of the conditional, when the authors pick an arbitrary prior on θ and use it to derive a conditional of θ given β which, along with an arbitrary (“scientific”) prior on β, defines a new prior on θ. This works out in the discrete case because β has a finite support, but it is unclear (to me) why it should work in the continuous case [not covered in the paper].
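In other words (the standard check-loss characterisation, in my notation), the τ-quantile is

\[
\beta_\tau = \arg\min_{b} \ \mathbb{E}\big[\rho_\tau(X-b)\big], \qquad \rho_\tau(u) = u\,\big(\tau - \mathbb{1}\{u<0\}\big),
\]

so that, in a parametric model f(·|θ), β_τ = t(θ) is indeed a deterministic function of θ; for a finite support {x_1 < … < x_k} with probability vector θ, t(θ) is simply the smallest x_j with θ_1 + … + θ_j ≥ τ.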

Getting back to the central idea of defining first the distribution on the quantile β, a further motivation is provided in the hierarchical extension of Section 3, where the same quantile distribution is shared by all individuals (e.g., cricket players) in the population, while the underlying distributions for the individuals are otherwise disconnected and unconstrained. (Obviously, a part of the cricket example went far above my head. But one may always idly wonder why all players should share the same distribution, and what would happen when imposing no quantile constraint but picking instead a direct hierarchical model on the θ's.) This common distribution on β can then be modelled by a Dirichlet hyperprior.
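If I read the hierarchical extension correctly, the structure is something like (my schematic notation, not the authors')

\[
G \sim \text{Dirichlet hyperprior}, \qquad \beta_i \mid G \sim G, \qquad \theta_i \mid \beta_i \sim p(\theta_i \mid \beta_i)\ \text{with } t(\theta_i)=\beta_i, \qquad x_{ij} \mid \theta_i \sim f(\cdot \mid \theta_i),
\]

with the individual distributions θ_i otherwise unconstrained once their quantile is pinned down by β_i.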

The paper also contains a section on estimating the entire quantile function, which is a wee bit paradoxical in that this function is again a deterministic transform of the original parameter θ, while the authors rely instead on pointwise estimation, i.e., one level τ at a time. I find the exercise all the more paradoxical in that the hierarchical modelling with a common distribution on the quantile β(τ) is repeated for each τ, but separately, whereas, given the equivalence between the quantile function and the entire parameter θ, the entire parameter should share a common distribution.

covariant priors, Jeffreys and paradoxes

Posted in Books, Statistics, University life on February 9, 2016 by xi'an

“If no information is available, π(α|M) must not deliver information about α.”

In a recent arXival apparently submitted to Bayesian Analysis, Giovanni Mana and Carlo Palmisano discuss the choice of priors in metrology. Which reminded me of this meeting I attended at the Bureau des Poids et Mesures in Sèvres where similar debates took place, albeit led by ferocious anti-Bayesians! Their reference prior appears to be the Jeffreys prior, because of its reparameterisation invariance.

“The relevance of the Jeffreys rule in metrology and in expressing uncertainties in measurements resides in the metric invariance.”

This, along with a second-order approximation to the Kullback-Leibler divergence, is indeed one reason for advocating the use of a Jeffreys prior. At first, I found it surprising that the (usually improper) prior is used in a marginal likelihood, as it cannot be normalised, a source of much debate [and of our alternative proposal].
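As a reminder (standard definitions, in my notation), the Jeffreys prior is

\[
\pi_J(\theta) \ \propto\ \sqrt{\det I(\theta)}, \qquad I(\theta) = \mathbb{E}_\theta\!\left[\nabla_\theta \log f(X\mid\theta)\,\nabla_\theta \log f(X\mid\theta)^{\!\top}\right],
\]

and, for a reparameterisation φ = h(θ), it transforms as a density, π_J(φ) = π_J(θ(φ)) |det ∂θ/∂φ|, which is the metric (or reparameterisation) invariance invoked in the quotes.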

“To make a meaningful posterior distribution and uncertainty assessment, the prior density must be covariant; that is, the prior distributions of different parameterizations must be obtained by transformations of variables. Furthermore, it is necessary that the prior densities are proper.”

The above quote is quite interesting, both in that the notion of covariant is used rather than invariant or equivariant, and in that properness is stated as a requirement. (Even more surprising is the noun associated with covariant, since it clashes with the usual notion of covariance!) The authors conclude that the marginal associated with an improper prior is null because the normalising constant of the prior is infinite.
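One way to read their argument (my reconstruction): if the prior is improper, π(θ) ∝ g(θ) with ∫ g(θ)dθ = ∞, and if one insists on normalising it by truncation to an increasing sequence of compacts K_n ↑ Θ, then

\[
m_n(x) = \frac{\int_{K_n} f(x\mid\theta)\,g(\theta)\,\mathrm d\theta}{\int_{K_n} g(\theta)\,\mathrm d\theta} \ \longrightarrow\ 0
\]

whenever the numerator remains bounded, hence their null marginal; the more common objection is rather that the marginal of an improper prior is only defined up to an arbitrary multiplicative constant and is thus meaningless for model comparison.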

“…the posterior probability of a selected model must not be null; therefore, improper priors are not allowed.”

Maybe not so surprisingly given this stance on improper priors, the authors cover a collection of “paradoxes” in their final and longest section, most of which make little sense to me. First, they point out that the reference priors of Berger, Bernardo and Sun (2015) are not invariant, but this should not come as a surprise given that they focus on parameters of interest versus nuisance parameters. The second issue pointed out by the authors is that, under Jeffreys’ prior, the posterior distribution of a given normal mean for n observations is a t with n degrees of freedom, while it is a t with n-1 degrees of freedom from a frequentist perspective. This is not such a paradox since both distributions operate in different spaces. Further, unless I am confused, this is one of the marginalisation paradoxes, whose more straightforward explanation is that marginalisation is not meaningful for improper priors. A third paradox relates to a contingency table with a large number of cells, in that the posterior mean of a cell probability goes to zero as the number of cells goes to infinity. (In this case, Jeffreys’ prior is proper.) Again not much of a bummer: there is simply not enough information in the data when faced with an infinite number of parameters. Paradox #4 is the Stein paradox, when estimating the squared norm of a normal mean: Jeffreys’ prior then leads to a constant bias that increases with the dimension of the vector. Definitely a bad point for Jeffreys’ prior, except that there is no Bayes estimator in such a case, the Bayes risk being infinite. Using a renormalised loss function solves the issue, rather than introducing, as in the paper, uniform priors on intervals, which require hyperpriors without being particularly compelling. The fifth paradox is the Neyman-Scott problem, with again Jeffreys’ prior the culprit, since the resulting estimator of the variance is inconsistent, off by a multiplicative factor of 2. Another stone in Jeffreys’ garden [of forking paths!]. The authors consider that the prior gives zero weight to any interval not containing zero, as if it were a proper probability distribution, and “solve” the problem by avoiding zero altogether, which of course requires specifying a lower bound on the variance. And then introducing another (improper) Jeffreys prior on that bound… The last and final paradox mentioned in this paper is one of the marginalisation paradoxes, with the bizarre explanation that, since the mean and variance parameters, μ and σ, are not independent a posteriori, “the information delivered by x̄ should not be neglected”.
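For the record, the Neyman-Scott inconsistency alluded to above is the classical one: with paired observations x_{i1}, x_{i2} ∼ N(μ_i, σ²), i = 1, …, n, the maximum likelihood estimator of the variance is

\[
\hat\sigma^2_{\text{MLE}} = \frac{1}{4n}\sum_{i=1}^{n}(x_{i1}-x_{i2})^2 \ \longrightarrow\ \frac{\sigma^2}{2} \qquad (n \to \infty),
\]

and the posterior based on the joint Jeffreys prior on (μ_1, …, μ_n, σ) concentrates at the same wrong value, hence the multiplicative factor of 2.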