Archive for improper priors

priors without likelihoods are like sloths without…

Posted in Books, Statistics on September 11, 2017 by xi'an

“The idea of building priors that generate reasonable data may seem like an unusual idea…”

Andrew, Dan, and Michael arXived an opinion piece last week entitled “The prior can generally only be understood in the context of the likelihood”. Which connects to the earlier Read Paper of Gelman and Hennig I discussed last year. I cannot state strong disagreement with the positions taken in this piece, actually, in that I do not think prior distributions ever occur as a given but are rather chosen as a reference measure to probabilise the parameter space and eventually prioritise regions over others. If anything I find myself even further on the prior agnosticism gradation. (Of course, this lack of disagreement applies to the likelihood understood as a function of both the data and the parameter, rather than of the parameter only, conditional on the data. Priors cannot depend on the data without incurring disastrous consequences!)

“…it contradicts the conceptual principle that the prior distribution should convey only information that is available before the data have been collected.”

The first example is somewhat disappointing in that it revolves, as in so many Bayesian textbooks (since Laplace!), around the [sex ratio] Binomial probability parameter and concludes on the strong or long-lasting impact of the Uniform prior. I do not see much of a contradiction between the use of a Uniform prior and the collection of prior information, if only because there is no standardised way to transfer prior information into prior construction. And more fundamentally because a parameter rarely makes sense by itself, alone, without a model that relates it to potential data. As for instance in a regression model. Moreover, following my epiphany of last semester about the relativity of the prior, I see no damage in the prior being relevant, as I only attach a relative meaning to statements based on the posterior. Rather than trying to limit the impact of a prior, we should build assessment tools to measure this impact, for instance by prior predictive simulations. And this is where I come to quite agree with the authors.
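As a minimal sketch of what a prior predictive assessment could look like in this Binomial setting (not taken from the paper): the prior predictive of the count y under a Beta(a, b) prior is the Beta-Binomial distribution, which is exactly flat over {0, …, n} for the Uniform prior. The sample size n = 20 and the Beta(50, 50) comparison prior are purely hypothetical choices made for illustration.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def prior_predictive_pmf(n, a, b):
    """Prior predictive (Beta-Binomial) pmf of y in {0..n} when theta ~ Beta(a, b)."""
    return [
        math.comb(n, y) * math.exp(log_beta(y + a, n - y + b) - log_beta(a, b))
        for y in range(n + 1)
    ]

n = 20  # hypothetical number of births
uniform = prior_predictive_pmf(n, 1.0, 1.0)   # Uniform prior on theta
peaked = prior_predictive_pmf(n, 50.0, 50.0)  # hypothetical prior concentrated near 1/2

# Compare which data sets each prior considers plausible before seeing any data.
for y in (0, 5, 10, 15, 20):
    print(f"y={y:2d}  uniform={uniform[y]:.4f}  peaked={peaked[y]:.4f}")
```

The Uniform prior spreads its predictive mass evenly over all counts, including the extreme ones, which is one way of reading its impact through the data it generates rather than through the parameter alone.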

“…non-identifiabilities, and near nonidentifiabilites, of complex models can lead to unexpected amounts of weight being given to certain aspects of the prior.”

Another rather straightforward remark is that non-identifiable models see the impact of a prior remain as the sample size grows. And I still see no issue with this fact in a relative approach. When the authors mention (p.7) that purely mathematical priors perform more poorly than weakly informative priors it is hard to see what they mean by this “performance”.

“…judge a prior by examining the data generating processes it favors and disfavors.”

Besides those points, I completely agree with them about the fundamental relevance of the prior as a generative process, only when the likelihood becomes available. And simulatable. (This point is found in many references, including our response to the American Statistician paper Hidden dangers of specifying noninformative priors, with Kaniav Kamary. With the same illustration on a logistic regression.) I also agree with their criticism of the marginal likelihood and Bayes factors as being so strongly impacted by the choice of a prior, if treated as absolute quantities. I also, if more reluctantly and somewhat heretically, see a point in using the posterior predictive for assessing whether a prior is relevant for the data at hand. At least at a conceptual level. I am however less certain about how to handle improper priors based on their recommendations. In conclusion, it would be great to see one [or more] of the authors at O-Bayes 2017 in Austin as I am sure it would spark nice discussions there! (And by the way I have no prior idea on how to conclude the comparison in the title!)
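To make the generative reading of a prior concrete in a logistic regression, here is a hedged sketch (an illustration of the general idea, not a reproduction of the example in either paper). It assumes independent N(0, prior_sd²) priors on a hypothetical intercept and slope, a single covariate value x = 1, and measures how often the implied success probability lands within 0.01 of zero or one.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def extreme_fraction(prior_sd, x=1.0, draws=20000, eps=0.01, seed=0):
    """Share of prior draws whose implied P(y=1 | x) falls outside [eps, 1 - eps]."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(draws):
        b0 = rng.gauss(0.0, prior_sd)  # intercept drawn from the prior
        b1 = rng.gauss(0.0, prior_sd)  # slope drawn from the prior
        p = sigmoid(b0 + b1 * x)
        if p < eps or p > 1.0 - eps:
            extreme += 1
    return extreme / draws

# A "vague" prior on the coefficients is far from vague on the probability scale.
for sd in (1.0, 10.0):
    print(f"prior sd = {sd:4.1f}: share of near-degenerate probabilities = "
          f"{extreme_fraction(sd):.3f}")
```

Under these assumptions the wide prior puts most of its mass on data generating processes where the outcome is almost deterministic, which only becomes visible once the likelihood is brought in.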

Greek variations on power-expected-posterior priors

Posted in Books, Statistics, University life on October 5, 2016 by xi'an

Dimitris Fouskakis, Ioannis Ntzoufras and Konstantinos Perrakis, from Athens, have just arXived a paper on power-expected-posterior priors. Just like the power prior and the expected-posterior prior, this approach aims at avoiding improper priors by the use of imaginary data, whose distribution is itself the marginal against another prior. (In the papers I wrote on that topic with Juan Antonio Cano and Diego Salmerón, we used MCMC to figure out a fixed point for such priors.)

The current paper (which I only perused) studies properties of two versions of power-expected-posterior priors proposed in an earlier paper by the same authors. For the normal linear model. Using a posterior derived from an unnormalised powered likelihood either (DR) integrated in the imaginary data against the prior predictive distribution of the reference model based on the powered likelihood, or (CR) integrated in the imaginary data against the prior predictive distribution of the reference model based on the actual likelihood. The baseline model being the g-prior with g=n². Both versions lead to a marginal likelihood that is similar to BIC and hence consistent. The DR version coincides with the original power-expected-posterior prior in the linear case. The CR version involves a change of covariance matrix. All in all, the CR version tends to favour less complex models, but is less parsimonious as a variable selection tool, which sounds a wee bit contradictory. Overall, I thus feel (possibly incorrectly) that the paper is more an appendix to the earlier paper than a paper in itself, as I do not get in the end a clear impression of which method should be preferred.

a mistake in a 1990 paper

Posted in Kids, Statistics, University life on August 7, 2016 by xi'an

As we were working on the Handbook of mixture analysis with Sylvia Frühwirth-Schnatter and Gilles Celeux today, near Saint-Germain-des-Prés, I realised that there was a mistake in our 1990 mixture paper with Jean Diebolt [published in 1994], in that when we are proposing to use improper “Jeffreys” priors under the restriction that no component of the Gaussian mixture is “empty”, meaning that there are at least two observations generated from each component, the likelihood needs to be renormalised to be a density for the sample. This normalisation constant only depends on the weights of the mixture, which means that, when simulating from the full conditional distribution of the weights, there should be an extra acceptance step to account for this correction. Of course, the term is essentially equal to one for a large enough sample but this remains a mistake nonetheless! It is funny that it remained undetected for so long in my most cited paper. Checking on Larry’s 1999 paper exploring the idea of excluding terms from the likelihood to allow for improper priors, I did not spot him using a correction either.
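A small sketch of why this constant depends on the weights alone, restricted to the two-component case for simplicity (an assumption of this illustration, not the general setting of the paper): integrating the restricted likelihood over the sample leaves the probability that the latent allocations put at least two observations in each component, which is a Binomial tail computation in the weight p. The function name and the values of n and p are hypothetical.

```python
import math

def prob_no_empty_component(n, p, min_count=2):
    """P(both components of a two-component mixture receive at least
    `min_count` of the n allocations), when each allocation goes to
    component 1 independently with probability p."""
    total = 0.0
    for k in range(min_count, n - min_count + 1):
        total += math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
    return total

# The constant is far from one for tiny samples or very unbalanced weights,
# and essentially equal to one for a large sample with balanced weights.
for n, p in ((4, 0.5), (100, 0.02), (100, 0.5)):
    print(f"n={n:3d}, p={p:.2f}: normalising constant = "
          f"{prob_no_empty_component(n, p):.6f}")
```

For n = 4 and p = 0.5 only the balanced split is allowed, giving 6/16 = 0.375. With more than two components the same quantity becomes a multinomial probability, which this sketch does not cover.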

Nonparametric applications of Bayesian inference

Posted in Books, Statistics, University life on April 22, 2016 by xi'an

Gary Chamberlain and Guido Imbens published this paper in the Journal of Business & Economic Statistics in 2003. I just came to read it in connection with the paper by Luke Bornn, Neil Shephard and Reza Solgi that I commented on a few months ago. The setting is somewhat similar: given a finite support distribution with associated probability parameter θ, a natural prior on θ is a Dirichlet prior. This prior induces a prior on transforms of θ, whether or not they are in closed form (for instance as the solution of a moment equation E[F(X,β)]=0), as in Bornn et al. In this paper, Chamberlain and Imbens argue in favour of the limiting Dirichlet with all coefficients equal to zero as a way to avoid prior dominating influence when the number of classes J goes to infinity and the data size remains fixed. But they fail to address the issue that the posterior is no longer defined since some classes get unobserved. They consider instead that the parameters corresponding to those classes are equal to zero with probability one, a convention and not a result. (The computational advantage in using the improper prior sounds at best incremental.) The notion of letting some Dirichlet hyper-parameters go to zero is somewhat foreign to a Bayesian perspective as those quantities should be either fixed or distributed according to a hyper-prior, rather than set to converge according to a certain topology that has nothing to do with prior modelling. (Another reason why setting those quantities to zero does not have the same meaning as picking a Dirac mass at zero.)
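The prior dominance at stake can be sketched with the standard conjugate formula: under a symmetric Dirichlet(α, …, α) prior and multinomial counts n_j summing to n, the posterior mean of the j-th probability is (n_j + α)/(n + Jα). The counts below are hypothetical and the code is only an illustration of that formula, not of the paper's computations.

```python
def dirichlet_posterior_mean(counts, alpha):
    """Posterior mean of each class probability under a symmetric
    Dirichlet(alpha, ..., alpha) prior and multinomial counts."""
    n, J = sum(counts), len(counts)
    return [(c + alpha) / (n + J * alpha) for c in counts]

observed = [6, 3, 1]  # hypothetical counts, n = 10 observations in 3 classes
for J in (3, 30, 3000):
    # Pad with unobserved classes so that the support has J points in total.
    counts = observed + [0] * (J - len(observed))
    for alpha in (1.0, 0.5, 1e-6):
        mean_first = dirichlet_posterior_mean(counts, alpha)[0]
        print(f"J={J:4d}, alpha={alpha:g}: posterior mean of class 1 = {mean_first:.4f}")
```

With α fixed, the prior term Jα swamps the ten observations as J grows, while the α → 0 limit returns the empirical frequencies and hence assigns zero to every unobserved class, which is the convention discussed above.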

“To allow for the possibility of an improper posterior distribution…” (p.4)

This is a weird beginning of a sentence, especially when followed by a concept of expected posterior distribution, which is actually a bootstrap expectation. Not as in Bayesian bootstrap, mind. And thus this feels quite orthogonal to the Bayesian approach. I do however find most interesting this notion of constructing a true expected posterior by imposing samples that ensure properness as it reminds me of our approach to mixtures with Jean Diebolt, where (latent) allocations were prohibited to induce improper priors. The bootstrapped posterior distribution seems to be proposed mostly for assessing the impact of the prior modelling, albeit in a non-quantitative manner. (I fail to understand how the very small bootstrap sample sizes are chosen.)

Obviously, there is a massive difference between this paper and Bornn et al., where the authors use two competing priors in parallel, one on θ and one on β, which induces difficulties in setting priors since the parameter space is concentrated upon a manifold. (In which case I wonder what would happen if one implemented the preposterior idea of Berger and Pérez, 2002, to derive a fixed point solution. That we implemented recently with Diego Salmerón and Juan Antonio Cano in a paper published in Statistica Sinica. This exhibits a similarity with the above bootstrap proposal in that the posterior gets averaged wrt another posterior.)

at CIRM [#2]

Posted in Mountains, pictures, Running, Statistics, Travel, University life on March 2, 2016 by xi'an

Sylvia Richardson gave a great talk yesterday on clustering applied to variable selection, which first raised [in me] a usual worry about the lack of background model for clustering. But the way she used this notion meant there was an infinite Dirichlet process mixture model behind. This is quite novel [at least for me!] in that it addresses the covariates and not the observations themselves. I still wonder at the meaning of the cluster as, if I understood properly, the dependent variable is not involved in the clustering. Check her R package PReMiuM for a practical implementation of the approach. Later, Adeline Samson showed us the results of using pMCMC versus particle Gibbs for diffusion processes where (a) pMCMC was behaving much worse than particle Gibbs and (b) EM required very few particles and Metropolis-Hastings steps to achieve convergence, when compared with posterior approximations.

Today Pierre Druilhet explained to the audience of the summer school his measure theoretic approach [I discussed a while ago] to the limit of proper priors via q-vague convergence, with the paradoxical phenomenon that a Be(n⁻¹,n⁻¹) converges to a sum of two Dirac masses when the parameter space is [0,1] but to Haldane’s prior when the space is (0,1)! He also explained why the Jeffreys-Lindley paradox vanishes when considering different measures [with an illustration that came from my Statistica Sinica 1993 paper]. Pierre concluded with the above opposition between two Bayesian paradigms, a [sort of] tale of two sigma [fields]! Not that I necessarily agree with the first paradigm that priors are supposed to have generated the actual parameter. If only because it mechanistically excludes all improper priors…
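Both limits can be glimpsed numerically with a rough sketch (a plain numerical check, not Pierre's measure-theoretic argument; the cut-off eps = 0.01 and the evaluation point x = 0.3 are arbitrary choices). The Be(1/n, 1/n) mass on [eps, 1−eps] vanishes as n grows, pointing to the two Dirac masses on [0,1], while the shape of the density inside (0,1), once rescaled by its value at 1/2, approaches Haldane's 1/x(1−x).

```python
import math

def beta_density(x, a):
    """Density of the Be(a, a) distribution at x in (0, 1)."""
    log_norm = 2.0 * math.lgamma(a) - math.lgamma(2.0 * a)  # log B(a, a)
    return math.exp((a - 1.0) * (math.log(x) + math.log(1.0 - x)) - log_norm)

def middle_mass(a, eps=0.01, steps=20000):
    """Midpoint-rule approximation of the Be(a, a) mass on [eps, 1 - eps]."""
    h = (1.0 - 2.0 * eps) / steps
    return h * sum(beta_density(eps + (i + 0.5) * h, a) for i in range(steps))

def shape_ratio(x, a):
    """Be(a, a) density at x relative to its value at 1/2."""
    return beta_density(x, a) / beta_density(0.5, a)

for n in (1, 10, 100):
    print(f"n={n:3d}: mass of Be(1/n,1/n) on [0.01, 0.99] = {middle_mass(1.0 / n):.4f}")

# Haldane's density 1/(x(1-x)), rescaled the same way, equals 0.25/(x(1-x)).
x = 0.3
print(f"shape ratio at x={x} for a=1e-4: {shape_ratio(x, 1e-4):.5f}")
print(f"Haldane shape ratio at x={x}:     {0.25 / (x * (1.0 - x)):.5f}")
```

The normalisation is what separates the two answers: keeping total mass one pushes everything to the endpoints, whereas rescaling so that the interior stays visible recovers Haldane's improper density.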

Darren Wilkinson talked about yeast, which is orders of magnitude more exciting than it sounds, because this is Bayesian big data analysis in action! With significant (and hence impressive) results based on stochastic dynamic models. And massive variable selection techniques. Scala, Haskell, Frege, OCaml were [functional] languages he mentioned that I had never heard of before! And Daniel Rudolf concluded the [intense] second day of this Bayesian week at CIRM with a description of his convergence results for (rather controlled) noisy MCMC algorithms.

covariant priors, Jeffreys and paradoxes

Posted in Books, Statistics, University life on February 9, 2016 by xi'an

“If no information is available, π(α|M) must not deliver information about α.”

In a recent arXival apparently submitted to Bayesian Analysis, Giovanni Mana and Carlo Palmisano discuss the choice of priors in metrology. Which reminded me of this meeting I attended at the Bureau des Poids et Mesures in Sèvres where similar debates took place, albeit being led by ferocious anti-Bayesians! Their reference prior appears to be the Jeffreys prior, because of its reparameterisation invariance.

“The relevance of the Jeffreys rule in metrology and in expressing uncertainties in measurements resides in the metric invariance.”

This, along with a second order approximation to the Kullback-Leibler divergence, is indeed one reason for advocating the use of a Jeffreys prior. I at first found it surprising that the (usually improper) prior is used in a marginal likelihood, as it cannot be normalised. A source of much debate [and of our alternative proposal].

“To make a meaningful posterior distribution and uncertainty assessment, the prior density must be covariant; that is, the prior distributions of different parameterizations must be obtained by transformations of variables. Furthermore, it is necessary that the prior densities are proper.”

The above quote is quite interesting both in that the notion of covariant is used rather than invariant or equivariant. And in that properness is indicated as a requirement. (Even more surprising is the noun associated with covariant, since it clashes with the usual notion of covariance!) They conclude that the marginal associated with an improper prior is null because the normalising constant of the prior is infinite.

“…the posterior probability of a selected model must not be null; therefore, improper priors are not allowed.”

Maybe not so surprisingly given this stance on improper priors, the authors cover a collection of “paradoxes” in their final and longest section, most of which make little sense to me. First, they point out that the reference priors of Berger, Bernardo and Sun (2015) are not invariant, but this should not come as a surprise given that they focus on parameters of interest versus nuisance parameters. The second issue pointed out by the authors is that under Jeffreys’ prior, the posterior distribution of a given normal mean for n observations is a t with n degrees of freedom while it is a t with n-1 degrees of freedom from a frequentist perspective. This is not such a paradox since both distributions work in different spaces. Further, unless I am confused, this is one of the marginalisation paradoxes, whose more straightforward explanation is that marginalisation is not meaningful for improper priors. A third paradox relates to a contingency table with a large number of cells, in that the posterior mean of a cell probability goes to zero as the number of cells goes to infinity. (In this case, Jeffreys’ prior is proper.) Again not much of a bummer, there is simply not enough information in the data when faced with an infinite number of parameters. Paradox #4 is the Stein paradox, when estimating the squared norm of a normal mean. Jeffreys’ prior then leads to a constant bias that increases with the dimension of the vector. Definitely a bad point for Jeffreys’ prior, except that there is no Bayes estimator in such a case, the Bayes risk being infinite. Using a renormalised loss function solves the issue, rather than introducing as in the paper uniform priors on intervals, which require hyperpriors without being particularly compelling. The fifth paradox is the Neyman-Scott problem, with again the Jeffreys prior the culprit since the estimator of the variance is inconsistent. By a multiplicative factor of 2. Another stone in Jeffreys’ garden [of forking paths!]. The authors consider that the prior gives zero weight to any interval not containing zero, as if it was a proper probability distribution. And “solve” the problem by avoiding zero altogether, which of course requires specifying a lower bound on the variance. And then introducing another (improper) Jeffreys prior on that bound… The last and final paradox mentioned in this paper is one of the marginalisation paradoxes, with a bizarre explanation that since the mean and variance μ and σ are not independent a posteriori, “the information delivered by x̄ should not be neglected”.
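The factor of 2 in the Neyman-Scott problem can be checked by simulation. The sketch below uses the classical maximum likelihood estimator rather than the Bayesian computation of the paper (an assumption of this illustration), with pairs of observations sharing a pair-specific mean and a common variance σ². The Uniform(−10, 10) draw for the means and the value σ = 2 are arbitrary.

```python
import random

def neyman_scott_mle(n_pairs, sigma, seed=0):
    """MLE of the common variance when each pair of observations has its
    own unknown mean: pooled squared deviations divided by 2 * n_pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        mu = rng.uniform(-10.0, 10.0)  # arbitrary pair-specific mean
        x1 = rng.gauss(mu, sigma)
        x2 = rng.gauss(mu, sigma)
        xbar = 0.5 * (x1 + x2)  # MLE of the pair-specific mean
        total += (x1 - xbar) ** 2 + (x2 - xbar) ** 2
    return total / (2 * n_pairs)

sigma = 2.0
estimate = neyman_scott_mle(20000, sigma)
print(f"true variance = {sigma**2:.2f}, estimate = {estimate:.3f}, "
      f"twice the estimate = {2 * estimate:.3f}")
```

Since the number of nuisance means grows with the number of pairs, the estimate settles near σ²/2 however many pairs are observed, which is the inconsistency referred to above.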

read paper [in Bristol]

Posted in Books, pictures, Statistics, Travel, University life on January 29, 2016 by xi'an

[Photo: Clifton & Durdham Downs, Bristol, Sept. 25, 2012]

I went to give a seminar in Bristol last Friday and I chose to present the testing with mixture paper. As we are busy working on the revision, I was eagerly looking for comments and criticisms that could strengthen this new version. As it happened, the (Bristol) Bayesian Cake (Reading) Club had chosen our paper for discussion, two weeks in a row!, hence the title!, and I got invited to join the group the morning prior to the seminar! This was, of course, most enjoyable and relaxed, including a home-made cake!, but also quite helpful in assessing our arguments in the paper. One point of contention or at least of discussion was the common parametrisation between the components of the mixture. Although all parametrisations are equivalent from a single component point of view, I can [almost] see why using a mixture with the same parameter value on all components may impose some unsuspected constraint on that parameter. Even when the parameter is the same moment for both components. This still sounds like a minor counterpoint in that the weight should converge to either zero or one and hence eventually favour the posterior on the parameter corresponding to the “true” model.

Another point that was raised during the discussion is the behaviour of the method under misspecification or for an M-open framework: when neither model is correct does the weight still converge to the boundary associated with the closest model (as I believe) or does a convexity argument produce a non-zero weight as its limit (as hinted by one example in the paper)? I had thought very little about this and hence had just as little to argue, though this does not sound to me like the primary reason for conducting tests. Especially in a Bayesian framework. If one is uncertain about both models to be compared, one should have an alternative at the ready! Or use a non-parametric version, which is a direction we need to explore deeper before deciding it is coherent and convergent!

A third point of discussion was my argument that mixtures allow us to rely on the same parameter and hence the same prior, whether proper or not, while Bayes factors are less clearly open to this interpretation. This was not uniformly accepted!

Thinking afresh about this approach also led me to broaden my perspective on the use of the posterior distribution of the weight(s) α: while previously I had taken those weights mostly as a proxy to the posterior probabilities, to be calibrated by pseudo-data experiments, as for instance in Figure 9, I now perceive them primarily as the portion of the data in agreement with the corresponding model [or hypothesis] and more importantly as a solution for staying away from a Neyman-Pearson-like decision. Or error evaluation. Usually, when asked about the interpretation of the output, my answer is to compare the behaviour of the posterior on the weight(s) with a posterior associated with a sample from each model. Which does sound somewhat similar to posterior predictives if the samples are simulated from the associated predictives. But the issue was not raised during the visit to Bristol, which possibly reflects on how unfrequentist the audience was [the Statistics group is], as it apparently accepted with no further ado the use of a posterior distribution as a soft assessment of the comparative fits of the different models. If not necessarily agreeing on the need for conducting hypothesis testing (especially in the case of the Pima Indian dataset!).