## Jeffreys prior with improper posterior

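**I**n a complete coincidence with my visit to Warwick this week, I became aware of the paper “Inference in two-piece location-scale models with Jeffreys priors”, recently published in Bayesian Analysis by Francisco Rubio and Mark Steel, both from Warwick. A paper where they exhibit a closed-form Jeffreys prior for the skewed distribution

$$\frac{2\epsilon}{\sigma_1}\,f\left(\frac{x-\mu}{\sigma_1}\right)\mathbb{I}_{x<\mu}+\frac{2(1-\epsilon)}{\sigma_2}\,f\left(\frac{x-\mu}{\sigma_2}\right)\mathbb{I}_{x>\mu}$$

where f is a symmetric density, namely

$$\pi(\mu,\sigma_1,\sigma_2)\propto\frac{1}{\sigma_1\sigma_2\{\sigma_1+\sigma_2\}}$$

where

$$\epsilon=\frac{\sigma_1}{\sigma_1+\sigma_2}$$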

only to show immediately after that this prior does not allow for a proper posterior, no matter what the sample size is. While the above skewed distribution can always be interpreted as a mixture, being a weighted sum of two terms, it is not strictly speaking a mixture, if only because the “component” can be identified from the observation (depending on which side of μ it stands). The likelihood is therefore a product of simple terms rather than a product of a sum of two terms.
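To make the "product of simple terms" remark concrete, here is a minimal sketch, assuming the usual two-piece form in which both pieces share the normalising factor 2/(σ1+σ2), with scale σ1 to the left of μ and σ2 to the right. The function names and the choice of a standard normal for f are illustrative assumptions of mine, not taken from the paper.

```python
import numpy as np

def std_normal_pdf(t):
    """Symmetric base density f (assumed standard normal for illustration)."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def two_piece_logpdf(x, mu, sigma1, sigma2, f=std_normal_pdf):
    """Log-density of the assumed two-piece location-scale model."""
    x = np.asarray(x, dtype=float)
    # The side of mu on which x falls picks the scale: no latent variable needed.
    scale = np.where(x < mu, sigma1, sigma2)
    return np.log(2.0 / (sigma1 + sigma2)) + np.log(f((x - mu) / scale))

def two_piece_loglik(x, mu, sigma1, sigma2, f=std_normal_pdf):
    """Log-likelihood written as a sum over the left and right subsamples."""
    x = np.asarray(x, dtype=float)
    left, right = x[x < mu], x[x >= mu]
    log_norm = x.size * np.log(2.0 / (sigma1 + sigma2))
    return float(log_norm
                 + np.sum(np.log(f((left - mu) / sigma1)))
                 + np.sum(np.log(f((right - mu) / sigma2))))
```

For a given μ the sample splits deterministically into a left and a right part, so each observation contributes a single term, in contrast with a genuine mixture where each observation contributes a sum over components.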

**A**s a solution to this conundrum, the authors consider the alternative of the “independent Jeffreys priors”, which are made of a product of conditional Jeffreys priors, i.e., by computing the Jeffreys prior one parameter at a time with all other parameters considered to be fixed. Which differs from the reference prior, of course, but would have been my second choice as well. Despite criticisms expressed by José Bernardo in the discussion of the paper… The difficulty (in my opinion) resides in the choice (and difficulty) of the parameterisation of the model, since those priors are not parameterisation-invariant. (Xinyi Xu makes the important comment that even those priors incorporate strong if hidden information. Which relates to our earlier discussion with Kaniav Kamari on the “dangers” of prior modelling.)
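For reference, the two constructions can be contrasted in generic notation. This is a sketch under the assumption that θ = (μ, σ1, σ2) and that I(θ) denotes the Fisher information matrix; the symbols π_J and π_I are my own labels rather than the paper's.

```latex
% Jeffreys prior: built from the full Fisher information matrix
\pi_J(\theta) \propto \sqrt{\det I(\theta)}

% Independent Jeffreys prior: one parameter at a time, the others held fixed
\pi_I(\theta) \propto \prod_{j} \pi_j(\theta_j),
\qquad
\pi_j(\theta_j) \propto \sqrt{I_{jj}(\theta)}
\quad \text{as a function of } \theta_j \text{ alone}
```

The first is invariant under reparameterisation of θ as a whole, while the second depends on which coordinates are treated as "the" parameters, which is where the choice of parameterisation enters.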

**A**lthough the outcome is puzzling, I remain just slightly sceptical of the input, namely the Jeffreys prior and the corresponding Fisher information: the fact that the density involves an indicator function and is thus discontinuous in the location μ at the observation x makes the likelihood function not differentiable and hence the derivation of the Fisher information not strictly valid, since the indicator part cannot be differentiated. Not that I am seeing the Jeffreys prior as the ultimate grail for non-informative priors, far from it, but there is definitely something specific in the discontinuity in the density. (In connection with the latter point, Weiss and Suchard deliver a highly critical commentary on the non-need for reference priors and the preference given to a non-parametric Bayes primary analysis. Maybe making the point towards a greater convergence of the two perspectives, objective Bayes and non-parametric Bayes.)

**T**his paper and the ensuing discussion about the properness of the Jeffreys posterior reminded me of our earliest paper on the topic with Jean Diebolt. Where we used improper priors on location and scale parameters but prohibited allocations (in the Gibbs sampler) that would lead to less than two observations per component, thereby ensuring that the (truncated) posterior was well-defined. (This feature also remained in the Series B paper, submitted at the same time, namely mid-1990, but only published in 1994!) Larry Wasserman proved ten years later that this truncation led to consistent estimators, but I had not thought about it in a very long while. I still like this notion of forcing some (enough) datapoints into each component for an allocation (of the latent indicator variables) to be an acceptable Gibbs move. This is obviously not compatible with the iid representation of a mixture model, but it expresses the requirement that components all have a meaning *in terms of the data*, namely that all components contributed to generating a part of the data. This translates as a form of weak prior information on how much we trust the model and how meaningful each component is (in opposition to adding meaningless extra-components with almost zero weights or almost identical parameters).
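The restriction on allocations can be sketched as follows. This is an illustration of the idea rather than the implementation of the original papers: the function names, the single-site update scheme, and the `probs` matrix (assumed to hold the unconstrained allocation probabilities of each observation, one row per observation, rows summing to one) are all assumptions of mine.

```python
import numpy as np

def allocation_is_acceptable(z, n_components, min_count=2):
    """True if every component holds at least `min_count` observations."""
    counts = np.bincount(z, minlength=n_components)
    return bool(np.all(counts >= min_count))

def constrained_allocation_sweep(z, probs, rng, min_count=2):
    """One sweep of single-site allocation updates under the truncation.

    z     : current allocations (assumed acceptable to start with)
    probs : (n, k) array of unconstrained allocation probabilities (assumed given)
    rng   : numpy.random.Generator
    """
    z = np.array(z, dtype=int)
    n, k = probs.shape
    counts = np.bincount(z, minlength=k)
    for i in range(n):
        proposal = rng.choice(k, p=probs[i])
        if proposal == z[i]:
            continue
        # Prohibited move: it would leave the current component
        # with fewer than `min_count` observations.
        if counts[z[i]] - 1 < min_count:
            continue
        counts[z[i]] -= 1
        counts[proposal] += 1
        z[i] = proposal
    return z
```

Starting from an acceptable allocation, every sweep returns an acceptable allocation, so each component always keeps enough observations for its location and scale to be estimable under the improper prior.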

**A**s a marginalia, the insistence in Rubio and Steel’s paper that all observations in the sample be different also reminded me of a discussion I wrote for one of the Valencia proceedings (Valencia 6 in 1998) where Mark presented a paper with Carmen Fernández on this issue of handling duplicated observations modelled by absolutely continuous distributions. (I am afraid my discussion is not worth the $250 price tag given by Amazon!)

May 12, 2014 at 1:41 pm

Many thanks, Prof. Robert, for the attention to our paper and your comments.

You expressed a concern about the lack of differentiability of the density at the mode. In this context, all we need to verify is that the first derivative of the density, for a regular underlying symmetric f, does exist. It is the second derivative that does not exist at the mode (see “On parameter orthogonality in symmetric and skew models”, Jones and Anaya-Izquierdo, 2011). For this reason, we have used the basic definition of the Fisher information matrix (FIM), which only involves first derivatives. Moreover, the existence of the FIM usually requires differentiability almost everywhere.
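The distinction between the two forms of the Fisher information can be written out as follows, in generic notation where s(x | θ) stands for the sampling density (a symbol I am introducing here for illustration).

```latex
% Basic definition: only first derivatives of the log-density are involved
I(\theta)_{ij} = \mathbb{E}_\theta\left[
  \frac{\partial \log s(X \mid \theta)}{\partial \theta_i}\,
  \frac{\partial \log s(X \mid \theta)}{\partial \theta_j}
\right]

% Second-derivative form: valid only under additional regularity conditions
I(\theta)_{ij} = -\,\mathbb{E}_\theta\left[
  \frac{\partial^2 \log s(X \mid \theta)}{\partial \theta_i\,\partial \theta_j}
\right]
```

The first expression remains meaningful when the score exists almost everywhere, which is the setting invoked in the comment above, while the second requires second derivatives that fail to exist at the mode.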

The presence of repeated observations is in fact something that has to be taken into consideration when using improper priors, given that this may destroy the existence of the posterior under some sampling models. This is an issue of practical, not just theoretical, importance.

The discussion of the paper in Bayesian Analysis is very interesting, indeed, covering prior elicitation for this sort of model (and in general), pros and cons of different “objective” priors, and different sorts of flexible models.

May 12, 2014 at 5:30 pm

Thanks! I was not particularly worried about the differentiability, to be completely fair!!! Non-everywhere-differentiable likelihoods however contain a sort of information that may fail to be reflected by Fisher’s information: the clearcut gap in the density at

However, I more seriously object to the point about repeated observations, as having identical observations does not agree with a (Lebesgue) absolutely continuous model. This also was the core of my comments on the Valencia 6 paper. If repeated values are a possible occurrence, the model should reflect this possibility.

May 12, 2014 at 12:40 am

The objective Bayes vs non-Parametric Bayes argument/merging puzzles me slightly. Flexibility does not imply objectiveness. For awkward parameters (where the geometry of the path traced by the parameter(s) through the model space is highly nonlinear), I don’t immediately see how NP Bayes makes the problem easier. Doesn’t it just shift the difficulty? (Or am I missing the point, as usual?)

May 12, 2014 at 10:26 am

What I mean by this suggestion is that NP Bayes is rarely endowed with subjective priors that one can support. The choice of reference priors in NP settings is therefore even more relevant than in parametric models.

May 12, 2014 at 10:40 am

Ah! yes! That’s really interesting!!! And should it be difficult?

For a subjective prior, Watson and Holmes gave that nifty description of the Dirichlet process as sampling in KL balls of fixed radius around a (discrete) base distribution, so you could imagine putting a prior on the parameter of the DP that controls how far you’re going [say, an exponential prior or exp(-lambda *sqrt(.))].

I wonder if there’s a way to shift this over to the reference setting…