**L**ast December, Gunnar Taraldsen, Jarle Tufto, and Bo H. Lindqvist arXived a paper on using priors that lead to improper posteriors and [trying to] get away with it! The central concept in their approach is Rényi’s generalisation of Kolmogorov’s axioms, which defines conditional probability distributions from infinite mass measures by conditioning on finite mass measurable sets. A position adopted by Dennis Lindley in his 1964 book. And already discussed in a few ‘Og’s posts. While the theory thus developed indeed allows for the manipulation of improper posteriors, I have difficulties with the inferential aspects of the construct, since one cannot condition on an arbitrary finite measurable set without prior information. Things get a wee bit more outwardly when considering “data” with infinite mass, in Section 4.2, since they cannot be properly normalised (although I find the example of the degenerate multivariate Gaussian distribution puzzling as it is not a matter of improperness, since the degenerate Gaussian has a well-defined density against the right dominating measure). The paper also discusses marginalisation paradoxes, by acknowledging that marginalisation is no longer feasible with improper quantities. And the Jeffreys-Lindley paradox, with a resolution that uses the sum of the Dirac mass at the null, δ⁰, and of the Lebesgue measure on the real line, λ, as the dominating measure. This indeed solves the issue of the arbitrary constant in the Bayes factor, since it is “the same” on the null hypothesis and elsewhere, but I do not buy the argument, as I see no reason to favour δ⁰+λ over 3.141516 δ⁰+λ or δ⁰+1.61718 λ… (This section 4.5 also illustrates that the choice of the sequence of conditioning sets has an impact on the limiting measure, in the Rényi sense.) In conclusion, after reading the paper, I remain uncertain as to how to exploit this generalisation from an inferential (Bayesian?) viewpoint, since improper posteriors do not clearly lead to well-defined inferential procedures…
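To make the objection about the dominating measure concrete, here is a minimal sketch under assumptions of my own rather than the paper's setting: a single observation x ~ N(θ,1), the point null θ=0, and the measure a·δ⁰ + c·λ used directly as the (improper) prior. Since the Normal density integrates to one in θ, the posterior mass of the null is aφ(x)/(aφ(x)+c), and the function name and weights below are purely illustrative.

```python
import math

def normal_pdf(x):
    """Standard Normal density phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def posterior_null_mass(x, dirac_weight=1.0, lebesgue_weight=1.0):
    """Posterior mass of {theta = 0} for x ~ N(theta, 1) under the
    prior measure dirac_weight * delta_0 + lebesgue_weight * Lebesgue.
    (Toy setting assumed for illustration, not taken from the paper.)"""
    null_term = dirac_weight * normal_pdf(x)
    # integral over theta of phi(x - theta) equals 1, whatever x is
    alt_term = lebesgue_weight * 1.0
    return null_term / (null_term + alt_term)

x = 1.96
for a, c in [(1.0, 1.0), (3.141516, 1.0), (1.0, 1.61718)]:
    print(a, c, posterior_null_mass(x, a, c))
```

The three calls return three different posterior masses for the same observation, which is the point: nothing in the data singles out one pair of weights.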

## Archive for improper priors

## statistics with improper posteriors [or not]

Posted in Statistics with tags Alfréd Rényi, Andrei Kolmogorov, Dennis Lindley, improper posteriors, improper priors, Jeffreys-Lindley paradox, marginalisation paradoxes on March 6, 2019 by xi'an

## Jeffreys priors for hypothesis testing [Bayesian reads #2]

Posted in Books, Statistics, University life with tags Arnold Zellner, Bayes factor, Bayesian tests of hypotheses, CDT, class, classics, Gaussian mixture, improper priors, Jeffreys prior, JRSSB, Kullback-Leibler divergence, Oxford, PhD course, Saint Giles cemetery, Susie Bayarri, Theory of Probability, University of Oxford on February 9, 2019 by xi'an

A second (re)visit to a reference paper I gave to my OxWaSP students for the last round of this CDT joint program. Indeed, this may be my first complete read of Susie Bayarri and Gonzalo Garcia-Donato’s 2008 Series B paper, inspired by Jeffreys’, Zellner’s and Siow’s proposals in the Normal case. *(Disclaimer: I was not the JRSS B editor for this paper.)* Which I saw as a talk at the O’Bayes 2009 meeting in Philly.

The paper aims at constructing formal rules for objective proper priors in testing embedded hypotheses, in the spirit of Jeffreys’ Theory of Probability “hidden gem” (Chapter 3). The proposal is based on symmetrised versions of the Kullback-Leibler divergence κ between null and alternative used in a transform like an inverse power of 1+κ. With a power large enough to make the prior proper. Eventually multiplied by a reference measure (i.e., the arbitrary choice of a dominating measure). Can be generalised to any intrinsic loss (not to be confused with an intrinsic prior à la Berger and Pericchi!). Approximately Cauchy or Student’s t by a Taylor expansion. To be compared with Jeffreys’ original prior equal to the derivative of the atan transform of the root divergence (!). A delicate calibration by an effective sample size, lacking a general definition.
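As a minimal sketch of the construction, assume the simplest case of testing N(0,1) against N(θ,1) with a single observation and no effective sample size calibration (my simplification, not the paper's general recipe). The sum-symmetrised Kullback-Leibler divergence is then κ(θ)=θ², the prior proportional to (1+κ)^(-q) is proper for q>1/2, and the power q=1 gives exactly the Cauchy density, which is also what Jeffreys’ atan construction returns here. All function names are hypothetical.

```python
import math

def sym_kl_normal(theta):
    """Sum-symmetrised KL divergence between N(0,1) and N(theta,1):
    theta^2/2 in each direction, hence theta^2 in total."""
    return theta * theta

def divergence_prior(theta, q=1.0):
    """Normalised density proportional to (1 + kappa(theta))^(-q).
    The normalising constant sqrt(pi)*Gamma(q-1/2)/Gamma(q) is finite
    only when q > 1/2, i.e. when the power is large enough."""
    if q <= 0.5:
        raise ValueError("q must exceed 1/2 for the prior to be proper")
    norm = math.sqrt(math.pi) * math.gamma(q - 0.5) / math.gamma(q)
    return (1.0 + sym_kl_normal(theta)) ** (-q) / norm

def jeffreys_atan_prior(theta, h=1e-6):
    """(1/pi) * |d/dtheta atan(sqrt(kappa(theta)))| by central finite
    difference; valid for theta away from zero (|theta| > h)."""
    def f(t):
        return math.atan(math.sqrt(sym_kl_normal(t))) / math.pi
    return abs(f(theta + h) - f(theta - h)) / (2.0 * h)

def cauchy_pdf(theta):
    """Standard Cauchy density, for comparison."""
    return 1.0 / (math.pi * (1.0 + theta * theta))

for t in (0.5, 1.0, 3.0):
    print(t, divergence_prior(t), jeffreys_atan_prior(t), cauchy_pdf(t))
```

Larger values of q produce lighter, Student-like tails, which is one way of reading the “approximately Cauchy or Student’s t” remark in this toy case.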

At the start the authors rightly insist on having the nuisance parameter ν differ for each model but… as we all often do, they relapse to having the “same ν” in both models for integrability reasons. Nuisance parameters make the definition of the divergence prior somewhat harder. Or somewhat arbitrary. Indeed, as in reference prior settings, the authors work first conditional on the nuisance then use a prior on ν that may be improper by the “same” argument. (Although *conditioning* is not the proper term if the marginal prior on ν is improper.)

The paper also contains an interesting case of the translated Exponential, where the prior is L¹ Student’s t with 2 degrees of freedom. And another one of mixture models albeit in the simple case of a location parameter on one component only.

## foundations of probability

Posted in Books, Statistics with tags Alfréd Rényi, Foundations of Probability, improper priors, The American Statistician, University of Warwick on December 1, 2017 by xi'an

**F**ollowing my reading of a note by Gunnar Taraldsen and co-authors on improper priors, I checked the 1970 book of Rényi from the Library at Warwick. (First time I visited this library, where I got very efficient help in finding and borrowing this book!)

“…estimates of probability of an event made by different persons may be different and each such estimate is to a certain extent subjective.” (p.33)

The main argument from Rényi used by the above mentioned note (and an earlier paper in The American Statistician) is that “*every probability is in reality a conditional probability*” (p.34). Which may be a pleonasm as everything depends on the settings in which it is applied. And as such not particularly new since conditioning is also present in e.g. Jeffreys’ book. In this approach, the definition of the conditional probability is traditional, if restricted to conditioning on a subset of elements from the σ-algebra. The interesting part in the book is rather that a measure on this subset can be derived from the conditionals. And extended to the whole σ-algebra. And is unique up to a multiplicative constant. Interesting because this indeed produces a rigorous way of handling improper priors.

“Let the random point (ξ,η) be uniformly distributed over the whole (x,y) plane.” (p.83)

Rényi also defines *random variables* ξ on conditional probability spaces, with conditional densities. With constraints on ξ for those to exist. I have more difficulty digesting this notion as I do not see the meaning of the above quote or of the quantity

**P(**a<ξ<b**|**c<ξ<d**)**

when **P(**a<ξ<b**)** is not defined. As for instance I see no way of generating such a ξ in this case. (Of course, it is always possible to bring in a new definition of random variables that only agrees with regular ones for finite measure.)
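For what it is worth, the formal content of such a conditional is easy to compute. The sketch below assumes ξ is “uniform on the real line”, i.e. associated with the Lebesgue measure, so that the conditional is a ratio of interval lengths; the function names are mine. It also shows why no unconditional probability emerges: conditioning on (-n,n) and letting n grow sends the ratio to zero.

```python
def interval_length(lo, hi):
    """Lebesgue measure of the open interval (lo, hi), zero if empty."""
    return max(0.0, hi - lo)

def renyi_conditional(a, b, c, d):
    """P(a < xi < b | c < xi < d) when xi carries the Lebesgue measure:
    length of the overlap divided by the (finite, positive) length of
    the conditioning interval."""
    if not d > c:
        raise ValueError("conditioning interval must have positive length")
    overlap = interval_length(max(a, c), min(b, d))
    return overlap / (d - c)

print(renyi_conditional(0.0, 1.0, 0.0, 4.0))
# conditioning on ever larger sets: the ratio vanishes as n grows
for n in (10, 100, 1000):
    print(n, renyi_conditional(0.0, 1.0, -n, n))
```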

## a new paradigm for improper priors

Posted in Books, pictures, Statistics, Travel with tags Alfréd Rényi, Andrei Kolmogorov, axioms of probability, convergence of Gibbs samplers, improper priors, σ-algebra, marginalisation paradoxes, Norway, Trondheim on November 6, 2017 by xi'an

**G**unnar Taraldsen and co-authors have arXived a short note on using improper priors from a new perspective. Generalising an earlier 2016 paper in JSPI on the same topic. Which both relate to a concept introduced by Rényi (who himself attributes the idea to Kolmogorov). Namely that random variables are to be associated with arbitrary measures [not necessarily σ-finite measures, the latter defining σ-finite random variables], rather than only with those of total mass one. Which allows for an alternate notion of conditional probability in the case of σ-finite random variables, with the perk that this conditional probability distribution is itself of mass 1 (a.e.). Which we know happens when moving from prior to proper posterior.

I remain puzzled by the 2016 paper though as I do not follow the meaning of a *random variable* associated with an *infinite mass probability measure*. If the point is limited to constructing posterior probability distributions associated with improper priors, there is little value in doing so. The argument in the 2016 paper is however that one can then define a conditional distribution in marginalisation paradoxes à la Stone, Dawid and Zidek (1973) where the marginal does not exist. Solving with this formalism the said marginalisation paradoxes as conditional distributions are only defined for σ-finite random variables. Which gives a fairly different conclusion from either Stone, Dawid and Zidek (1973) [with whom I agree, namely that there is no paradox because there is no “joint” distribution] or Jaynes (1973) [with whom I agree less!, in that the use of an invariant measure to make the discrepancy go away is not a particularly strong argument in favour of this measure]. The 2016 paper also draws an interesting connection with the study by Jim Hobert and George Casella (in Jim’s thesis) of [null recurrent or transient] Gibbs samplers with no joint [proper] distribution. Which in some situations can produce proper subchains, a phenomenon later exhibited by Alan Gelfand and Sujit Sahu (and Xiao-Li Meng as well if I correctly remember!). But I see no advantage in following this formalism, as it does not impact whether the chain is transient or null recurrent, or anything connected with its implementation. Plus a link to the approximation of improper priors by sequences of proper ones by Bioche and Druilhet I discussed a while ago.

## priors without likelihoods are like sloths without…

Posted in Books, Statistics with tags Austin, Bayes factors, Bayesian Analysis, identifiability, improper priors, noninformative priors, O'Bayes17, Pierre Simon Laplace, posterior predictive, reference priors, sloth, The American Statistician, The University of Texas at Austin on September 11, 2017 by xi'an

“The idea of building priors that generate reasonable data may seem like an unusual idea…”

**A**ndrew, Dan, and Michael arXived an opinion piece last week entitled “The prior can generally only be understood in the context of the likelihood”. Which connects to the earlier Read Paper of Gelman and Hennig I discussed last year. I cannot state strong disagreement with the positions taken in this piece, actually, in that I do not think prior distributions ever occur as *a given* but are rather chosen as a reference measure to probabilise the parameter space and eventually prioritise regions over others. If anything I find myself even further on the prior agnosticism gradation. (Of course, this lack of disagreement applies to the likelihood understood as a function of both the data and the parameter, rather than of the parameter only, conditional on the data. Priors cannot depend on the data without incurring disastrous consequences!)

“…it contradicts the conceptual principle that the prior distribution should convey only information that is available before the data have been collected.”

The first example is somewhat disappointing in that it revolves, as in so many Bayesian textbooks (since Laplace!), around the [sex ratio] Binomial probability parameter and concludes on the strong or long-lasting impact of the Uniform prior. I do not see much of a contradiction between the use of a Uniform prior and the collection of prior information, if only because there is no standardised way to transfer prior information into prior construction. And more fundamentally because a parameter rarely makes sense by itself, alone, without a model that relates it to potential data. As for instance in a regression model. Moreover, following my epiphany of last semester about the relativity of the prior, I see no harm in the prior being relevant, as I only attach a *relative* meaning to statements based on the posterior. Rather than trying to limit the impact of a prior, we should rather build assessment tools to measure this impact, for instance by prior predictive simulations. And this is where I come to quite agree with the authors.
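A prior predictive simulation of this kind takes a few lines. The sketch below is my own toy version, not the authors’ code: it assumes a Binomial(n, θ) count with a Beta(a, b) prior, compares the Uniform prior Beta(1,1) with a hypothetical concentrated Beta(50,50), and uses arbitrary values for n, the number of draws and the seed.

```python
import random
import statistics

def prior_predictive_counts(a, b, n, draws, seed=0):
    """Simulate counts from the prior predictive of a Binomial(n, theta)
    model with theta ~ Beta(a, b): draw theta, then draw the count."""
    rng = random.Random(seed)
    counts = []
    for _ in range(draws):
        theta = rng.betavariate(a, b)
        # Binomial draw as a sum of n Bernoulli(theta) indicators
        y = sum(rng.random() < theta for _ in range(n))
        counts.append(y)
    return counts

flat = prior_predictive_counts(1.0, 1.0, n=20, draws=5000)
tight = prior_predictive_counts(50.0, 50.0, n=20, draws=5000)
print("Uniform prior:", statistics.mean(flat), statistics.pstdev(flat))
print("Beta(50,50)  :", statistics.mean(tight), statistics.pstdev(tight))
```

Under the Uniform prior every count from 0 to n is equally likely a priori, so the simulated data are far more dispersed than under the concentrated prior, and that difference in generated data is what one gets to judge.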

“…non-identifiabilities, and near nonidentifiabilites, of complex models can lead to unexpected amounts of weight being given to certain aspects of the prior.”

Another rather straightforward remark is that non-identifiable models see the impact of a prior remain as the sample size grows. And I still see no issue with this fact in a relative approach. When the authors mention (p.7) that purely mathematical priors perform more poorly than weakly informative priors it is hard to see what they mean by this “performance”.
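The persistence of the prior under non-identifiability can be seen in closed form on a toy model of my choosing (not one from the piece): n observations from N(a+b, 1) with independent N(0, τ²) priors on a and b. The likelihood only involves a+b, whose prior variance is 2τ² and which is a priori independent of a-b, so the posterior variance of the sum shrinks with n while that of the difference stays at its prior value.

```python
def posterior_variances(n, tau2=1.0):
    """Posterior variances of (a + b, a - b) for y_1..y_n ~ N(a + b, 1)
    with independent priors a, b ~ N(0, tau2). Conjugate Normal update
    on the sum; the difference is never updated by the data."""
    prior_var = 2.0 * tau2            # prior variance of a+b and of a-b
    var_sum = 1.0 / (1.0 / prior_var + n)
    var_diff = prior_var              # likelihood is flat in a-b
    return var_sum, var_diff

for n in (0, 10, 1000, 100000):
    print(n, posterior_variances(n))
```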

“…judge a prior by examining the data generating processes it favors and disfavors.”

Besides those points, I completely agree with them about the fundamental relevance of the prior as a generative process, only when the likelihood becomes available. And simulatable. (This point is found in many references, including our response to the American Statistician paper *Hidden dangers of specifying noninformative priors*, with Kaniav Kamary. With the same illustration on a logistic regression.) I also agree with their criticism of the marginal likelihood and Bayes factors as being so strongly impacted by the choice of a prior, if treated as absolute quantities. I also, if more reluctantly and somewhat heretically, see a point in using the posterior predictive for assessing whether a prior is relevant for the data at hand. At least at a conceptual level. I am however less certain about how to handle improper priors based on their recommendations. In conclusion, it would be great to see one [or more] of the authors at O-Bayes 2017 in Austin as I am sure it would spark nice discussions there! (And by the way I have no prior idea on how to conclude the comparison in the title!)

## Greek variations on power-expected-posterior priors

Posted in Books, Statistics, University life with tags Athens University of Economics and Business, g-prior, Greece, improper priors, objective Bayes, power posterior, Power-Expected-Posterior Priors on October 5, 2016 by xi'an

**D**imitris Fouskakis, Ioannis Ntzoufras and Konstantinos Perrakis, from Athens, have just arXived a paper on power-expected-posterior priors. Just like the power prior and the expected-posterior prior, this approach aims at avoiding improper priors by the use of imaginary data, whose distribution is itself the marginal against another prior. (In the papers I wrote on that topic with Juan Antonio Cano and Diego Salmerón, we used MCMC to figure out a fixed point for such priors.)

The current paper (which I only perused) studies properties of two versions of power-expected-posterior priors proposed in an earlier paper by the same authors. For the normal linear model. Using a posterior derived from an unnormalised powered likelihood either (DR) integrated in the imaginary data against the prior predictive distribution of the reference model based on the powered likelihood, or (CR) integrated in the imaginary data against the prior predictive distribution of the reference model based on the actual likelihood. The baseline model being the G-prior with g=n². Both versions lead to a marginal likelihood that is similar to BIC and hence consistent. The DR version coincides with the original power-expected-posterior prior in the linear case. The CR version involves a change of covariance matrix. All in all, the CR version tends to favour less complex models, but is less parsimonious as a variable selection tool, which sounds a wee bit contradictory. Overall, I thus feel (possibly incorrectly) that the paper is more an appendix to the earlier paper than a paper in itself, as I do not get in the end a clear impression of which method should be preferred.