Archive for measure theory

MCMC on zero measure sets

Posted in R, Statistics on March 24, 2014 by xi'an

Simulating a bivariate normal under the constraint (or conditional on the fact) that x²-y²=1 (a non-linear zero-measure curve in the two-dimensional Euclidean space) is not that easy: if one runs a random walk along that curve (by running a random walk on y, deducing x from x²=y²+1, and accepting with a Metropolis-Hastings ratio based on the bivariate normal density), the outcome differs from the target predicted by a change of variable and the proper derivation of the conditional. The graph produced by the R code below illustrates the discrepancy!


T = 1e4; y = numeric(T)                       # random walk on y, x deduced as sqrt(1+y^2)
for (t in 2:T){
  prop = y[t-1] + rnorm(1, sd = .5)
  ace = (runif(1) < exp(y[t-1]^2 - prop^2))   # ratio of bivariate normal densities on the curve
  y[t] = if (ace) prop else y[t-1]}

If instead we add the proper Jacobian to the Metropolis-Hastings ratio, the fit is there. My open question is how to make this derivation generic, i.e. without requiring the (dreaded) computation of the (dreadful) Jacobian.
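A minimal sketch of such a Jacobian correction (my own reconstruction, not necessarily the original code): parameterising the curve by y, with x = sqrt(1+y²), the arc-length element is sqrt((1+2y²)/(1+y²)) dy, and this factor enters the acceptance ratio.

```r
# hedged sketch: curve parameterised by y, x = sqrt(1+y^2),
# with arc-length Jacobian J(y) = sqrt((1+2*y^2)/(1+y^2))
jac = function(y) sqrt((1 + 2*y^2)/(1 + y^2))
T = 1e4; y = numeric(T)
for (t in 2:T){
  prop = y[t-1] + rnorm(1, sd = .5)
  ratio = exp(y[t-1]^2 - prop^2) * jac(prop)/jac(y[t-1])
  y[t] = if (runif(1) < ratio) prop else y[t-1]
}
x = sqrt(1 + y^2)   # the pairs (x, y) then live on the curve x^2 - y^2 = 1
```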


testing via credible sets

Posted in Statistics, University life on October 8, 2012 by xi'an

Måns Thulin released today an arXiv document on some decision-theoretic justifications for [running] Bayesian hypothesis testing through credible sets. His main point is that the unnatural prior setting mass on a point-null hypothesis can be avoided by rejecting the null when the point-null value of the parameter does not belong to the credible interval, and that this decision procedure can be validated through the use of special loss functions. While I stress to my students that point-null hypotheses are very unnatural and should be avoided at all cost, and also that constructing a confidence interval is not the same as designing a test (the former assesses the precision of the estimation, while the latter opposes two different and even incompatible models), let us consider Måns' arguments for their own sake.

The idea of the paper is that there exist loss functions for testing point-null hypotheses that lead to HPD, symmetric, and one-sided intervals as acceptance regions, depending on the loss function. This was already found in Pereira & Stern (1999). The issue with these loss functions is that they involve the corresponding credible sets in their definition, hence are somehow tautological. For instance, when considering the HPD set, with T(x) the largest HPD set not containing the point-null value of the parameter, the corresponding loss function is

L(\theta,\varphi,x) = \begin{cases}a\mathbb{I}_{T(x)^c}(\theta) &\text{when }\varphi=0\\ b+c\mathbb{I}_{T(x)}(\theta) &\text{when }\varphi=1\end{cases}

parameterised by the constants a, b, and c, and hence depending on the HPD region T(x).
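As a toy illustration (mine, not the paper's) of the decision rule "reject the point null when it falls outside the credible set", assume a normal posterior, for which the 95% HPD region is the symmetric interval:

```r
# toy sketch (assumed normal posterior theta | x ~ N(mu, s^2)):
# reject H0: theta = theta0 when theta0 falls outside the (1-alpha) HPD interval
hpd_test = function(theta0, mu, s, alpha = .05){
  z = qnorm(1 - alpha/2)
  (theta0 < mu - z*s) | (theta0 > mu + z*s)   # TRUE = reject the point null
}
hpd_test(0, mu = 1.5, s = 1)   # FALSE: 0 lies inside [-0.46, 3.46]
hpd_test(0, mu = 3.0, s = 1)   # TRUE: 0 lies outside [1.04, 4.96]
```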

Måns then introduces new loss functions that do not depend on x and still lead to either the symmetric or the one-sided credible acceptance regions. However, one test actually has two different alternatives (Theorem 2), which makes it essentially a composition of two one-sided tests, while the other test brings the decision back to a one-sided test (Theorem 3), so even at this face-value level, I do not find the result that convincing. (For the one-sided test, George Casella and Roger Berger (1986) established links between Bayesian posterior probabilities and frequentist p-values.) Both Theorem 3 and the last result of the paper (Theorem 4) use a generic, set-free and observation-free loss function (related to eqn. (5.2.1) in my book!, as quoted by the paper) but (and this is a big but) they only hold for prior distributions setting (prior) mass on both the null and the alternative. Otherwise, the solution is to always reject the hypothesis with zero prior probability… This is actually an interesting argument in the why-are-credible-sets-unsuitable-for-testing debate, as it cannot bypass the introduction of a prior mass on Θ0!

Overall, I furthermore consider that a decision-theoretic approach to testing should encompass future steps rather than focussing on the reply to the (admittedly dumb) question "is θ zero?". Therefore, it must have both plan A and plan B at the ready, which means preparing (and using!) prior distributions under both hypotheses, even for point-null hypotheses.

Now, after I wrote the above, I came upon a Stack Exchange page initiated by Måns last July. This is presumably not the first time a paper stems from Stack Exchange, but it is a fairly interesting outcome: thanks to the debate on his question, Måns managed to get a coherent manuscript written. Great! (In a sense, this reminded me of the polymath experiments of Terry Tao, Timothy Gowers and others. Meaning that maybe most contributors could have become coauthors of the paper!)

optimal direction Gibbs

Posted in Statistics, University life on May 29, 2012 by xi'an

An interesting paper appeared on arXiv today. Entitled On optimal direction Gibbs sampling, by Andrés Christen, Colin Fox, Diego Andrés Pérez-Ruiz and Mario Santana-Cibrian, it defines optimality as picking the direction that brings the maximum independence between two successive realisations of the Gibbs sampler. More precisely, it aims at choosing the direction e that minimises the mutual information criterion

\int\int f_{Y,X}(y,x)\log\dfrac{f_{Y,X}(y,x)}{f_Y(y)f_X(x)}\,\text{d}x\,\text{d}y

I have a bit of an issue with this choice because it clashes with measure theory. Indeed, in one Gibbs step associated with e, the transition kernel is defined in terms of the Lebesgue measure over the line induced by e. Hence the joint density of the pair of successive realisations is defined with respect to the product of the Lebesgue measure on the overall space and the Lebesgue measure over the line induced by e, while the product in the denominator is defined with respect to the product of the Lebesgue measure on the overall space with itself. The two densities are therefore not comparable, since they are not defined against equivalent measures… The difference between numerator and denominator is actually clearly expressed in the normal example (page 3), where the chain operates over an n-dimensional space but the conditional distribution of the next realisation is one-dimensional, and thus does not relate to the multivariate normal target in the denominator. I therefore do not agree with the derivation of the mutual information subsequently produced as (3).

The above difficulty is indirectly perceived by the authors, who note “we cannot simply choose the best direction: the resulting Gibbs sampler would not be irreducible” (page 5), an objection I had from an earlier page… They instead pick directions at random over the unit sphere and (for the normal case) suggest using a density over those directions such that

h^*(\mathbf{e})\propto(\mathbf{e}^\prime A\mathbf{e})^{1/2}

which cannot truly be called “optimal”.
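For what it is worth, a hypothetical sketch of the resulting sampler for a Gaussian target (my own rendering, not the authors' code): assuming a target N(0, A⁻¹) with precision matrix A, the conditional of the step size along direction e, given the current x, is normal with mean -e'Ax/e'Ae and variance 1/e'Ae, and the directions can be drawn from h*(e) ∝ (e'Ae)^{1/2} by accept-reject against the uniform distribution on the sphere.

```r
# hypothetical sketch: direction Gibbs for a N(0, solve(A)) target,
# with directions drawn from h*(e) ∝ (e'Ae)^{1/2} by accept-reject
A = matrix(c(1, .9, .9, 1), 2, 2)          # precision matrix (assumed example)
M = sqrt(max(eigen(A)$values))             # bound on sqrt(e'Ae) over the unit sphere
T = 1e4; x = matrix(0, T, 2)
for (t in 2:T){
  repeat{                                  # e uniform on the sphere, thinned by sqrt(e'Ae)/M
    e = rnorm(2); e = e/sqrt(sum(e^2))
    if (runif(1) < sqrt(sum(e*(A %*% e)))/M) break
  }
  eAe = sum(e*(A %*% e)); eAx = sum(e*(A %*% x[t-1,]))
  s = rnorm(1, -eAx/eAe, 1/sqrt(eAe))      # conditional of the step along e
  x[t,] = x[t-1,] + s*e
}
```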

More globally, searching for "optimal" directions (or, more generally, transforms) is quite a worthwhile idea, especially when linked with adaptive strategies…

Bayesian ideas and data analysis

Posted in Books, R, Statistics, Travel, University life on October 31, 2011 by xi'an

Here is [yet!] another Bayesian textbook that appeared recently. I read it in the past few days and, despite my obvious biases and prejudices, I liked it very much! It has a lot in common (at least in spirit) with our Bayesian Core, which may explain why I feel so benevolent towards Bayesian ideas and data analysis. Just like ours, the book by Ron Christensen, Wes Johnson, Adam Branscum, and Timothy Hanson is indeed focused on explaining the Bayesian ideas through (real) examples and it covers a lot of regression models, all the way to non-parametrics. It contains a good proportion of WinBUGS and R code. It intermingles methodology and computational chapters in the first part, before moving to the serious business of analysing more and more complex regression models. Exercises appear throughout the text rather than at the end of the chapters. As their book is longer (over 500 pages), the authors spend more time on analysing various datasets in each chapter and, more importantly, provide a rather unique entry on prior assessment and construction, especially in the regression chapters. The author index is rather original in that it links the authors with more than one entry to the topics they are connected with (Ron Christensen winning the game with the highest number of entries).

Jaynes’ marginalisation paradox

Posted in Books, Statistics, University life on June 13, 2011 by xi'an

After delivering my one-day lecture on Jaynes' Probability Theory, I assigned the students to write their own analysis of Chapter 15 (Paradoxes of probability theory), given its extensive and exciting coverage of the marginalisation paradoxes and my omission of it in the lecture notes… Up to now, only Jean-Bernard Salomon has returned a (good albeit short) synthesis of the chapter, seemingly siding with Jaynes' analysis that a "good" noninformative prior should avoid the paradox. (In short, my own view of the problem is to side with Dawid, Stone, and Zidek, in that the paradox is only a paradox when marginals of infinite measures are interpreted as if they were probability marginals…) This made me wonder if there could be a squared marginalisation paradox: find a statistical model parameterised by θ, with a nuisance parameter η=η(θ), such that the prior on η solving the marginalisation paradox when the parameter of interest is ξ=ξ(θ) is not the same as when the parameter of interest is ζ=ζ(θ). [I have not given the problem more than a few seconds' thought, so this may prove a logical impossibility!]

Frequency vs. probability

Posted in Statistics on May 6, 2011 by xi'an

“Probabilities obtained by maximum entropy cannot be relevant to physical predictions because they have nothing to do with frequencies.” E.T. Jaynes, PT, p.366

“A frequency is a factual property of the real world that we measure or estimate. The phrase `estimating a probability’ is just as much an incongruity as `assigning a frequency’. The fundamental, inescapable distinction between probability and frequency lies in this relativity principle: probabilities change when we change our state of knowledge, frequencies do not.” E.T. Jaynes, PT, p.292

A few days ago, I had the following email exchange with Jelle Wybe de Jong from The Netherlands:

Q. I have a question regarding the slides of your presentation of Jaynes’ Probability Theory. You used the [above second] quote: do you agree with this statement? It seems to me that a lot of ‘Bayesians’ still refer to ‘estimating’ probabilities. Does it make sense, for example, for a bank to estimate a probability of default for their loan portfolio? Or does it only make sense to estimate a default frequency and summarise the uncertainty (state of knowledge) through the posterior?

MAP, MLE and loss

Posted in Statistics on April 25, 2011 by xi'an

Michael Evans and Gun Ho Jang posted an arXiv paper where they discuss the connection between MAP estimators, least relative surprise (or maximum profile likelihood) estimators, and loss functions. I posted a while ago my perspective on MAP estimators, followed by several comments on the Bayesian nature of those estimators, hence will not reproduce them here, but the core of the matter is that neither MAP estimators nor MLEs are really justified by a decision-theoretic approach, at least in a continuous parameter space. Furthermore, the dominating measure [arbitrarily] chosen on the parameter space impacts the value of the MAP, as demonstrated by Druilhet and Marin (2007).
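A toy numerical illustration of this dependence on the dominating measure (my own example, not from either paper): the mode of a Beta(2,5) posterior density in p differs from the estimate obtained by maximising in the logit parameterisation and mapping back.

```r
# toy example: the "MAP" of a Beta(a,b) posterior depends on the parameterisation
a = 2; b = 5
map_p = (a - 1)/(a + b - 2)       # mode of the Beta(a,b) density in p
# under theta = logit(p), the density picks up the Jacobian p(1-p),
# i.e. becomes proportional to p^a (1-p)^b, whose mode in p is a/(a+b)
map_from_theta = a/(a + b)        # maps back to a different point estimate
c(map_p, map_from_theta)          # 0.2 versus ~0.286
```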


