Archive for Dickey-Savage ratio

Astrostatistics school

Posted in Mountains, pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , on October 17, 2017 by xi'an

What a wonderful week at the Astrostat [Indian] summer school in Autrans! The setting was superb, on the high Vercors plateau overlooking both Grenoble [north] and Valence [west], with the colours of the Fall at their brightest on the foliage of the forests rising on both sides of the valley and a perfect green on the fields at the centre, with sun all along, sharp mornings and warm afternoons worthy of a late Indian summer, too many running trails [turning into X country ski trails in the Winter] to contemplate for a single week [even with three hours of running over two days], many climbing sites on the numerous chalk cliffs all around [but a single afternoon for that, more later in another post!]. And of course a group of participants eager to learn about Bayesian methodology and computational algorithms, from diverse [astronomy, cosmology and more] backgrounds, trainings and countries. I was surprised at the dedication of the participants travelling all the way from Chile, Péru, and Hong Kong for the sole purpose of attending the school. David van Dyk gave the first part of the school on Bayesian concepts and MCMC methods, Roberto Trotta the second part on Bayesian model choice and hierarchical models, and myself a third part on, surprise, surprise!, approximate Bayesian computation. Plus practicals on R.

As it happens Roberto had to cancel his participation and I turned for a session into Christian Roberto, presenting his slides in the most objective possible fashion!, as a significant part covered nested sampling and Savage-Dickey ratios, not exactly my favourites for estimating constants. David joked that he was considering postponing his flight to see me talk about these, but I hope I refrained from engaging into controversy and criticisms… If anything because this was not of interest for the participants. Indeed when I started presenting ABC through what I thought was a pedestrian example, namely Rasmus Baath’s socks, I found that the main concern was not running an MCMC sampler or a substitute ABC algorithm but rather an healthy questioning of the construction of the informative prior in that artificial setting, which made me quite glad I had planned to cover this example rather than an advanced model [as, e.g., one of those covered in the packages abc, abctools, or abcrf]. Because it generated those questions about the prior [why a Negative Binomial? why these hyperparameters? &tc.] and showed how programming ABC turned into a difficult exercise even in this toy setting. And while I wanted to give my usual warning about ABC model choice and argue for random forests as a summary selection tool, I feel I should have focussed instead on another example, as this exercise brings out so clearly the conceptual difficulties with what is taught. Making me quite sorry I had to leave one day earlier. [As did missing an extra run!] Coming back by train through the sunny and grape-covered slopes of Burgundy hills was an extra reward [and no one in the train commented about the local cheese travelling in my bag!]


Using MCMC output to efficiently estimate Bayes factors

Posted in Books, R, Statistics, University life with tags , , , , on May 19, 2016 by xi'an

As I was checking for software to answer a query on X validated about generic Bayes factor derivation, I came across an R software called BayesFactor, which only applies in regression settings and relies on the Savage-Dickey representation of the Bayes factor


when the null hypothesis writes as θ=θ⁰ (and possibly additional nuisance parameters with [roughly speaking] an independent prior). As we discussed in our paper with Jean-Michel Marin [which got ignored by large!], this representation of the Bayes factor is based on picking a very specific version of the prior, or more exactly of three prior densities. Assuming such versions are selected, I wonder at the performances of this approximation, given that it involves approximating the marginal posterior at θ⁰….

“To ensure that the Bayes factor we compute using the Savage–Dickey ratio is the the ratio of marginal densities that we intend, the condition (…) is easily met by models which specify priors in which the nuisance parameters are independent of the parameters of interest.” Morey et al. (2011)

First, when reading Morey at al. (2011), I realised (a wee bit late!) that Chib’s method is nothing but a version of the Savage-Dickey representation when the marginal posterior can be estimated in a parametric (Rao-Blackwellised) way. However, outside hierarchical models based on conjugate priors such parametric approximations are intractable and non-parametric versions must be invoked instead, which necessarily degrades the quality of the method. A degradation that escalates with the dimension of the parameter θ. In addition, I am somewhat perplexed by the use of a Rao-Blackwell argument in the setting of the Dickey-Savage representation. Indeed this representation assumes that

\pi_1(\psi|\theta_0)=\pi_0(\psi) \  \ \text{or}\quad \pi_1(\theta_0,\psi)=\pi_1(\theta_0)\pi_0(\psi)

which means that [the specific version of] the conditional density of θ⁰ given ψ should not depend on the nuisance parameter. But relying on a Rao-Blackwellisation leads to estimate the marginal posterior via full conditionals. Of course, θ given ψ and y may depend on ψ, but still… Morey at al. (2011) advocate the recourse to Chib’s formula as optimal but this obviously requires the full conditional to be available. They acknowledge this point as moot, since it is sufficient from their perspective to specify a conjugate prior. They consider this to be a slight modification of the model (p.377). However, I see the evaluation of an estimated density at a single (I repeat, single!) point as being the direst part of the method as it is clearly more sensitive to approximations that the evaluation of a whole integral, since the later incorporates an averaging effect by definition. Hence, even if this method was truly available for all models, I would be uncertain of its worth when compared with other methods, except the harmonic mean estimator of course!

On the side, Morey at al. (2011) study a simple one-sample t test where they use an improper prior on the nuisance parameter σ, under both models. While the Savage-Dickey representation is correct in this special case, I fail to see why the identity would apply in every case under an improper prior. In particular, independence does not make sense with improper priors. The authors also indicate the possible use of this Bayes factor approximation for encompassing models. At first, I thought this could be most useful in our testing by mixture framework where we define an encompassing model as a mixture. However, I quickly realised that using a Beta Be(a,a) prior on the weight α with a<1 leads to an infinite density value at both zero and one, hence cannot be compatible with a Savage-Dickey representation of the Bayes factor.

Bayesian model averaging in astrophysics [guest post]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , , on August 12, 2015 by xi'an

.[Following my posting of a misfiled 2013 blog, Ewan Cameron told me of the impact of this paper in starting his own blog and I asked him for a guest post, resulting in this analysis, much deeper than mine. No warning necessary this time!]

Back in February 2013 when “Bayesian Model Averaging in Astrophysics: A Review” by Parkinson & Liddle (hereafter PL13) first appeared on the arXiv I was a keen, young(ish) postdoc eager to get stuck into debates about anything and everything ‘astro-statistical’. And with its seemingly glaring flaws, PL13 was more grist to the mill. However, despite my best efforts on various forums I couldn’t get a decent fight started over the right way to do model averaging (BMA) in astronomy, so out of sheer frustration two months later I made my own soapbox to shout from at Another Astrostatistics Blog. Having seen PL13 reviewed recently here on Xian’s Og it feels like the right time to revisit the subject and reflect on where BMA in astronomy is today.

As pointed out to me back in 2013 by Tom Loredo, the act of Bayesian model averaging has been around much longer than its name; indeed an early astronomical example appears in Gregory & Loredo (1992) in which the posterior mean representation of an unknown signal is constructed for an astronomical “light-curve”, averaging over a set of constant and periodic candidate models. Nevertheless the wider popularisation of model averaging in astronomy has only recently taken place through a variety of applications in cosmology: e.g. Liddle, Mukherjee, Parkinson & Wang (2006) and Vardanyan, Trotta & Silk (2011).

In contrast to earlier studies like Gregory & Loredo (1992)—or the classic review on BMA by Hoeting et al. (1999)—in which the target of model averaging is typically either a utility function, a set of future observations, or a latent parameter of the observational process (e.g. the unknown “light-curve” shape) shared naturally by all competing models, the proposal of cosmological BMA studies is to produce a model-averaged version of the posterior for a given ‘shared’ parameter: a so-called “model-averaged PDF”. This proposal didn’t sit well with me back in 2013, and it still doesn’t sit well with me today. Philosophically: without a model a parameter has no meaning, so why should we want to seek meaning in the marginalised distribution of a parameter over an entire set of models? And, practically: to put it another way, without knowing the model ‘label’ to which a given mass of model-averaged parameter probability belongs there’s nothing much useful we can do with this ‘PDF’: nothing much we can say about the data we’ve just analysed and nothing much we can say about future experiments. Whereas the space of the observed data is shared automatically by all competing models, it seems to me to be somehow “un-Bayesian” to place the further restriction that the parameters of separate models share the same scale and topology. I say “un-Bayesian” since this mode of model averaging suggests a formulation of the parameter space + prior pairing stronger than the statement of one’s prior beliefs for the distribution of observable data given the model. But I would be happy to hear arguments from the other side in the comments box below … ! Continue reading

Harold Jeffreys’ default Bayes factor [for psychologists]

Posted in Books, Statistics, University life with tags , , , , , , on January 16, 2015 by xi'an

“One of Jeffreys’ goals was to create default Bayes factors by using prior distributions that obeyed a series of general desiderata.”

The paper Harold Jeffreys’s default Bayes factor hypothesis tests: explanation, extension, and application in Psychology by Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers is both a survey and a reinterpretation cum explanation of Harold Jeffreys‘ views on testing. At about the same time, I received a copy from Alexander and a copy from the journal it had been submitted to! This work starts with a short historical entry on Jeffreys’ work and career, which includes four of his principles, quoted verbatim from the paper:

  1. “scientific progress depends primarily on induction”;
  2. “in order to formalize induction one requires a logic of partial belief” [enters the Bayesian paradigm];
  3. “scientific hypotheses can be assigned prior plausibility in accordance with their complexity” [a.k.a., Occam’s razor];
  4. “classical “Fisherian” p-values are inadequate for the purpose of hypothesis testing”.

“The choice of π(σ)  therefore irrelevant for the Bayes factor as long as we use the same weighting function in both models”

A very relevant point made by the authors is that Jeffreys only considered embedded or nested hypotheses, a fact that allows for having common parameters between models and hence some form of reference prior. Even though (a) I dislike the notion of “common” parameters and (b) I do not think it is entirely legit (I was going to write proper!) from a mathematical viewpoint to use the same (improper) prior on both sides, as discussed in our Statistical Science paper. And in our most recent alternative proposal. The most delicate issue however is to derive a reference prior on the parameter of interest, which is fixed under the null and unknown under the alternative. Hence preventing the use of improper priors. Jeffreys tried to calibrate the corresponding prior by imposing asymptotic consistency under the alternative. And exact indeterminacy under “completely uninformative” data. Unfortunately, this is not a well-defined notion. In the normal example, the authors recall and follow the proposal of Jeffreys to use an improper prior π(σ)∝1/σ on the nuisance parameter and argue in his defence the quote above. I find this argument quite weak because suddenly the prior on σ becomes a weighting function... A notion foreign to the Bayesian cosmology. If we use an improper prior for π(σ), the marginal likelihood on the data is no longer a probability density and I do not buy the argument that one should use the same measure with the same constant both on σ alone [for the nested hypothesis] and on the σ part of (μ,σ) [for the nesting hypothesis]. We are considering two spaces with different dimensions and hence orthogonal measures. This quote thus sounds more like wishful thinking than like a justification. Similarly, the assumption of independence between δ=μ/σ and σ does not make sense for σ-finite measures. Note that the authors later point out that (a) the posterior on σ varies between models despite using the same data [which shows that the parameter σ is far from common to both models!] and (b) the [testing] Cauchy prior on δ is only useful for the testing part and should be replaced with another [estimation] prior when the model has been selected. Which may end up as a backfiring argument about this default choice.

“Each updated weighting function should be interpreted as a posterior in estimating σ within their own context, the model.”

The re-derivation of Jeffreys’ conclusion that a Cauchy prior should be used on δ=μ/σ makes it clear that this choice only proceeds from an imperative of fat tails in the prior, without solving the calibration of the Cauchy scale. (Given the now-available modern computing tools, it would be nice to see the impact of this scale γ on the numerical value of the Bayes factor.) And maybe it also proceeds from a “hidden agenda” to achieve a Bayes factor that solely depends on the t statistic. Although this does not sound like a compelling reason to me, since the t statistic is not sufficient in this setting.

In a differently interesting way, the authors mention the Savage-Dickey ratio (p.16) as a way to represent the Bayes factor for nested models, without necessarily perceiving the mathematical difficulty with this ratio that we pointed out a few years ago. For instance, in the psychology example processed in the paper, the test is between δ=0 and δ≥0; however, if I set π(δ=0)=0 under the alternative prior, which should not matter [from a measure-theoretic perspective where the density is uniquely defined almost everywhere], the Savage-Dickey representation of the Bayes factor returns zero, instead of 9.18!

“In general, the fact that different priors result in different Bayes factors should not come as a surprise.”

The second example detailed in the paper is the test for a zero Gaussian correlation. This is a sort of “ideal case” in that the parameter of interest is between -1 and 1, hence makes the choice of a uniform U(-1,1) easy or easier to argue. Furthermore, the setting is also “ideal” in that the Bayes factor simplifies down into a marginal over the sample correlation only, under the usual Jeffreys priors on means and variances. So we have a second case where the frequentist statistic behind the frequentist test[ing procedure] is also the single (and insufficient) part of the data used in the Bayesian test[ing procedure]. Once again, we are in a setting where Bayesian and frequentist answers are in one-to-one correspondence (at least for a fixed sample size). And where the Bayes factor allows for a closed form through hypergeometric functions. Even in the one-sided case. (This is a result obtained by the authors, not by Jeffreys who, as the proper physicist he was, obtained approximations that are remarkably accurate!)

“The fact that the Bayes factor is independent of the intention with which the data have been collected is of considerable practical importance.”

The authors have a side argument in this section in favour of the Bayes factor against the p-value, namely that the “Bayes factor does not depend on the sampling plan” (p.29), but I find this fairly weak (or tongue in cheek) as the Bayes factor does depend on the sampling distribution imposed on top of the data. It appears that the argument is mostly used to defend sequential testing.

“The Bayes factor (…) balances the tension between parsimony and goodness of fit, (…) against overfitting the data.”

In fine, I liked very much this re-reading of Jeffreys’ approach to testing, maybe the more because I now think we should get away from it! I am not certain it will help in convincing psychologists to adopt Bayes factors for assessing their experiments as it may instead frighten them away. And it does not bring an answer to the vexing issue of the relevance of point null hypotheses. But it constitutes a lucid and innovative of the major advance represented by Jeffreys’ formalisation of Bayesian testing.

generalised Savage-Dickey ratio

Posted in Statistics, University life with tags , , , , , , , , on November 11, 2013 by xi'an

KONICA MINOLTA DIGITAL CAMERAToday, Ewan Cameron arXived a paper that generalises our Robert and Marin (2010) paper on the measure theoretic difficulties (or impossibilities) of the Savage-Dickey ratio and on the possible resolutions. (A paper of mine’s I like very much despite it having neither impact nor quotes, whatsoever! Until this paper.) I met Ewan last year when he was completing a PhD with Tony Pettitt at QUT in astrostatistics, but he also worked did not work on this transdimensional ABC algorithm with application to worm invasion in Northern Alberta (arXive I reviewed last week)… Ewan also runs a blog called Another astrostatistics blog, full of goodies, incl. the one where he announces he moves to… zoology in Oxford! Anyway, this note extends our paper and a mathematically valid Savage-Dickey ratio representation to the case when the posterior distributions have no density against the Lebesgue measure. For instance for Dirichlet processes or Gaussian processes priors. Using generic Radon-Nykodim derivatives instead. The example is somewhat artificial, superimposing a Dirichlet process prior onto the Old faithful benchmark. But this is an interesting entry, worth mentioning, into the computation of Bayes factors. And the elusive nature of the Savage-Dickey ratio representation.

paradoxes in scientific inference

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on November 23, 2012 by xi'an

This CRC Press book was sent to me for review in CHANCE: Paradoxes in Scientific Inference is written by Mark Chang, vice-president of AMAG Pharmaceuticals. The topic of scientific paradoxes is one of my primary interests and I have learned a lot by looking at Lindley-Jeffreys and Savage-Dickey paradoxes. However, I did not find a renewed sense of excitement when reading the book. The very first (and maybe the best!) paradox with Paradoxes in Scientific Inference is that it is a book from the future! Indeed, its copyright year is 2013 (!), although I got it a few months ago. (Not mentioning here the cover mimicking Escher’s “paradoxical” pictures with dices. A sculpture due to Shigeo Fukuda and apparently not quoted in the book. As I do not want to get into another dice cover polemic, I will abstain from further comments!)

Now, getting into a deeper level of criticism (!), I find the book very uneven and overall quite disappointing. (Even missing in its statistical foundations.) Esp. given my initial level of excitement about the topic!

First, there is a tendency to turn everything into a paradox: obviously, when writing a book about paradoxes, everything looks like a paradox! This means bringing into the picture every paradox known to man and then some, i.e., things that are either un-paradoxical (e.g., Gödel’s incompleteness result) or uninteresting in a scientific book (e.g., the birthday paradox, which may be surprising but is far from a paradox!). Fermat’s theorem is also quoted as a paradox, even though there is nothing in the text indicating in which sense it is a paradox. (Or is it because it is simple to express, hard to prove?!) Similarly, Brownian motion is considered a paradox, as “reconcil[ing] the paradox between two of the greatest theories of physics (…): thermodynamics and the kinetic theory of gases” (p.51) For instance, the author considers the MLE being biased to be a paradox (p.117), while omitting the much more substantial “paradox” of the non-existence of unbiased estimators of most parameters—which simply means unbiasedness is irrelevant. Or the other even more puzzling “paradox” that the secondary MLE derived from the likelihood associated with the distribution of a primary MLE may differ from the primary. (My favourite!)

When the null hypothesis is rejected, the p-value is the probability of the type I error.Paradoxes in Scientific Inference (p.105)

The p-value is the conditional probability given H0.” Paradoxes in Scientific Inference (p.106)

Second, the depth of the statistical analysis in the book is often found missing. For instance, Simpson’s paradox is not analysed from a statistical perspective, only reported as a fact. Sticking to statistics, take for instance the discussion of Lindley’s paradox. The author seems to think that the problem is with the different conclusions produced by the frequentist, likelihood, and Bayesian analyses (p.122). This is completely wrong: Lindley’s (or Lindley-Jeffreys‘s) paradox is about the lack of significance of Bayes factors based on improper priors. Similarly, when the likelihood ratio test is introduced, the reference threshold is given as equal to 1 and no mention is later made of compensating for different degrees of freedom/against over-fitting. The discussion about p-values is equally garbled, witness the above quote which (a) conditions upon the rejection and (b) ignores the dependence of the p-value on a realized random variable. Continue reading

Bayesian model selection

Posted in Books, R, Statistics with tags , , , , , , , , , on December 8, 2010 by xi'an

Last week, I received a box of books from the International Statistical Review, for reviewing them. I thus grabbed the one whose title was most appealing to me, namely Bayesian Model Selection and Statistical Modeling by Tomohiro Ando. I am indeed interested in both the nature of testing hypotheses or more accurately of assessing models, as discussed in both my talk at the Seminar of philosophy of mathematics at Université Paris Diderot a few days ago and the post on Murray Aitkin’s alternative, and the computational aspects of the resulting Bayesian procedures, including evidence, the Savage-Dickey paradox, nested sampling, harmonic mean estimators, and more…

After reading through the book, I am alas rather disappointed. What I consider to be innovative or at least “novel” parts with comparison with existing books (like Chen, Shao and Ibrahim, 2000, which remains a reference on this topic) is based on papers written by the author over the past five years and it is mostly a sort of asymptotic Bayes analysis that I do not see as particularly Bayesian, because involving the “true” distribution of the data. The coverage of the existing literature on Bayesian model choice is often incomplete and sometimes misses the point, as discussed below. This is especially true for the computational aspects that are generally mistreated or at least not treated in a way from which a newcomer to the field would benefit. The author often takes complex econometric examples for illustration, which is nice; however, he does not pursue the details far enough for the reader to be able to replicate the study without further reading. (An example is given by the coverage of stochastic volatility in Section 4.5.1, pages 83-84.) The few exercises at the end of each chapter are rather unhelpful, often sounding rather like notes than true problems (an extreme case is Exercise 6 pages 196-197 which introduces the Metropolis-Hastings algorithm within the exercise (although it has already been defined on pages 66-67) and then asks to derive the marginal likelihood estimator. Another such exercise on page 164-165 introduces the theory of DNA microarrays and gene expression in ten lines (which are later repeated verbatim on page 227), then asks to identify marker genes responsible for a certain trait.) The overall feeling after reading this book is thus that the contribution to the field of Bayesian Model Selection and Statistical Modeling is too limited and disorganised for the book to be recommended as “helping you choose the right Bayesian model” (backcover).

Continue reading