## Archive for Bayes factors

Posted in Books, Statistics with tags , , , , , , , , on March 23, 2013 by xi'an

This morning session at the workshop Recent Advances in statistical inference: theory and case studies was a true blessing for anyone working in Bayesian model choice! And it did give me ideas to complete my current paper on the Jeffreys-Lindley paradox, and more. Attending the talks in the historical Gioachino Rossini room of the fabulous Café Pedrocchi with the Italian spring blue sky as a background surely helped! (It is only beaten by this room of Ca’Foscari overlooking the Gran Canale where we had a workshop last Fall…)

First, Phil Dawid gave a talk on his current work with Monica Musio (who gave a preliminary talk on this in Venezia last fall) on the use of new score functions to compare statistical models. While the regular Bayes factor is based on the log score, comparing the logs of the predictives at the observed data, different functions of the predictive q can be used, like the Hyvärinen score

$S(x,q)=\Delta\sqrt{q(x)}\big/\sqrt{q(x)}$

which offers the immense advantage of being independent of the normalising constant and hence can also be used for improper priors. As written above, a very deep finding that could at last allow for the comparison of models based on improper priors without requiring convoluted constructions (see below) to make the “constants meet”. I first thought the technique was suffering from the same shortcoming as Murray Aitkin’s integrated likelihood, but I eventually figured out (where) I was wrong!

The second talk was given by Ed George, who spoke on his recent research with Veronika Rocková dealing with variable selection via an EM algorithm that proceeds much much faster to the optimal collection of variables, when compared with the DMVS solution of George and McCulloch (JASA, 1993). (I remember discussing this paper with Ed in Laramie during the IMS meeting in the summer of 1993.) This resurgence of the EM algorithm in this framework is both surprising (as the missing data structure represented by the variable indicators could have been exploited much earlier) and exciting, because it opens a new way to explore the most likely models in this variable selection setting and to eventually produce the median model of Berger and Barbieri (Annals of Statistics, 2004). In addition, this approach allows for a fast comparison of prior modellings on the missing variable indicators, showing in some examples a definitive improvement brought by a Markov random field structure. Given that it also produces a marginal posterior density on the indicators, values of hyperparameters can be assessed, escaping the Jeffreys-Lindley paradox (which was clearly a central piece of today’s talks and discussions). I would like to see more details on the MRF part, as I wonder which structure is part of the input and which one is part of the inference.

The third talk of the morning was Susie Bayarri’s, about a collection of desiderata or criteria for building an objective prior in model comparison and achieving a manageable closed-form solution in the case of the normal linear model. While I somehow disagree with the information criterion, which states that the divergence of the likelihood ratio should imply a corresponding divergence of the Bayes factor. While I definitely agree with the invariance argument leading to using the same (improper) prior over parameters common to models under comparison, this may sound too much of a trick to outsiders, especially when accounting for the score solution of Dawid and Musio. Overall, though, I liked the outcome of a coherence reference solution for linear models that could clearly be used as a default in this setting, esp. given the availability of an R package called BayesVarSel. (Even if I also like our simpler solution developped in the incoming edition of Bayesian Core, also available in the bayess R package!) In his discussion, Guido Consonni highlighted the philosophical problem of considering “common paramaters”, a perspective I completely subscribe to, even though I think all that matters is the justification of having a common prior over formally equivalent parameters, even though this may sound like a pedantic distinction to many!

Due to a meeting of the scientific committee of the incoming O’Bayes 2013 meeting (in Duke, December, more about this soon!), whose most members were attending this workshop, I missed the beginning of Alan Aggresti’s talk and could not catch up with the central problem he was addressing (the pianist on the street outside started pounding on his instrument as if intent to break it apart!). A pity as problems with contingency tables are certainly of interest to me… By the end of Alan’s talk, I wished someone would shoot the pianist playing outside (even though he was reasonably gifted) as I had gotten a major headache from his background noise. Following Noel Cressie’s talk proved just as difficult, although I could see his point in comparing very diverse predictors for big Data problems without much of a model structure and even less of a  and I decided to call the day off, despite wishing to stay for Eduardo Gutiérrez-Pena’s talk on conjugate predictives and entropies which definitely interested me… Too bad really (blame the pianist!)

## reading classics (#10 and #10bis)

Posted in Books, Statistics, University life with tags , , , , , , , , , on February 28, 2013 by xi'an

Today’s classics seminar was rather special as two students were scheduled to talk. It was even more special as both students had picked (without informing me) the very same article by Berger and Sellke (1987), Testing a point-null hypothesis: the irreconcilability of p-values and evidence, on the (deep?) discrepancies between frequentist p-values and Bayesian posterior probabilities. In connection with the Lindley-Jeffreys paradox. Here are Amira Mziou’s slides:

and Jiahuan Li’s slides:

for comparison.

It was a good exercise to listen to both talks, seeing two perspectives on the same paper, and I hope the students in the class got the idea(s) behind the paper. As you can see, there were obviously repetitions between the talks, including the presentation of the lower bounds for all classes considered by Jim Berger and Tom Sellke, and the overall motivation for the comparison. Maybe as a consequence of my criticisms on the previous talk, both Amira and Jiahuan put some stress on the definitions to formally define the background of the paper. (I love the poetic line: “To prevent having a non-Bayesian reality”, although I am not sure what Amira meant by this…)

I like the connection made therein with the Lindley-Jeffreys paradox since this is the core idea behind the paper. And because I am currently writing a note about the paradox. Obviously, it was hard for the students to take a more remote stand on the reason for the comparison, from questioning .the relevance of testing point null hypotheses and of comparing the numerical values of a p-value with a posterior probability, to expecting asymptotic agreement between a p-value and a Bayes factor when both are convergent quantities, to setting the same weight on both hypotheses, to the ad-hocquery of using a drift on one to equate the p-value with the Bayes factor, to use specific priors like Jeffreys’s (which has the nice feature that it corresponds to g=n in the g-prior,  as discussed in the new edition of Bayesian Core). The students also failed to remark on the fact that the developments were only for real parameters, as the phenomenon (that the lower bound on the posterior probabilities is larger than the p-value) does not happen so universally in larger dimensions.  I would have expected more discussion from the ground, but we still got good questions and comments on a) why 0.05 matters and b) why comparing  p-values and posterior probabilities is relevant. The next paper to be discussed will be Tukey’s piece on the future of statistics.

## a remarkably simple and accurate method for computing the Bayes factor &tc.

Posted in Statistics with tags , , , , , , , , on February 13, 2013 by xi'an

This recent arXiv posting by Martin Weinberg and co-authors was pointed out to me by friends because of its title! It indeed sounded a bit inflated. And also reminded me of old style papers where the title was somehow the abstract. Like An Essay towards Solving a Problem in the Doctrine of Chances… So I had a look at it on my way to Gainesville. The paper starts from the earlier paper by Weinberg (2012) in Bayesian Analysis where he uses an HPD region to determine the Bayes factor by a safe harmonic mean estimator (an idea we already advocated earlier with Jean-Michel Marin in the San Antonio volume and with Darren Wraith in the MaxEnt volume). An extra idea is to try to optimise [against the variance of the resulting evidence] the region over which the integration is performed: “choose a domain that results in the most accurate integral with the smallest number of samples” (p.3). The authors proceed by volume peeling, using some quadrature formula for the posterior coverage of the region, either by Riemann or Lebesgue approximations (p.5). I was fairly lost at this stage and the third proposal based on adaptively managing hyperrectangles (p.7) went completely over my head! The sentence “the results are clearly worse with O() errors, but are still remarkably better for high dimensionality”(p.11) did not make sense either… The method may thus be remarkably simple, but the paper is not written in a way that conveys this impression!

## potentially relevant

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , on March 14, 2012 by xi'an

This week, freshly back from Roma, I got the reviews on our paper “Relevant statistics for Bayesian model choice” from Series B. The comments are detailed and mostly to the point, expressing concern about the relevance of the paper for statistical methodology as the major issue.  We are thus asked for a revision making a much better connection with ABC methodology.

This is not an unexpected outcome, from my point of view, because the paper is indeed quite theoretical and the mathematical assumptions required to obtain the convergence theorems are rather overwhelming… Meaning that in practical cases they cannot truly be checked. However, I think we can eventually address those concerns for two distinct reasons: first, the paper comes as a third step in a series of papers where we first identified a sufficiency property, then realised that this property was actually quite a rare occurrence, and finally made a theoretical advance as to when is a summary statistic enough (i.e. “sufficient” in the standard sense of the term!)  to conduct model choice, with a clear answer that the mean ranges of the summary statistic under each model could not intersect.  Second, my own personal view is that those assumptions needed for convergence are not of the highest importance for statistical practice (even though they are needed in the paper!) and thus that, from a methodological point of view, only the conclusion should be taken into account. It is then rather straightforward to come up with (quick-and-dirty) simulation devices to check whether a summary statistic behaves differently under both models, taking advantage of the reference table already available (instead of having to run Monte Carlo experiments with ABC basis)…

One of the comments was that maybe Bayes factors were not appropriate for conducting model choice, thus making the whole derivation irrelevant. This is a possible perspective but it can be objected that Bayes factors and posterior probabilities are used in conjunction with ABC in dozens of genetic papers. Further arguments are provided in the various replies to both of Templeton’s radical criticisms. That more empirical and model-based assessments also are available is quite correct, as demonstrated in the multicriterion approach of Olli Ratmann and co-authors. This is simply another approach, not followed by most geneticists so far…

## dimension reduction in ABC [a review's review]

Posted in Statistics, University life with tags , , , , , , , , , , , on February 27, 2012 by xi'an

What is very apparent from this study is that there is no single `best’ method of dimension reduction for ABC.

Michael Blum, Matt Nunes, Dennis Prangle and Scott Sisson just posted on arXiv a rather long review of dimension reduction methods in ABC, along with a comparison on three specific models. Given that the choice of the vector of summary statistics is presumably the most important single step in an ABC algorithm and as selecting too large a vector is bound to fall victim of the dimension curse, this is a fairly relevant review! Therein, the authors compare regression adjustments à la Beaumont et al.  (2002), subset selection methods, as in Joyce and Marjoram (2008), and projection techniques, as in Fearnhead and Prangle (2012). They add to this impressive battery of methods the potential use of AIC and BIC. (Last year after ABC in London I reported here on the use of the alternative DIC by Francois and Laval, but the paper is not in the bibliography, I wonder why.) An argument (page 22) for using AIC/BIC is that either provides indirect information about the approximation of p(θ|y) by p(θ|s); this does not seem obvious to me.

The paper also suggests a further regularisation of Beaumont et al.  (2002) by ridge regression, although L1 penalty à la Lasso would be more appropriate in my opinion for removing extraneous summary statistics. (I must acknowledge never being a big fan of ridge regression, esp. in the ad hoc version à la Hoerl and Kennard, i.e. in a non-decision theoretic approach where the hyperparameter λ is derived from the data by X-validation, since it then sounds like a poor man’s Bayes/Stein estimate, just like BIC is a first order approximation to regular Bayes factors… Why pay for the copy when you can afford the original?!) Unsurprisingly, ridge regression does better than plain regression in the comparison experiment when there are many almost collinear summary statistics, but an alternative conclusion could be that regression analysis is not that appropriate with  many summary statistics. Indeed, summary statistics are not quantities of interest but data summarising tools towards a better approximation of the posterior at a given computational cost… (I do not get the final comment, page 36, about the relevance of summary statistics for MCMC or SMC algorithms: the criterion should be the best approximation of p(θ|y) which does not depend on the type of algorithm.)

I find it quite exciting to see the development of a new range of ABC papers like this review dedicated to a better derivation of summary statistics in ABC, each with different perspectives and desideratas, as it will help us to understand where ABC works and where it fails, and how we could get beyond ABC…