Archive for AIC

Lindley’s paradox as a loss of resolution

Posted in Books, pictures, Statistics on November 9, 2016 by xi'an

“The principle of indifference states that in the absence of prior information, all mutually exclusive models should be assigned equal prior probability.”

Colin LaMont and Paul Wiggins arXived a paper on Lindley’s paradox a few days ago. The above quote is the (standard) argument for picking the (½,½) partition between the two hypotheses, which I object to, if only because it does not extend to multiple embedded models. The main point in the paper is to argue about the loss of resolution induced by averaging against the prior, as illustrated by the picture above for the N(0,1) versus N(μ,1) toy problem. What they call resolution is the lowest possible mean estimate for which the null is rejected by the Bayes factor (assuming a rejection for Bayes factors larger than 1). While the detail is missing, I presume the different curves on the lower panel correspond to different choices of L when using U(-L,L) priors on μ… The “Bayesian rejoinder” to the Lindley-Bartlett paradox (p.4) is in tune with my interpretation, namely that, as the prior mass under the alternative gets more and more spread out, there is less and less prior support for reasonable values of the parameter, hence a growing tendency to accept the null. This is an illustration of the long-lasting impact of the prior on the posterior probability of the model, because the data cannot impact the tails very much.
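
For readers who prefer numbers to pictures, here is a quick sketch of my own (not taken from the paper) of the toy problem, with the Bayes factor of the null against a U(-L,L) alternative available in closed form: as L grows, the evidence swings towards the null even for a borderline significant sample mean, which is the loss of resolution at stake.

```python
import numpy as np
from scipy import stats

def bayes_factor_null(xbar, n, L):
    """B01 for H0: mu=0 against H1: mu ~ U(-L,L), when xbar ~ N(mu, 1/n)."""
    m0 = stats.norm.pdf(xbar, loc=0.0, scale=1.0 / np.sqrt(n))
    # marginal under H1: integral of N(xbar; mu, 1/n) against the U(-L,L) prior,
    # available in closed form through the normal cdf
    m1 = (stats.norm.cdf((L - xbar) * np.sqrt(n))
          - stats.norm.cdf((-L - xbar) * np.sqrt(n))) / (2.0 * L)
    return m0 / m1

xbar, n = 0.2, 100                  # z = 2, borderline significance at the 5% level
for L in (1, 10, 100, 1000):
    print(f"L = {L:4d}   B01 = {bayes_factor_null(xbar, n, L):9.2f}")
# as L grows, the prior spreads its mass over implausible values of mu and the Bayes
# factor swings towards the null, i.e. the resolution of the test degrades with L
```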

“If the true prior is known, Bayesian inference using the true prior is optimal.”

This sentence and the arguments that follow are meaningless in my opinion, as knowing the “true” prior makes the Bayesian debate superfluous. If there were a unique, Nature-provided, known prior π, it would lose its original meaning and become part of the (frequentist) model. The argument is actually mostly used in the negative, namely that, since the prior is not known, we should not follow a Bayesian approach: this is, e.g., the main criticism in Inferential Models. But there is no such thing as a “true” prior! (Or a “true” model, all things considered!) In the current paper, this pseudo-natural approach to priors is utilised to justify a return to the pseudo-Bayes factors of the 1990s, when one part of the data is used to stabilise and proper-ise the (improper) prior, and a second part to run the test per se. This includes an interesting insight on the limiting cases of partitioning corresponding to AIC and BIC, respectively, that I had not seen before. With the surprising conclusion that “AIC is the derivative of BIC”!

model selection and multiple testing

Posted in Books, pictures, Statistics, Travel, University life on October 23, 2015 by xi'an


Ritabrata Dutta, Malgorzata Bogdan and Jayanta Ghosh recently arXived a survey paper on model selection and multiple testing. Which provides a good opportunity to reflect upon traditional Bayesian approaches to model choice. And potential alternatives. On my way back from Madrid, where I got a bit distracted when flying over the South-West French coast, from Biarritz to Bordeaux. Spotting the lake of Hourtain, where I spent my military training month, 29 years ago!

“On the basis of comparison of AIC and BIC, we suggest tentatively that model selection rules should be used for the purpose for which they were introduced. If they are used for other problems, a fresh justification is desirable. In one case, justification may take the form of a consistency theorem, in the other some sort of oracle inequality. Both may be hard to prove. Then one should have substantial numerical assessment over many different examples.”

The authors quickly replace the Bayes factor with BIC, because the latter is typically consistent. In the comparison between AIC and BIC they mention the conundrum of defining a prior on a nested model from the prior on the nesting model, a problem that has not been properly solved in my opinion. The above quote, with its call to a large simulation study, reminded me of the paper by Arnold & Loeppky about running such studies through ecdfs. That I did not see as solving the issue. The authors also discuss DIC and Lasso, without making much of a connection between those, or with the above. And then reach the parametric empirical Bayes approach to model selection exemplified by Ed George’s and Don Foster’s 2000 paper. Which achieves asymptotic optimality for posterior prediction loss (p.9). And which unifies a wide range of model selection approaches.
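
To make the BIC-as-Bayes-factor connection concrete, here is a small sketch of my own (not the authors') in the one case where everything is available in closed form, a normal mean with a conjugate alternative: the BIC difference tracks the exact log Bayes factor up to a bounded term depending on the prior scale.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
tau2 = 1.0                          # prior variance under H1, an arbitrary choice

def exact_log_BF10(y):
    """H0: mu = 0 versus H1: mu ~ N(0, tau2), for y_1,...,y_n ~ N(mu, 1)."""
    n, ybar = len(y), y.mean()
    # the likelihood depends on mu only through ybar ~ N(mu, 1/n), so the marginal
    # likelihood ratio is N(ybar; 0, tau2 + 1/n) / N(ybar; 0, 1/n)
    return (stats.norm.logpdf(ybar, 0.0, np.sqrt(tau2 + 1.0 / n))
            - stats.norm.logpdf(ybar, 0.0, np.sqrt(1.0 / n)))

def bic_log_BF10(y):
    """-(BIC1 - BIC0)/2, with one extra parameter under H1 and mu_hat = ybar."""
    n, ybar = len(y), y.mean()
    loglik_diff = n * ybar**2 / 2           # loglik at mu_hat minus loglik at 0
    return loglik_diff - 0.5 * np.log(n)

for n in (10, 100, 1000, 10000):
    y = rng.normal(0.3, 1.0, size=n)        # data generated under the alternative
    print(n, round(exact_log_BF10(y), 1), round(bic_log_BF10(y), 1))
# both grow at the same rate, differing by a bounded term involving tau2, which is the
# usual justification for treating BIC as a proxy for the Bayes factor in fixed dimension
```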

A second part of the survey considers the large p setting, where BIC is not a good approximation to the Bayes factor (when testing whether or not all mean entries are zero). And recalls that there are priors ensuring consistency for the Bayes factor in this very [restrictive] case. Then, in Section 4, the authors move to what they call “cross-validatory Bayes factors”, also known as partial Bayes factors and pseudo-Bayes factors, where the data is split to (a) make the improper prior proper and (b) run the comparison or test on the remaining data. They also show the surprising result that, provided the fraction of the data used to proper-ise the prior does not converge to one, the X validated Bayes factor remains consistent [for the special case above]. The last part of the paper concentrates on multiple testing, but is more tentative, conjecturing about convergence results and centring on the differences between full Bayes and empirical Bayes. Then the plane landed in Paris and I stopped my reading, not feeling differently about the topic than when the plane had left Madrid.
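
For the curious, here is what such a cross-validatory Bayes factor looks like in the simplest possible setting, in a toy rendition of my own rather than the survey's: a normal mean with a flat improper prior, where a training split turns the prior into a proper posterior before the test is run on the remaining observations.

```python
import numpy as np
from scipy import stats

def partial_log_BF10(y, n_train):
    """y_i ~ N(mu,1); H0: mu = 0 against H1: mu unknown with a flat improper prior.
    The first n_train observations turn the flat prior into a proper N(mean, 1/n_train)
    distribution, used as the prior when comparing the models on the remaining data."""
    train, test = y[:n_train], y[n_train:]
    m_tr, n_te, ybar_te = train.mean(), len(test), y[n_train:].mean()
    # log marginal of the test sample under H1: the test likelihood factors into a
    # mu-free term and N(ybar_te; mu, 1/n_te), integrated against the training posterior
    log_m1 = (-0.5 * n_te * np.log(2 * np.pi)
              - 0.5 * np.sum((test - ybar_te) ** 2)
              + 0.5 * np.log(2 * np.pi / n_te)
              + stats.norm.logpdf(ybar_te, m_tr, np.sqrt(1.0 / n_train + 1.0 / n_te)))
    # log marginal of the test sample under H0: mu = 0
    log_m0 = np.sum(stats.norm.logpdf(test, 0.0, 1.0))
    return log_m1 - log_m0

rng = np.random.default_rng(0)
y = rng.normal(0.5, 1.0, size=200)           # data generated under the alternative
for frac in (0.1, 0.5, 0.9):
    print(frac, round(partial_log_BF10(y, int(frac * len(y))), 2))
# the answer varies with the fraction of data sacrificed to proper-ise the prior, which is
# precisely the arbitrariness addressed by the consistency result mentioned above
```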

a unified treatment of predictive model comparison

Posted in Books, Statistics, University life on June 16, 2015 by xi'an

“Applying various approximation strategies to the relative predictive performance derived from predictive distributions in frequentist and Bayesian inference yields many of the model comparison techniques ubiquitous in practice, from predictive log loss cross validation to the Bayesian evidence and Bayesian information criteria.”

Michael Betancourt (Warwick) just arXived a paper formalising predictive model comparison in an almost Bourbakian sense! Meaning that he adopts therein a very general representation of the issue, with minimal assumptions on the data generating process (excluding a specific metric and obviously the choice of a testing statistic). He opts for an M-open perspective, meaning that this generating process stands outside the hypothetical statistical model or, in Lindley’s terms, a small world. Within this paradigm, the only way to assess the fit of a model seems to be through the predictive performances of that model. Using for instance an f-divergence like the Kullback-Leibler divergence, with the true generating process as the reference. I think this however puts a restriction on the choice of small worlds, as the probability measure on that small world has to be absolutely continuous wrt the true data generating process for the distance to be finite. While there are arguments in favour of absolutely continuous small worlds, this assumes a knowledge about the true process that we simply cannot gather. Ignoring this difficulty, a relative Kullback-Leibler divergence can be defined in terms of an almost arbitrary reference measure. But as it still relies on the true measure, its evaluation proceeds via cross-validation “tricks” like jackknife and bootstrap. However, on the Bayesian side, using the prior predictive links the Kullback-Leibler divergence with the marginal likelihood. And Michael argues further that the posterior predictive can be seen as the unifying tool behind information criteria like DIC and WAIC (widely applicable information criterion). Which does not convince me of the utility of those criteria as model selection tools, as there is too much freedom in the way the approximations are used and a potential for using the data several times.
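
As a point of reference, and certainly not as a rendering of Michael's formalism, here is the standard sample-based version of WAIC from a matrix of pointwise log-likelihoods evaluated at posterior draws, which is the kind of posterior predictive object these criteria reduce to in practice:

```python
import numpy as np
from scipy.special import logsumexp

def waic(loglik):
    """WAIC from an (S, n) array of log p(y_i | theta_s) over S posterior draws."""
    S = loglik.shape[0]
    lppd = logsumexp(loglik, axis=0) - np.log(S)   # log pointwise predictive density
    p_waic = loglik.var(axis=0, ddof=1)            # pointwise effective number of parameters
    return -2 * (lppd.sum() - p_waic.sum())        # deviance scale, as for AIC and DIC

# toy usage: normal mean model with known variance and a flat prior
rng = np.random.default_rng(2)
y = rng.normal(1.0, 1.0, size=50)
post_mu = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=4000)   # exact posterior draws
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - post_mu[:, None]) ** 2
print(waic(loglik))
```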

An objective prior that unifies objective Bayes and information-based inference

Posted in Books, pictures, Statistics, Travel, University life on June 8, 2015 by xi'an

During the Valencia O’Bayes 2015 meeting, Colin LaMont and Paul Wiggins arXived a paper entitled “An objective prior that unifies objective Bayes and information-based inference”. It would have been interesting to have the authors in Valencia, as they make bold claims about their w-prior being uniformly and maximally uninformative. Plus achieving the unification advertised in the title of the paper. Meaning that the free energy (log transform of the inverse evidence) is the Akaike information criterion.

The paper starts by defining a true prior distribution (presumably in analogy with the true value of the parameter?) and generalised posterior distributions associated with any arbitrary prior. (Some notations are imprecise, check (3) with the wrong denominator or the predictivity that is supposed to cover N new observations on p.2…) It then introduces a discretisation by considering all models within a certain Kullback divergence δ to be indistinguishable. (A definition that does not account for the asymmetry of the Kullback divergence.) From there, it most surprisingly [given the above discretisation] derives a density on the whole parameter space

\pi(\theta) \propto \det I(\theta)^{1/2}\, (N/2\pi\delta)^{K/2}

where N is the number of observations and K the dimension of θ. Dimension which may vary. The dependence of the above on N is a result of using the predictive on N points instead of one. The w-prior is however defined differently: “as the density of indistinguishable models such that the multiplicity is unity for all true models”. Where the log transform of the multiplicity is the expected log marginal likelihood minus the expected log predictive [all expectations under the sampling distributions, conditional on θ]. Rather puzzling in that it involves the “true” value of the parameter (another notational imprecision, since it has to hold for all θ’s) as well as possibly improper priors. When the prior is improper, the log-multiplicity is a difference of two terms such that the first term depends on the constant used with the improper prior, while the second one does not…  Unless the multiplicity constraint also determines the normalising constant?! But this does not seem to be the case when considering the following section on normalising the w-prior. Mentioning a “cutoff” for the integration that seems to pop out of nowhere. Curiouser and curiouser. Due to this unclear handling of infinite mass priors, and since the claimed properties of uniform and maximal uninformativeness are not established in any formal way, and since the existence of a non-asymptotic solution to the multiplicity equation is not demonstrated either, I quickly lost interest in the paper. Which does not contain any worked out example. Read at your own risk!
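
For what it is worth, taking the displayed density at face value in the simplest case, namely the N(θ,1) location model where the Fisher information is identically one, it reduces to

\pi(\theta) \propto (N/2\pi\delta)^{1/2}

that is, a flat prior whose level depends on N and δ but whose total mass over the real line remains infinite, which presumably explains why a cutoff has to enter the picture at the normalisation stage.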

reading classics (#4,5,6)

Posted in Books, Kids, Statistics, University life on December 9, 2013 by xi'an

La Défense from Paris-Dauphine, Nov. 15, 2012

This week, thanks to a lack of clear instructions (from me) to my students in the Reading Classics student seminar, four students showed up with a presentation! Since I had planned for two teaching blocks, three of them managed to fit within the three hours, while the last one kindly agreed to wait until next week to present a paper by David Cox…

The first paper discussed therein was A new look at the statistical model identification, written in 1974 by Hirotugu Akaike. And presenting the AIC criterion. My student Rozan asked to give the presentation in French as he struggled with English, but it was still a challenge for him and he ended up being too close to the paper to provide a proper perspective on why AIC is written the way it is and why it is (potentially) relevant for model selection. And why it is not such a definitive answer to the model selection problem. This is not the simplest paper in the list, to be sure, but some intuition could have been built from the linear model, rather than producing the case of an ARMA(p,q) model without much explanation. (I actually wonder why the penalty for this model is (p+q)/T, rather than (p+q+1)/T for the additional variance parameter.) Or simulations run on the performances of AIC versus other xIC’s…
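
For what it is worth, here is a minimal sketch of the kind of intuition-building simulation I had in mind (mine, obviously not Akaike's), comparing AIC and BIC on nested polynomial regressions with Gaussian noise:

```python
import numpy as np

def gaussian_profile_loglik(y, X):
    """Maximised log-likelihood of a linear model with unknown noise variance."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    sigma2_hat = resid @ resid / n
    return -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

def select_degree(y, x, max_degree=5):
    """Return the polynomial degree picked by AIC and by BIC."""
    n = len(y)
    scores = []
    for d in range(max_degree + 1):
        X = np.vander(x, d + 1, increasing=True)      # 1, x, ..., x^d
        ll = gaussian_profile_loglik(y, X)
        k = d + 2                                     # d+1 coefficients plus sigma^2
        scores.append((d, -2 * ll + 2 * k, -2 * ll + k * np.log(n)))
    return {"AIC": min(scores, key=lambda t: t[1])[0],
            "BIC": min(scores, key=lambda t: t[2])[0]}

rng = np.random.default_rng(3)
counts = {"AIC": np.zeros(6, dtype=int), "BIC": np.zeros(6, dtype=int)}
for _ in range(500):                                  # true model: a quadratic, n = 100
    x = rng.uniform(-2, 2, size=100)
    y = 1 + x - 0.5 * x**2 + rng.normal(0, 1, size=100)
    for crit, d in select_degree(y, x).items():
        counts[crit][d] += 1
print(counts)
# expect BIC to concentrate on degree 2 while AIC keeps selecting over-fitted degrees a
# non-vanishing fraction of the time, which is the usual consistency versus efficiency story
```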

The second paper was another classic, the original GLM paper by John Nelder and his coauthor Wedderburn, published in 1972 in Series B. A slightly easier paper, in that the notion of a generalised linear model is presented therein, with mathematical properties linking the (conditional) mean of the observation with the parameters, and several examples that could be discussed. Plus having the book as a backup. My student Ysé did a reasonable job in presenting the concepts, but she would have benefited from this extra week to properly include the computations she ran in R around the glm() function… (The definition of the deviance was somewhat deficient, although this led to a small discussion during the class as to how the analysis of deviance was extending the then flourishing analysis of variance.) In the generic definition of generalised linear models, I was also reminded of the generality of the nuisance parameter modelling, which made the part of interest appear as an exponential shift on the original (nuisance) density.
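
For the record, and in Python rather than R (so certainly not a substitute for the glm() runs, nor for Nelder and Wedderburn's own notation), here is a bare-bones illustration of the mechanism the paper introduces, a Poisson regression with log link fitted by iteratively reweighted least squares, together with its deviance:

```python
import numpy as np

def poisson_irls(X, y, n_iter=25):
    """Fit a Poisson regression with log link by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                     # linear predictor
        mu = np.exp(eta)                   # inverse link
        W = mu                             # working weights: (dmu/deta)^2 / Var(Y) = mu
        z = eta + (y - mu) / mu            # working response
        beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return beta

def poisson_deviance(y, mu):
    """2 * sum[ y log(y/mu) - (y - mu) ], with y log(y/mu) taken as 0 when y = 0."""
    term = np.zeros_like(mu)
    pos = y > 0
    term[pos] = y[pos] * np.log(y[pos] / mu[pos])
    return 2.0 * np.sum(term - (y - mu))

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(np.exp(0.5 + 0.8 * x))
beta_hat = poisson_irls(X, y)
print(beta_hat, poisson_deviance(y, np.exp(X @ beta_hat)))
```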

The third paper, presented by Bong, was yet another classic, namely the FDR paper, Controlling the false discovery rate, by Benjamini and Hochberg in Series B (a paper recently promoted to the should-have-been-a-Read-Paper category by the RSS Research Committee and discussed at the Annual RSS Conference in Edinburgh four years ago, with the discussion published in Series B). This 2010 discussion would actually have been a good start for discussing the paper in class, but Bong was not aware of it and mentioned earlier papers extending the 1995 classic. She gave a decent presentation of the problem and of the solution of Benjamini and Hochberg, but I wonder how much of the novelty of the concept the class grasped. (I presume everyone was getting tired by then, as I was the only one asking questions.) The slides somewhat made it look too much like a simulation experiment… (Unsurprisingly, the presentation did not include any Bayesian perspective on the approach, even though such perspectives are quite natural and emerged very quickly once the paper was published. I remember for instance the Valencia 7 meeting in Teneriffe where Larry Wasserman discussed the Bayesian-frequentist agreement in multiple testing.)
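
Since the procedure itself fits in a handful of lines, here is a standard rendition of the Benjamini-Hochberg step-up rule (not taken from the slides), which may convey the idea more directly than a simulation experiment: reject the k smallest p-values, where k is the largest index such that p_(k) ≤ αk/m.

```python
import numpy as np
from scipy.stats import norm

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean vector of rejections controlling the FDR at level alpha."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest index meeting the step-up condition
        reject[order[:k + 1]] = True       # reject that hypothesis and all smaller p-values
    return reject

# toy usage: 900 true nulls and 100 alternatives
rng = np.random.default_rng(5)
z = np.concatenate([rng.normal(0, 1, 900), rng.normal(3, 1, 100)])
pvals = 1 - norm.cdf(z)                    # one-sided p-values
rej = benjamini_hochberg(pvals, alpha=0.05)
print(rej.sum(), rej[:900].sum())          # total rejections, and false ones among the nulls
```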

reading classics (#1)

Posted in Books, Statistics, University life on October 31, 2013 by xi'an

Here we are, back in a new school year and with new students reading the classics. Today, Ilaria Masiani started the seminar with a presentation of the 2002 DIC paper by Spiegelhalter et al., already heavily mentioned on this blog. Here are the slides, posted on slideshare (if you know of another website for housing and displaying slides, let me know: the incompatibility between Firefox and slideshare drives me crazy! well, almost…)

I have already said a lot about DIC on this blog, so I will not repeat my reservations here. I enjoyed the link with the log scores and the Kullback-Leibler divergence, but failed to see a proper intuition for defining the effective number of parameters the way it is defined in the paper… The presentation was quite faithful to the original and, as is usual in the reading seminars (esp. the first one of the year), did not go far enough (for my taste) in the critical evaluation of the themes therein. Maybe an idea for next year would be to ask one of my PhD students to give the zeroth seminar…
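
To fix notation for the reservations above, here is a generic sample-based version of the effective number of parameters and of DIC as defined in the paper, in a sketch of my own that is not tied to any of the paper's examples:

```python
import numpy as np
from scipy import stats

def dic(deviance_draws, deviance_at_posterior_mean):
    """DIC = Dbar + p_D, with p_D = Dbar - D(theta_bar) computed from posterior draws."""
    d_bar = deviance_draws.mean()
    p_d = d_bar - deviance_at_posterior_mean     # effective number of parameters
    return d_bar + p_d, p_d

# toy usage: normal mean model with known variance and a flat prior
rng = np.random.default_rng(6)
y = rng.normal(2.0, 1.0, size=100)
mu_draws = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=5000)   # exact posterior draws

def deviance(mu):
    mu = np.atleast_1d(mu)
    return -2.0 * stats.norm.logpdf(y[None, :], mu[:, None], 1.0).sum(axis=1)

DIC, p_d = dic(deviance(mu_draws), deviance(mu_draws.mean())[0])
print(DIC, p_d)     # p_d should come out close to 1, the single free parameter of the model
```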

can you help?

Posted in Statistics, University life on October 12, 2013 by xi'an

An email received a few days ago:

Can you help me answering my query about AIC and DIC?

I want to compare the predictive power of a non Bayesian model (GWR, Geographically weighted regression) and a Bayesian hierarchical model (spLM).
For GWR, DIC is not defined, but AIC is.
For  spLM, AIC is not defined, but DIC is.

How can I compare the predictive ability of these two models? Does it make sense to compare AIC of one with DIC of the other?

I did not reply, as the answer is in the question: the numerical values of AIC and DIC are not comparable. And since one estimation is Bayesian while the other is not, I do not think the predictive abilities can be compared either. This is without even mentioning my reluctance to use DIC… as renewed in yesterday’s post.