back to Ockham’s razor

Posted in Statistics on July 31, 2019 by xi'an

“All in all, the Bayesian argument for selecting the MAP model as the single ‘best’ model is suggestive but not compelling.”

Last month, Jonty Rougier and Carey Priebe arXived a paper on Ockham’s factor, with a generalisation of a prior distribution acting as a regulariser, R(θ). They call on the late David MacKay to argue that the evidence involves the correct penalising factor, although they acknowledge that his central argument is not absolutely convincing, being based on a first-order Laplace approximation to the posterior distribution and hence “dubious”. The current approach stems from the candidate’s formula that is already at the core of Sid Chib’s method. The log evidence then decomposes as the maximum log-likelihood minus the log of the posterior-to-prior ratio at the MAP estimator, a quantity called the flexibility.
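To make the decomposition concrete, here is a minimal Python sketch of the candidate’s formula, log m(y) = log f(y|θ) − log{π(θ|y)/π(θ)}, which holds for any θ and in particular at the MAP. The toy model (normal observations with unit variance, a N(0, τ²) prior on the mean) and all function names are assumptions chosen for illustration, not taken from the paper. In this sketch the log-likelihood is evaluated at the MAP rather than at the maximum likelihood estimator.

```python
import math
import random

def log_normal_pdf(x, mean, var):
    """Log density of N(mean, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_evidence_exact(y, tau2):
    """Closed-form log evidence for y_i ~ N(theta, 1), theta ~ N(0, tau2)."""
    n, s, ss = len(y), sum(y), sum(v * v for v in y)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * math.log(1 + n * tau2)
            - 0.5 * (ss - tau2 * s * s / (1 + n * tau2)))

def log_post_to_prior_ratio(y, tau2, theta):
    """Log of posterior density over prior density at theta (conjugate normal case)."""
    post_var = 1.0 / (len(y) + 1.0 / tau2)
    post_mean = post_var * sum(y)
    return log_normal_pdf(theta, post_mean, post_var) - log_normal_pdf(theta, 0.0, tau2)

def log_evidence_candidate(y, tau2, theta):
    """Candidate's formula: log-likelihood at theta minus log posterior-to-prior ratio."""
    log_lik = sum(log_normal_pdf(v, theta, 1.0) for v in y)
    return log_lik - log_post_to_prior_ratio(y, tau2, theta)

random.seed(1)
y = [random.gauss(0.7, 1.0) for _ in range(50)]
tau2 = 4.0
theta_map = sum(y) / (len(y) + 1.0 / tau2)  # posterior mode, equal to the posterior mean here

print("exact log evidence      :", log_evidence_exact(y, tau2))
print("candidate's formula, MAP:", log_evidence_candidate(y, tau2, theta_map))
print("flexibility at the MAP  :", log_post_to_prior_ratio(y, tau2, theta_map))
```

The two evidence values agree up to rounding, and the last line is the penalty term, which visibly moves with the prior variance `tau2`.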

“Defining model complexity as flexibility unifies the Bayesian and Frequentist justifications for selecting a single model by maximizing the evidence.”

While they bring forward rational arguments to consider this as a measure of model complexity, it remains at an informal level in that other functions of this ratio could be used as well. This is especially hard to accept for non-Bayesians in that it (seriously) depends on the choice of the prior distribution, as all transforms of the evidence would. I am thus skeptical about the reception of the argument by frequentists…

Posted in Books, pictures, Statistics, University life on April 30, 2019 by xi'an

Ziheng Yang and Tianqi Zhu published a paper in PNAS last year that criticises Bayesian posterior probabilities used in the comparison of models under misspecification as “overconfident”. The paper is written from a phylogeneticist’s point of view, rather than from a statistician’s perspective, as shown by the Editor in charge of the paper [although I thought that, after Steve Fienberg’s intervention!, a statistician had to be involved in a submission relying on statistics!], but the analysis is rather problematic, at least seen through my own lenses… With no statistical novelty, apart from looking at the distribution of posterior probabilities in toy examples. The starting argument is that Bayesian model comparison often reports posterior probabilities in favour of a particular model that are close or even equal to 1.

“The Bayesian method is widely used to estimate species phylogenies using molecular sequence data. While it has long been noted to produce spuriously high posterior probabilities for trees or clades, the precise reasons for this over confidence are unknown. Here we characterize the behavior of Bayesian model selection when the compared models are misspecified and demonstrate that when the models are nearly equally wrong, the method exhibits unpleasant polarized behaviors, supporting one model with high confidence while rejecting others. This provides an explanation for the empirical observation of spuriously high posterior probabilities in molecular phylogenetics.”

The paper focuses on the tendency of posterior probabilities to strongly support one model against the others when the sample size is large enough, “even when” all models are wrong, the argument being apparently that the correct output should be one of equal probability between models, or maybe a uniform distribution of these model probabilities over the probability simplex. Why should it be so?! The construction of the posterior probabilities is based on a meta-model that assumes the generating model to be part of a list of mutually exclusive models. It does not account for cases where “all models are wrong” or cases where “all models are right”. The reported probability is furthermore epistemic, in that it is relative to the measure defined by the prior modelling, not to a promise of a frequentist stabilisation in an ill-defined asymptotia. By which I mean that a 99.3% probability of model M¹ being “true” does not have a universal and objective meaning. (Moderation note: the high polarisation of posterior probabilities was instrumental in our investigation of model choice with ABC tools and in proposing instead error rates in ABC random forests.)

The notion that two models are equally wrong because they are both exactly at the same Kullback-Leibler distance from the generating process (when optimised over the parameter) is such a formal [or cartoonesque] notion that it does not make much sense. There is always one model that is slightly closer and eventually takes over. It is also bizarre that the argument does not account for the complexity of each model and the resulting (Occam’s razor) penalty. Even two models with a single parameter are not necessarily of intrinsic dimension one, as shown by DIC. And thus it is not a surprise if the posterior probability mostly favours one versus the other. In any case, a healthily sceptical approach to Bayesian model choice means looking at the behaviour of the procedure (Bayes factor, posterior probability, posterior predictive, mixture weight, &tc.) under various assumptions (model M¹, M², &tc.) to calibrate the numerical value, rather than taking it at face value. By which I do not mean a frequentist evaluation of this procedure. Actually, it is rather surprising that the authors of the PNAS paper do not jump on the case when the posterior probability of model M¹, say, is uniformly distributed, since this would be a perfect setting where the posterior probability is a p-value. (This is also what happens to the bootstrapped version, see the last paragraph of the paper on p.1859, the year Darwin published his Origin of Species.)
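The polarisation at stake is easy to reproduce in a toy setting, and the sketch below is one such setting. It is an assumed example rather than the construction of the PNAS paper: data come from N(0, 1), while the two competing models are the parameter-free N(−μ, 1) and N(+μ, 1), hence exactly at the same Kullback-Leibler distance from the truth. The log Bayes factor is then −2μ Σxᵢ, a zero-mean random walk whose spread grows like √n, so the posterior probability ends up near 0 or 1 in most replications.

```python
import math
import random

def posterior_prob_model1(x, mu):
    """Posterior probability of M1: N(-mu, 1) against M2: N(+mu, 1), equal prior weights."""
    log_bf = -2.0 * mu * sum(x)  # sum over the sample of log f1(x_i) - log f2(x_i)
    # numerically stable logistic transform of the log Bayes factor
    if log_bf >= 0:
        return 1.0 / (1.0 + math.exp(-log_bf))
    e = math.exp(log_bf)
    return e / (1.0 + e)

def fraction_polarised(n, mu=0.5, reps=500, cutoff=0.99, seed=0):
    """Share of replications where one model gets posterior probability above the cutoff."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # truth is N(0, 1): both models are wrong
        p = posterior_prob_model1(x, mu)
        if p > cutoff or p < 1.0 - cutoff:
            extreme += 1
    return extreme / reps

for n in (10, 100, 1000):
    print(n, fraction_polarised(n))
```

Which model wins is a coin toss in this symmetric setting; with the slightest asymmetry in the Kullback-Leibler distances the closer model eventually takes over.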

Bayesian methods in cosmology

Posted in Statistics on January 18, 2017 by xi'an

A rather massive document was arXived a few days ago by Roberto Trotta on Bayesian methods for cosmology, in conjunction with an earlier winter school, the 44th Saas Fee Advanced Course on Astronomy and Astrophysics, “Cosmology with wide-field surveys”. While I never had the opportunity to give a winter school in Saas Fee, I will give a course on ABC next month to statistics graduates in another Swiss dream location, Les Diablerets. And next Fall a course on ABC again but to astronomers and cosmologists, in Autrans, near Grenoble.

The course document is an 80-page introduction to probability and statistics, in particular Bayesian inference and Bayesian model choice. Including exercises and references. As such, it is rather standard in that the material could be found as well in textbooks. Statistics textbooks.

When introducing the Bayesian perspective, Roberto Trotta advances several arguments in favour of this approach. The first one is that it is generally easier to follow a Bayesian approach when compared with seeking a non-Bayesian one, while recovering long-term properties. (Although there are inconsistent Bayesian settings.) The second one is that Bayesian modelling handles nuisance parameters naturally, because there are essentially no nuisance parameters. (Even though preventing small world modelling may lead to difficulties as in the Robbins-Wasserman paradox.) The following two reasons are the incorporation of prior information and the appeal of conditioning on the actual data.

The document also includes a nice illustration of the concentration of measure as the dimension of the parameter increases. (Although one should not over-interpret it. The concentration does not occur in the same way for a normal distribution for instance.) It further spends quite some space on the Bayes factor, its scaling as a natural Occam’s razor, and the comparison with p-values, before (unsurprisingly) introducing nested sampling. And the Savage-Dickey ratio. The conclusion of this model choice section proposes some open problems, with a rather unorthodox (in the Bayesian sense) line on the justification of priors and the notion of a “correct” prior (yeech!), plus a musing about adopting a loss function, with which I quite agree.
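For readers without the figure at hand, the following Python sketch illustrates the two kinds of concentration alluded to. The specific quantities are assumptions picked for illustration and may differ from those plotted in the course notes: the share of the cube [−1, 1]^d occupied by the inscribed ball, the share of a ball’s volume lying within 5% of its surface, and the distribution of the rescaled norm ‖x‖/√d of a standard normal vector.

```python
import math
import random

def ball_to_cube_volume_ratio(d):
    """Volume of the unit-radius ball divided by that of the enclosing cube [-1, 1]^d."""
    return math.pi ** (d / 2) / (math.gamma(d / 2 + 1) * 2 ** d)

def outer_shell_fraction(d, eps=0.05):
    """Fraction of a d-dimensional ball's volume within relative distance eps of its surface."""
    return 1.0 - (1.0 - eps) ** d

def gaussian_radius_summary(d, reps=1000, seed=0):
    """Monte Carlo mean and standard deviation of ||x|| / sqrt(d) for x ~ N(0, I_d)."""
    rng = random.Random(seed)
    radii = []
    for _ in range(reps):
        squared_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        radii.append(math.sqrt(squared_norm / d))
    mean = sum(radii) / reps
    sd = math.sqrt(sum((r - mean) ** 2 for r in radii) / (reps - 1))
    return mean, sd

for d in (2, 10, 100):
    mean, sd = gaussian_radius_summary(d)
    print(d, ball_to_cube_volume_ratio(d), outer_shell_fraction(d), mean, sd)
```

Under the uniform distribution the mass drifts to the corners of the cube and to the surface of the ball, whereas the normal mass settles on a thin shell of radius about √d, which is a different picture.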

a Bayesian criterion for singular models [discussion]

Posted in Books, Statistics, University life on October 10, 2016 by xi'an

[Here is the discussion Judith Rousseau and I wrote about the paper by Mathias Drton and Martyn Plummer, a Bayesian criterion for singular models, which was discussed last week at the Royal Statistical Society. There is still time to send a written discussion! Note: This post was written using the latex2wp converter.]

It is a well-known fact that the BIC approximation of the marginal likelihood in a given irregular model ${\mathcal M_k}$ fails or may fail. The BIC approximation has the form

$\displaystyle BIC_k = \log p(\mathbf Y_n| \hat \pi_k, \mathcal M_k) - d_k \log n /2$

where ${d_k }$ corresponds to the number of parameters to be estimated in model ${\mathcal M_k}$. In irregular models the dimension ${d_k}$ typically does not provide a good measure of complexity for model ${\mathcal M_k}$, at least in the sense that it does not lead to an approximation of

$\displaystyle \log m(\mathbf Y_n |\mathcal M_k) = \log \left( \int_{\mathcal M_k} p(\mathbf Y_n| \pi_k, \mathcal M_k) dP(\pi_k|k )\right) \,.$
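By way of contrast with the irregular case, here is a minimal Python sketch of how closely BIC tracks the exact log marginal likelihood in a regular model. The model (observations N(θ, 1) with a N(0, τ²) prior on θ, so that ${d_k=1}$) and the function names are assumptions made for illustration only; BIC follows the sign convention of the display above.

```python
import math
import random

def max_log_lik(y):
    """Log-likelihood of y_i ~ N(theta, 1) at the maximum likelihood estimate theta = mean(y)."""
    n = len(y)
    ybar = sum(y) / n
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((v - ybar) ** 2 for v in y)

def bic(y, d=1):
    """BIC as maximised log-likelihood minus d log(n) / 2."""
    return max_log_lik(y) - 0.5 * d * math.log(len(y))

def log_marginal(y, tau2=1.0):
    """Exact log marginal likelihood under the prior theta ~ N(0, tau2)."""
    n, s, ss = len(y), sum(y), sum(v * v for v in y)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * math.log(1 + n * tau2)
            - 0.5 * (ss - tau2 * s * s / (1 + n * tau2)))

rng = random.Random(0)
for n in (10, 100, 1000, 10000):
    y = [rng.gauss(0.5, 1.0) for _ in range(n)]
    print(n, round(bic(y), 3), round(log_marginal(y), 3), round(log_marginal(y) - bic(y), 3))
```

The gap in the last column stays bounded as n grows, which is the O_p(1) agreement that fails when ${d_k}$ is not the right measure of complexity.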

A way to understand the behaviour of ${\log m(\mathbf Y_n |\mathcal M_k) }$ is through the effective dimension

$\displaystyle \tilde d_k = -\lim_n \frac{ \log P( \{ KL(p(\mathbf Y_n| \pi_0, \mathcal M_k) , p(\mathbf Y_n| \pi_k, \mathcal M_k) ) \leq 1/n \} | k ) }{ \log n}$

when it exists, see for instance the discussions in Chambaz and Rousseau (2008) and Rousseau (2007). Watanabe (2009) provided a more precise formula, which is the starting point of the approach of Drton and Plummer:

$\displaystyle \log m(\mathbf Y_n |\mathcal M_k) = \log p(\mathbf Y_n| \hat \pi_k, \mathcal M_k) - \lambda_k(\pi_0) \log n + [m_k(\pi_0) - 1] \log \log n + O_p(1)$

where ${\pi_0}$ is the true parameter. The authors propose a clever algorithm to approximate the marginal likelihood. Given the popularity of the BIC criterion for model choice, obtaining a relevant penalized likelihood when the models are singular is an important issue and we congratulate the authors for it. Indeed a major advantage of the BIC formula is that it is an off-the-shelf criterion which is implemented in many software packages, and thus can be used easily by non-statisticians. In the context of singular models, a more refined approach needs to be considered and, although the algorithm proposed by the authors remains quite simple, it requires that the functions ${ \lambda_k(\pi)}$ and ${m_k(\pi)}$ be known in advance, which so far limits the number of problems that can be thus processed. In this regard their equation (3.2) is both puzzling and attractive. Attractive because it invokes nonparametric principles to estimate the underlying distribution; puzzling because why should we engage in deriving an approximation like (3.1) and call for Bayesian principles when (3.1) is at best an approximation? In this case why not just use a true marginal likelihood?
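As a point of comparison (a standard fact about regular models, recalled here rather than taken from the discussion itself): when ${\mathcal M_k}$ is regular, the learning coefficient is ${\lambda_k(\pi_0) = d_k/2}$ and the multiplicity is ${m_k(\pi_0) = 1}$, so the ${\log \log n}$ term vanishes and Watanabe's expansion reduces to the BIC approximation,

$\displaystyle \log m(\mathbf Y_n |\mathcal M_k) = \log p(\mathbf Y_n| \hat \pi_k, \mathcal M_k) - \frac{d_k}{2} \log n + O_p(1) = BIC_k + O_p(1) \,.$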

1. Why do we want to use a BIC type formula?

The BIC formula can be viewed from a purely frequentist perspective, as an example of penalised likelihood. The difficulty then lies in choosing the penalty and a common view on these approaches is to choose the smallest possible penalty that still leads to consistency of the model choice procedure, since it then enjoys better separation rates. In this case a ${\log \log n}$ penalty is sufficient, as proved in Gassiat et al. (2013). Now whether or not this is a desirable property is entirely debatable, and one might advocate that for a given sample size, if the data fits the smallest model (almost) equally well, then this model should be chosen. But unless one specifies what equally well means, it does not add much to the debate. This also explains the popularity of the BIC formula (in regular models), since it approximates the marginal likelihood and thus benefits from the Bayesian justification of the measure of fit of a model for a given data set, often described as a Bayesian Ockham’s razor. But then why should we not compute instead the marginal likelihood? Typical answers to this question that are in favour of BIC-type formulae include: (1) BIC is supposedly easier to compute and (2) BIC does not call for a specification of the prior on the parameters within each model. Given that the latter is a difficult task and that the prior can be highly influential in non-regular models, this may sound like a good argument. However, it is only apparently so, since the only justification of BIC is purely asymptotic, namely, in such a regime the difficulties linked to the choice of the prior disappear. This is even more the case for the sBIC criterion, since it is only valid if the parameter space is compact. Then the impact of the prior becomes less of an issue as non-informative priors can typically be used.
With all due respect, the solution proposed by the authors, namely to use the posterior mean or the posterior mode to allow for non-compact parameter spaces, does not seem to make sense in this regard since they depend on the prior. The same comments apply to the authors’ discussion on Priors matter for sBIC. Indeed variations of the sBIC could be obtained by penalizing for bigger models via the prior on the weights, for instance as in Mengersen and Rousseau (2011) or by considering repulsive priors as in Petralia et al. (2012), but then it becomes more meaningful to (again) directly compute the marginal likelihood. What remains (as an argument in its favour) is the relative computational ease of use of sBIC, when compared with the marginal likelihood. This simplification is however achieved at the expense of requiring a deeper knowledge of the behaviour of the models and it therefore loses the off-the-shelf appeal of the BIC formula and the range of applications of the method, at least so far. Although the dependence of the approximation of ${\log m(\mathbf Y_n |\mathcal M_k)}$ on ${\mathcal M_j }$, ${j \leq k}$, is strange, this does not seem crucial, since marginal likelihoods in themselves bring little information and they are only meaningful when compared to other marginal likelihoods. It becomes much more of an issue in the context of a large number of models.

2. Should we care so much about penalized or marginal likelihoods?

Marginal or penalized likelihoods are exploratory tools in a statistical analysis, as one is trying to define a reasonable model to fit the data. An unpleasant feature of these tools is that they provide numbers which in themselves do not have much meaning and can only be used in comparison with others and without any notion of uncertainty attached to them. A somewhat richer approach to exploratory analysis is to interrogate the posterior distributions by either varying the priors or by varying the loss functions. The former has been proposed in van Havre et al. (2016) in mixture models using the prior tempering algorithm. The latter has been used for instance by Yau and Holmes (2013) for segmentation based on Hidden Markov models. Introducing a decision-analytic perspective in the construction of information criteria sounds to us like a reasonable requirement, especially when accounting for the current surge in studies of such aspects.

[Posted as arXiv:1610.02503]

whetstone and alum block for Occam’s razor

Posted in Statistics, University life on August 1, 2013 by xi'an

A strange title, if any! (The whetstone is a natural hard stone used for sharpening steel instruments, like knives or sickles and scythes; I remember my grand-fathers handling one when cutting hay and weeds. Alum is hydrated potassium aluminium sulphate and is used as a blood coagulant. Both items are naturally related with shaving and razors, if not with Occam!) The whole title of the paper published by Guido Consonni, Jon Forster and Luca La Rocca in Statistical Science is “The whetstone and the alum block: balanced objective Bayesian comparison of nested models for discrete data”. The paper builds on the notions introduced in the last Valencia meeting by Guido and Luca (and discussed by Judith Rousseau and myself).

Beyond the pun (that forced me to look for “alum stone” on Wikipedia!, and may be lost on some other non-native readers), the point in the title is to build a prior distribution aimed at the comparison of two models such that those models are more sharply distinguished: Occam’s razor would thus cut better when the smaller model is true (hence the whetstone) and less when it is not (hence the alum block)… The solution proposed by the authors is to replace the reference prior on the larger model, π1, with a moment prior à la Johnson and Rossell (2010, JRSS B) and then to turn this moment prior into an intrinsic prior à la Pérez and Berger (2002, Biometrika), making it an “intrinsic moment”. The first transform turns π1 into a non-local prior, with the aim of correcting for the imbalanced convergence rates of the Bayes factor under the null and under the alternative (this is the whetstone). The second transform accumulates more mass in the vicinity of the null model (this is the alum block). (While I like the overall perspective on intrinsic priors, the introduction is a wee confusing about them, e.g. when it mentions fictive observations instead of predictives.)
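To see what the first transform does, here is a small Python sketch of a moment prior in the sense of Johnson and Rossell, obtained by multiplying a local prior by (θ − θ₀)² and renormalising. The normal base prior, the null value θ₀ = 0 and the function names are assumptions for illustration only: the paper itself deals with discrete data models and its priors differ, and the intrinsic step is not reproduced.

```python
import math

def normal_pdf(theta, var):
    """Density of the local base prior N(0, var)."""
    return math.exp(-0.5 * theta * theta / var) / math.sqrt(2 * math.pi * var)

def moment_prior_pdf(theta, var):
    """First-order normal moment prior: theta^2 / var times the N(0, var) density."""
    # dividing by var = E[theta^2] under the base prior makes the density integrate to one
    return theta * theta / var * normal_pdf(theta, var)

def integrate(f, lo, hi, steps=20000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

var = 2.0
print("local prior at the null :", normal_pdf(0.0, var))
print("moment prior at the null:", moment_prior_pdf(0.0, var))
print("total mass              :", integrate(lambda t: moment_prior_pdf(t, var), -30.0, 30.0))
```

The moment prior vanishes at the null value while remaining a proper density, which is what makes it non-local and lets evidence for the smaller model accumulate faster.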

Being a referee for this paper, I read it in detail (and also because this is one of my favourite research topics!). Further, we have already engaged in a fruitful discussion with Guido since the last Valencia meeting and the current paper incorporates some of our comments (and replies to others). I find the proposal of the authors clever and interesting, but not completely Bayesian. Overall, the paper provides a clearly novel methodology that calls for further studies…

Randomness through computation

Posted in Books, Statistics, University life on June 22, 2011 by xi'an

A few months ago, I received a puzzling advertisement for this book, Randomness through Computation, and I eventually ordered it, despite getting a rather negative impression from reading the chapter written by Tommaso Toffoli… The book as a whole is definitely perplexing (even when correcting for this initial bias) and I would not recommend it to readers interested in simulation, in computational statistics or even in the philosophy of randomness. My overall feeling is indeed that, while there are genuinely informative and innovative chapters in this book, some chapters read more like newspeak than scientific material (mixing the Second Law of Thermodynamics, Gödel’s incompleteness theorem, quantum physics, and NP completeness within the same sentence) and do not provide a useful entry on the issue of randomness. Hence, the book is not contributing in a significant manner to my understanding of the notion. (This post also appeared on the Statistics Forum.)

Evidence and evolution (3)

Posted in Books, Statistics on April 17, 2010 by xi'an

“To test a theory, you need to test it against alternatives.” (E&E, p.190)

After a gruesome (!) trek through Chapter 3 of Sober’s Evidence and Evolution: The Logic Behind the Science, I am now done with this chapter entitled “Natural selection”. The chapter is difficult to read (for someone like me) in that it seems overly repetitive, using somewhat obvious arguments while missing clearcut conclusions and directions. This bent must be due to the philosophical priorities of the author but, despite opposing Brownian motion to Ornstein-Uhlenbeck processes at the beginning of the chapter (which would make for a neat parametric model comparison setting), there is no quantitative argument nor illustration found in this third chapter that would relate to statistics. This is unfortunate as the questions of interest (testing for natural selection versus pure drift or versus phylogenetic inertia or yet for tree structure in phylogenetics) could clearly be conducted at a numerical level as well, through the AIC factor or through a Bayesian alternative. The aspects I found most interesting in this chapter may therefore be deemed as marginalia by most readers, namely (a) the discussion of whether the outcome of a test should depend at all on the modelling assumptions (the author seems to doubt this, hence relegating Bayesian techniques to their dust-gathering shelves!), and (b) the point that parsimony is not a criterion per se.

“‘Data! Data! Data!’ he cried impatiently, ‘I cannot make bricks without clay!’” (Sherlock Holmes, The adventure of the copper beeches)

About the first point, the philosophical stance of the author is not completely foolproof in that he concedes that testing hypotheses without accounting for the alternative is not acceptable. My impression is that he looks at the problem from a purely dichotomous perspective, the hypothesis or [exclusive OR] the alternative being true. This is a bit caricatural as he integrates the issue of calibrating parameters under the different hypotheses, but there is a sort of logical discrepancy lurking in the background of the argument. Again, working out a fully Bayesian analysis of a phylogenetic tree, say, would have clarified the issue immensely! And rejecting Bayesianism (sic!) because “there is no objective basis for producing an answer” (p.239) is a wee limited on the epistemological side! Even though I understand that the book is not trying to debate about the support for a specific evolutionary hypothesis but rather about the methods used to test such hypotheses and the logic behind these, a completely worked-out example would have made my appreciation (and maybe other readers’) of Sober’s points much easier. And, again, I fail to see who could benefit from reading this chapter. A biologist will most likely integrate the arguments and illustrations provided by Sober but could leave the chapter with a feeling of frustration at the apparent lack of conclusion. (As a statistician, I fail to understand how the likelihoods repeatedly mentioned by Sober can be computed because they never involve any parameter.)

“Parsimony does not provide a justification for ignoring the data.” (E&E, p.250)

Since I am interested in general in the negative impact of the “Ockham’s razor” argument, I find the warning signals about parsimony (given in the last third of the chapter) more palatable. Parsimony being an ill-defined concept, especially from a statistical perspective (where even the dimension of the parameter space is debatable), no model selection is acceptable if only based on this argument.

“Instead of evaluating hypotheses in terms of how probable they say the data are, we evaluate them by estimating how accurately they’ll predict new data when fitted to old.” (E&E, p.229)

The chapter also addresses the distinction between hypothesis testing and model selection as paramount —a point I subscribed to for a long while before being convinced of the opposite by Peter Green and Jean-Michel Marin—, but I cannot get to the core of this argument. It seems Sober sees model selection through the predictive performances of the models under comparison, if the above quote is representative of his thesis. (Overall, I find the style of the chapter slightly uneven, as if the fact that some sections are adapted from earlier papers would make for different levels of depth.)

Statistically speaking, this chapter also has a difficulty with the continuity assumption. To make this point more precise, I notice there is a long discussion about reaching the optimum configuration (for polar bear fur length) under the SPD hypothesis, but I think evolution happens in discontinuous moves. The case about the local minimum in Section 3.4 is characteristic of this difficulty as a “valley” on a “fitness curve” that in essence takes three possible values over the three different types of eye designs does not really constitute a bottleneck in the optimisation process. Similarly, the temporal structure of the statistical models in Sections 3.3 and 3.5 is never mentioned, even though it needs to be defined for the tests to take place. The past versus current convergence to stationarity or equilibrium and hence to optimality under the SPD hypothesis is an issue (are we there yet?!) and so is the definition of time in the very simple 2×2 Markov chain example… And given a 2×2 contingency table like

$\begin{matrix} &\text{fixed} &\text{polymorphic}\\ \text{synonymous} &17 &42 \\ \text{nonsynonymous} &7 &2\\ \end{matrix}$

testing for independence between both factors is a standard among the standards: I thus fail to understand the lengthy and inconclusive discussion of pp.240-243.
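For the record, the standard analysis of the above table takes a few lines. The Python sketch below is an illustration only, written with the standard library and with hypothetical function names: it computes the Pearson chi-square statistic without continuity correction (with its one-degree-of-freedom p-value) and a two-sided Fisher exact p-value, the latter being the safer choice here since one expected count falls below five.

```python
import math

table = [[17, 42],  # synonymous: fixed, polymorphic
         [7, 2]]    # nonsynonymous: fixed, polymorphic

def chi_square_independence(t):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    rows = [sum(r) for r in t]
    cols = [t[0][j] + t[1][j] for j in range(2)]
    n = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (t[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2.0))  # survival function of chi-square with 1 df
    return stat, p_value

def fisher_exact_two_sided(t):
    """Two-sided Fisher exact p-value: total probability of tables no more likely than observed."""
    (a, b), (c, d) = t
    r1, c1, n = a + b, a + c, a + b + c + d

    def pmf(k):
        # hypergeometric probability that the top-left cell equals k, margins being fixed
        return math.comb(r1, k) * math.comb(n - r1, c1 - k) / math.comb(n, c1)

    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    p_obs = pmf(a)
    return sum(pmf(k) for k in range(lo, hi + 1) if pmf(k) <= p_obs * (1 + 1e-9))

stat, p_chi2 = chi_square_independence(table)
print("chi-square statistic:", stat)
print("chi-square p-value  :", p_chi2)
print("Fisher exact p-value:", fisher_exact_two_sided(table))
```

Both p-values fall below 0.01, so independence between the two factors is rejected at the usual levels.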