Archive for Ockham’s razor

Naturally amazed at non-identifiability

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on May 27, 2020 by xi'an

A Nature paper by Stilianos Louca and Matthew W. Pennell,  Extant time trees are consistent with a myriad of diversification histories, comes to the extraordinary conclusion that birth-&-death evolutionary models cannot distinguish between several scenarios given the available data! Namely, stem ages and daughter lineage ages cannot identify the speciation rate function λ(.), the extinction rate function μ(.)  and the sampling fraction ρ inherently defining the deterministic ODE leading to the number of species predicted at any point τ in time, N(τ). The Nature paper does not seem to make a point beyond the obvious and I am rather perplexed at why it got published [and even highlighted]. A while ago, under the leadership of Steve, PNAS decided to include statistician reviewers for papers relying on statistical arguments. It could time for Nature to move there as well.

“We thus conclude that two birth-death models are congruent if and only if they have the same rp and the same λp at some time point in the present or past.” [S.1.1, p.4]

Or, stated otherwise, that a tree structured dataset made of branch lengths are not enough to identify two functions that parameterise the model. The likelihood looks like

\frac{\rho^{n-1}\Psi(\tau_1,\tau_0)}{1-E(\tau)}\prod_{i=1}^n \lambda(\tau_i)\Psi(s_{i,1},\tau_i)\Psi(s_{i,2},\tau_i)$

where E(.) is the probability to survive to the present and ψ(s,t) the probability to survive and be sampled between times s and t. Sort of. Both functions depending on functions λ(.) and  μ(.). (When the stem age is unknown, the likelihood changes a wee bit, but with no changes in the qualitative conclusions. Another way to write this likelihood is in term of the speciation rate λp


where Λp is the integrated rate, but which shares the same characteristic of being unable to identify the functions λ(.) and μ(.). While this sounds quite obvious the paper (or rather the supplementary material) goes into fairly extensive mode, including “abstract” algebra to define congruence.


“…we explain why model selection methods based on parsimony or “Occam’s razor”, such as the Akaike Information Criterion and the Bayesian Information Criterion that penalize excessive parameters, generally cannot resolve the identifiability issue…” [S.2, p15]

As illustrated by the above quote, the supplementary material also includes a section about statistical model selections techniques failing to capture the issue, section that seems superfluous or even absurd once the fact that the likelihood is constant across a congruence class has been stated.

back to Ockham’s razor

Posted in Statistics with tags , , , , , , , , , on July 31, 2019 by xi'an

“All in all, the Bayesian argument for selecting the MAP model as the single ‘best’ model is suggestive but not compelling.”

Last month, Jonty Rougier and Carey Priebe arXived a paper on Ockham’s factor, with a generalisation of a prior distribution acting as a regulariser, R(θ). Calling on the late David MacKay to argue that the evidence involves the correct penalising factor although they acknowledge that his central argument is not absolutely convincing, being based on a first-order Laplace approximation to the posterior distribution and hence “dubious”. The current approach stems from the candidate’s formula that is already at the core of Sid Chib’s method. The log evidence then decomposes as the sum of the maximum log-likelihood minus the log of the posterior-to-prior ratio at the MAP estimator. Called the flexibility.

“Defining model complexity as flexibility unifies the Bayesian and Frequentist justifications for selecting a single model by maximizing the evidence.”

While they bring forward rational arguments to consider this as a measure model complexity, it remains at an informal level in that other functions of this ratio could be used as well. This is especially hard to accept by non-Bayesians in that it (seriously) depends on the choice of the prior distribution, as all transforms of the evidence would. I am thus skeptical about the reception of the argument by frequentists…

over-confident about mis-specified models?

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , , , , on April 30, 2019 by xi'an

Ziheng Yang and Tianqui Zhu published a paper in PNAS last year that criticises Bayesian posterior probabilities used in the comparison of models under misspecification as “overconfident”. The paper is written from a phylogeneticist point of view, rather than from a statistician’s perspective, as shown by the Editor in charge of the paper [although I thought that, after Steve Fienberg‘s intervention!, a statistician had to be involved in a submission relying on statistics!] a paper , but the analysis is rather problematic, at least seen through my own lenses… With no statistical novelty, apart from looking at the distribution of posterior probabilities in toy examples. The starting argument is that Bayesian model comparison is often reporting posterior probabilities in favour of a particular model that are close or even equal to 1.

“The Bayesian method is widely used to estimate species phylogenies using molecular sequence data. While it has long been noted to produce spuriously high posterior probabilities for trees or clades, the precise reasons for this over confidence are unknown. Here we characterize the behavior of Bayesian model selection when the compared models are misspecified and demonstrate that when the models are nearly equally wrong, the method exhibits unpleasant polarized behaviors,supporting one model with high confidence while rejecting others. This provides an explanation for the empirical observation of spuriously high posterior probabilities in molecular phylogenetics.”

The paper focus on the behaviour of posterior probabilities to strongly support a model against others when the sample size is large enough, “even when” all models are wrong, the argument being apparently that the correct output should be one of equal probability between models, or maybe a uniform distribution of these model probabilities over the probability simplex. Why should it be so?! The construction of the posterior probabilities is based on a meta-model that assumes the generating model to be part of a list of mutually exclusive models. It does not account for cases where “all models are wrong” or cases where “all models are right”. The reported probability is furthermore epistemic, in that it is relative to the measure defined by the prior modelling, not to a promise of a frequentist stabilisation in a ill-defined asymptotia. By which I mean that a 99.3% probability of model M¹ being “true”does not have a universal and objective meaning. (Moderation note: the high polarisation of posterior probabilities was instrumental in our investigation of model choice with ABC tools and in proposing instead error rates in ABC random forests.)

The notion that two models are equally wrong because they are both exactly at the same Kullback-Leibler distance from the generating process (when optimised over the parameter) is such a formal [or cartoonesque] notion that it does not make much sense. There is always one model that is slightly closer and eventually takes over. It is also bizarre that the argument does not account for the complexity of each model and the resulting (Occam’s razor) penalty. Even two models with a single parameter are not necessarily of intrinsic dimension one, as shown by DIC. And thus it is not a surprise if the posterior probability mostly favours one versus the other. In any case, an healthily sceptic approach to Bayesian model choice means looking at the behaviour of the procedure (Bayes factor, posterior probability, posterior predictive, mixture weight, &tc.) under various assumptions (model M¹, M², &tc.) to calibrate the numerical value, rather than taking it at face value. By which I do not mean a frequentist evaluation of this procedure. Actually, it is rather surprising that the authors of the PNAS paper do not jump on the case when the posterior probability of model M¹ say is uniformly distributed, since this would be a perfect setting when the posterior probability is a p-value. (This is also what happens to the bootstrapped version, see the last paragraph of the paper on p.1859, the year Darwin published his Origin of Species.)

Bayesian methods in cosmology

Posted in Statistics with tags , , , , , , , , , , , , on January 18, 2017 by xi'an

A rather massive document was arXived a few days ago by Roberto Trotta on Bayesian methods for cosmology, in conjunction with an earlier winter school, the 44th Saas Fee Advanced Course on Astronomy and Astrophysics, “Cosmology with wide-field surveys”. While I never had the opportunity to give a winter school in Saas Fee, I will give next month a course on ABC to statistics graduates in another Swiss dream location, Les Diablerets.  And next Fall a course on ABC again but to astronomers and cosmologists, in Autrans, near Grenoble.

The course document is an 80 pages introduction to probability and statistics, in particular Bayesian inference and Bayesian model choice. Including exercises and references. As such, it is rather standard in that the material could be found as well in textbooks. Statistics textbooks.

When introducing the Bayesian perspective, Roberto Trotta advances several arguments in favour of this approach. The first one is that it is generally easier to follow a Bayesian approach when compared with seeking a non-Bayesian one, while recovering long-term properties. (Although there are inconsistent Bayesian settings.) The second one is that a Bayesian modelling allows to handle naturally nuisance parameters, because there are essentially no nuisance parameters. (Even though preventing small world modelling may lead to difficulties as in the Robbins-Wasserman paradox.) The following two reasons are the incorporation of prior information and the appeal on conditioning on the actual data.

trottaThe document also includes this above and nice illustration of the concentration of measure as the dimension of the parameter increases. (Although one should not over-interpret it. The concentration does not occur in the same way for a normal distribution for instance.) It further spends quite some space on the Bayes factor, its scaling as a natural Occam’s razor,  and the comparison with p-values, before (unsurprisingly) introducing nested sampling. And the Savage-Dickey ratio. The conclusion of this model choice section proposes some open problems, with a rather unorthodox—in the Bayesian sense—line on the justification of priors and the notion of a “correct” prior (yeech!), plus an musing about adopting a loss function, with which I quite agree.

a Bayesian criterion for singular models [discussion]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , on October 10, 2016 by xi'an

London Docks 12/02/09[Here is the discussion Judith Rousseau and I wrote about the paper by Mathias Drton and Martyn Plummer, a Bayesian criterion for singular models, which was discussed last week at the Royal Statistical Society. There is still time to send a written discussion! Note: This post was written using the latex2wp converter.]

It is a well-known fact that the BIC approximation of the marginal likelihood in a given irregular model {\mathcal M_k} fails or may fail. The BIC approximation has the form

\displaystyle BIC_k = \log p(\mathbf Y_n| \hat \pi_k, \mathcal M_k) - d_k \log n /2

where {d_k } corresponds on the number of parameters to be estimated in model {\mathcal M_k}. In irregular models the dimension {d_k} typically does not provide a good measure of complexity for model {\mathcal M_k}, at least in the sense that it does not lead to an approximation of

\displaystyle \log m(\mathbf Y_n |\mathcal M_k) = \log \left( \int_{\mathcal M_k} p(\mathbf Y_n| \pi_k, \mathcal M_k) dP(\pi_k|k )\right) \,.

A way to understand the behaviour of {\log m(\mathbf Y_n |\mathcal M_k) } is through the effective dimension

\displaystyle \tilde d_k = -\lim_n \frac{ \log P( \{ KL(p(\mathbf Y_n| \pi_0, \mathcal M_k) , p(\mathbf Y_n| \pi_k, \mathcal M_k) ) \leq 1/n | k ) }{ \log n}

when it exists, see for instance the discussions in Chambaz and Rousseau (2008) and Rousseau (2007). Watanabe (2009} provided a more precise formula, which is the starting point of the approach of Drton and Plummer:

\displaystyle \log m(\mathbf Y_n |\mathcal M_k) = \log p(\mathbf Y_n| \hat \pi_k, \mathcal M_k) - \lambda_k(\pi_0) \log n + [m_k(\pi_0) - 1] \log \log n + O_p(1)

where {\pi_0} is the true parameter. The authors propose a clever algorithm to approximate of the marginal likelihood. Given the popularity of the BIC criterion for model choice, obtaining a relevant penalized likelihood when the models are singular is an important issue and we congratulate the authors for it. Indeed a major advantage of the BIC formula is that it is an off-the-shelf crierion which is implemented in many softwares, thus can be used easily by non statisticians. In the context of singular models, a more refined approach needs to be considered and although the algorithm proposed by the authors remains quite simple, it requires that the functions { \lambda_k(\pi)} and {m_k(\pi)} need be known in advance, which so far limitates the number of problems that can be thus processed. In this regard their equation (3.2) is both puzzling and attractive. Attractive because it invokes nonparametric principles to estimate the underlying distribution; puzzling because why should we engage into deriving an approximation like (3.1) and call for Bayesian principles when (3.1) is at best an approximation. In this case why not just use a true marginal likelihood?

1. Why do we want to use a BIC type formula?

The BIC formula can be viewed from a purely frequentist perspective, as an example of penalised likelihood. The difficulty then stands into choosing the penalty and a common view on these approaches is to choose the smallest possible penalty that still leads to consistency of the model choice procedure, since it then enjoys better separation rates. In this case a {\log \log n} penalty is sufficient, as proved in Gassiat et al. (2013). Now whether or not this is a desirable property is entirely debatable, and one might advocate that for a given sample size, if the data fits the smallest model (almost) equally well, then this model should be chosen. But unless one is specifying what equally well means, it does not add much to the debate. This also explains the popularity of the BIC formula (in regular models), since it approximates the marginal likelihood and thus benefits from the Bayesian justification of the measure of fit of a model for a given data set, often qualified of being a Bayesian Ockham’s razor. But then why should we not compute instead the marginal likelihood? Typical answers to this question that are in favour of BIC-type formula include: (1) BIC is supposingly easier to compute and (2) BIC does not call for a specification of the prior on the parameters within each model. Given that the latter is a difficult task and that the prior can be highly influential in non-regular models, this may sound like a good argument. However, it is only apparently so, since the only justification of BIC is purely asymptotic, namely, in such a regime the difficulties linked to the choice of the prior disappear. This is even more the case for the sBIC criterion, since it is only valid if the parameter space is compact. Then the impact of the prior becomes less of an issue as non informative priors can typically be used. With all due respect, the solution proposed by the authors, namely to use the posterior mean or the posterior mode to allow for non compact parameter spaces, does not seem to make sense in this regard since they depend on the prior. The same comments apply to the author’s discussion on Prior’s matter for sBIC. Indeed variations of the sBIC could be obtained by penalizing for bigger models via the prior on the weights, for instance as in Mengersen and Rousseau (2011) or by, considering repulsive priors as in Petralia et al. (20120, but then it becomes more meaningful to (again) directly compute the marginal likelihood. Remains (as an argument in its favour) the relative computational ease of use of sBIC, when compared with the marginal likelihood. This simplification is however achieved at the expense of requiring a deeper knowledge on the behaviour of the models and it therefore looses the off-the-shelf appeal of the BIC formula and the range of applications of the method, at least so far. Although the dependence of the approximation of {\log m(\mathbf Y_n |\mathcal M_k)} on {\mathcal M_j }, $latex {j \leq k} is strange, this does not seem crucial, since marginal likelihoods in themselves bring little information and they are only meaningful when compared to other marginal likelihoods. It becomes much more of an issue in the context of a large number of models.

2. Should we care so much about penalized or marginal likelihoods ?

Marginal or penalized likelihoods are exploratory tools in a statistical analysis, as one is trying to define a reasonable model to fit the data. An unpleasant feature of these tools is that they provide numbers which in themselves do not have much meaning and can only be used in comparison with others and without any notion of uncertainty attached to them. A somewhat richer approach of exploratory analysis is to interrogate the posterior distributions by either varying the priors or by varying the loss functions. The former has been proposed in van Havre et l. (2016) in mixture models using the prior tempering algorithm. The latter has been used for instance by Yau and Holmes (2013) for segmentation based on Hidden Markov models. Introducing a decision-analytic perspective in the construction of information criteria sounds to us like a reasonable requirement, especially when accounting for the current surge in studies of such aspects.

[Posted as arXiv:1610.02503]