Archive for Jeffreys-Lindley paradox

demystify Lindley’s paradox [or not]

Posted in Statistics on March 18, 2020 by xi'an

Another paper on Lindley’s paradox appeared on arXiv yesterday, by Guosheng Yin and Haolun Shi, interpreting posterior probabilities as p-values. The core of this resolution is to express a two-sided hypothesis as a combination of two one-sided hypotheses in opposite directions, then taking advantage of the near equivalence between p-values and posterior probabilities under some non-informative prior in the latter case. As already noted by George Casella and Roger Berger (1987), and presumably earlier. The point is that one-sided hypotheses are quite friendly to improper priors, since they only require a single prior distribution, rather than two when point nulls are under consideration. The p-value created by merging both one-sided hypotheses makes little sense to me, as it means testing that both θ≥0 and θ≤0, resulting in the proposal of a p-value that is twice the minimum of the one-sided p-values, maybe due to a Bonferroni correction, although the true value should be zero… I thus see little support for this approach to resolving Lindley’s paradox, in that it bypasses the toxic nature of point-null hypotheses, which require a change of prior toward a mixture supporting one hypothesis and the other. Here the posterior of the point-null hypothesis is defined in exactly the same way the p-value is defined, hence making the outcome most favourable to the agreement but not truly addressing the issue.
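To make the one-sided equivalence concrete, here is a minimal sketch in the simplest setting I can think of, which is my choice and not taken from the paper: a normal mean θ with known standard deviation σ and a flat (improper) prior, so that θ given the data is N(x̄, σ²/n). The function names are hypothetical. The one-sided p-value for θ≤0 then coincides with the posterior probability of θ≤0, and the merged p-value is twice the smaller of the two one-sided values.

```python
from math import erf, sqrt


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def one_sided_summaries(xbar, n, sigma=1.0):
    """Compare one-sided p-values with flat-prior posterior probabilities.

    Assumed setting (illustration only): x_1..x_n iid N(theta, sigma^2)
    with sigma known, and a flat prior on theta.
    """
    z = sqrt(n) * xbar / sigma
    # Frequentist side: p-values of the two one-sided tests.
    p_null_le = 1.0 - norm_cdf(z)   # H0: theta <= 0 against theta > 0
    p_null_ge = norm_cdf(z)         # H0: theta >= 0 against theta < 0
    # Bayesian side: posterior is N(xbar, sigma^2 / n) under the flat prior.
    post_le = norm_cdf((0.0 - xbar) / (sigma / sqrt(n)))  # P(theta <= 0 | x)
    post_ge = 1.0 - post_le                               # P(theta >= 0 | x)
    return {
        "p_null_le": p_null_le,
        "p_null_ge": p_null_ge,
        "post_le": post_le,
        "post_ge": post_ge,
        # The merged two-sided value: twice the minimum one-sided p-value.
        "two_sided": 2.0 * min(p_null_le, p_null_ge),
    }


summary = one_sided_summaries(xbar=0.196, n=100)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Nothing in this computation involves a prior mass on θ=0, which is why the agreement between both sides says little about the point-null case.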

logic (not logistic!) regression

Posted in Books, Statistics, University life on February 12, 2020 by xi'an

A Bayesian Analysis paper by Aliaksandr Hubin, Geir Storvik, and Florian Frommlet on Bayesian logic regression was open for discussion. Here are some hasty notes I made during our group discussion in Paris Dauphine (and later turned into a discussion submitted to Bayesian Analysis):

“Originally logic regression was introduced together with likelihood based model selection, where simulated annealing served as a strategy to obtain one “best” model.”

Indeed, logic regression is not to be confused with logistic regression! Rejection of a true model in Bayesian model choice leads to Bayesian model choice and… apparently to Bayesian logic regression. The central object of interest is a generalised linear model based on a vector of binary covariates and using some, if not all, possible logical combinations (trees) of said covariates (leaves). The GLM further uses rather standard indicators to signify whether or not some trees are included in the regression (and hence the model). The prior modelling on the model indices sounds rather simple (simplistic?!) in that it is only a function of the number of active trees, leading to an automated penalisation of larger trees and not accounting for a possible specificity of some covariates, for instance when dealing with imbalanced covariates (many more 1s than 0s, say).

A first question is thus how much of a novel model this is when compared with, say, an analysis of variance, since all covariates are dummy variables. How the number of trees is culled from the doubly exponential number of possible covariates remains obscure but, without it, the model is nothing but variable selection in GLMs, except for “enjoying” a massive number of variables. Note that there could be a connection with variable length Markov chain models, but it is not exploited there.
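As a rough illustration of the size of the problem, and not of the authors' construction, the following sketch evaluates one hypothetical logic tree on three binary covariates and counts the distinct Boolean functions of p binary covariates, which is 2^(2^p). The tree and the function names are made up for the example.

```python
from itertools import product


def n_boolean_functions(p):
    """Number of distinct Boolean functions of p binary covariates: 2^(2^p)."""
    return 2 ** (2 ** p)


def example_tree(x1, x2, x3):
    """A hypothetical logic tree: (x1 AND NOT x2) OR x3, returned as 0/1."""
    return int((x1 and not x2) or x3)


def truth_table(tree, p):
    """Evaluate a tree on all 2^p configurations of p binary covariates."""
    return [tree(*bits) for bits in product((0, 1), repeat=p)]


# The tree acts as a single derived dummy variable in the regression.
print(truth_table(example_tree, 3))

# Candidate derived covariates when every Boolean function is allowed.
for p in (2, 3, 4, 5):
    print(p, n_boolean_functions(p))
```

Each tree is just another dummy variable entering the GLM, hence the variable selection reading of the model once the set of trees is fixed.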

“…using Jeffrey’s prior for model selection has been widely criticized for not being consistent once the true model coincides with the null model.”

A second point that strongly puzzles me in the paper is its loose handling of improper priors. It is well-known that improper priors are at worst fishy in model choice settings and at best avoided altogether, to wit the Lindley-Jeffreys paradox and friends. Not only does the paper adopt the notion of a same, improper, prior on the GLM scale parameter, which is a position adopted in some of the Bayesian literature, but it also seems to be using an improper prior on each set of parameters (further undifferentiated between models). Because the priors operate on different (sub)sets of parameters, I think this jeopardises the later discourse on the posterior probabilities of the different models, since they are not meaningful from a probabilistic viewpoint, with no joint distribution as a reference, nor a marginal density. In some cases, p(y|M) may become infinite. Referring to a “simple Jeffrey’s” prior in this setting is therefore anything but simple, as Jeffreys (1939) himself shied away from using improper priors on the parameter of interest. I find it surprising that this fundamental and well-known difficulty with improper priors in hypothesis testing is not even alluded to in the paper. Its core setting thus seems to be flawed. Now, the numerical comparison between Jeffrey’s [sic] prior and a regular g-prior exhibits close proximity and I thus wonder at the reason. Could it be that the culling and selection processes end up having the same number of variables and thus eliminate the impact of the prior? Or is it due to the recourse to a Laplace approximation of the marginal likelihood that completely escapes the lack of definition of the said marginal? Computing the normalising constant and repeating this computation while the algorithm is running ignores the central issue.

“…hereby, all states, including all possible models of maximum sized, will eventually be visited.”

Further, I found some confusion between principles and numerics. And, as usual, I bemoan the acronym inflation with the appearance of a GMJMCMC! Where G stands for genetic (algorithm), MJ for mode jumping, and MCMC for…, well no surprise there! I was not aware of the mode jumping algorithm of Hubin and Storvik (2018), so cannot comment on the very starting point of the paper. A fundamental issue with Markov chains on discrete spaces is that the notion of neighbourhood becomes quite fishy and is highly dependent on the nature of the covariates. And the Markovian aspects are unclear because of the self-avoiding aspect of the algorithm. The novel algorithm is intricate and as such seems to require a superlative amount of calibration. Are all modes truly visited, really? (What are memetic algorithms?!)

O’Bayes 19/1 [snapshots]

Posted in Books, pictures, Statistics, University life on June 30, 2019 by xi'an

The tutorials of O’Bayes 2019 yesterday were poorly attended, despite being great entries into objective Bayesian model choice, recent advances in MCMC methodology, and the multiple layers of BART, for which I have to blame myself for sticking the beginning of O’Bayes too closely to the end of BNP, as only the most dedicated could manage the commute from Oxford to Coventry to reach Warwick in time. The first day of talks, however, was well attended, despite weekend commitments, conference fatigue, and perfect summer weather! Here are some snapshots from my bench (and apologies for not better covering the more theoretical talks I had trouble following, due to an early and intense morning swimming lesson! Like Steve Walker’s utility-based derivation of priors that generalise maximum entropy priors. But being entirely independent from the model does not sound to me like such a desirable feature… And Natalia Bochkina’s Bernstein-von Mises theorem for a location-scale semi-parametric model, including a clever construct of a mixture of two Dirichlet priors to achieve proper convergence.)

Jim Berger started the day with a talk on imprecise probabilities, involving the society for imprecise probability, which I discovered while reading Keynes’ book. The talk offered a neat resolution of the Jeffreys-Lindley paradox: when re-expressing the null as an imprecise null, the posterior of the null no longer converges to one, with a limit depending on the prior modelling, if involving a prior on the bias as well. Chris discussed the talk and mentioned a recent work with Edwin Fong on reinterpreting marginal likelihood as exhaustive cross-validation, summing over all possible subsets of the data [using log marginal predictive].

Håvard Rue did a follow-up to his Valencià O’Bayes 2015 talk on PC-priors, with a pretty hilarious introduction on his difficulties with constructing priors and counseling students about their Bayesian modelling, and with a list of principles and desiderata to define a reference prior. However, I somewhat disagree with his argument that the Kullback-Leibler distance from the simpler (base) model cannot be scaled, as it is essentially a log-likelihood. And it feels like multivariate parameters need some sort of separability to define distance(s) to the base model, since the distance somewhat summarises the whole departure from the simpler model. (Håvard also joined my achievement of putting an ostrich in a slide!) In his discussion, Robin Ryder made a very pragmatic recap of the difficulties with constructing priors, pointing out a natural link with ABC (which brings us back to Don Rubin’s motivation for introducing the algorithm as a formal thought experiment).

Sara Wade gave the final talk of the day, about her work on Bayesian cluster analysis, whose discussion in Bayesian Analysis I alas missed. Cluster estimation, as mentioned frequently on this blog, is a rather frustrating challenge despite the simple formulation of the problem. (And I will not mention Larry’s tequila analogy!) The current approach is based on loss functions directly addressing the clustering aspect, integrating out the parameters. Which produces the interesting notion of neighbourhoods of partitions and hence credible balls in the space of partitions. It still remains unclear to me that cluster estimation is at all achievable, since the partition space explodes with the sample size and hence makes the most probable cluster more and more unlikely in that space. Somewhat paradoxically, the paper concludes that estimating the cluster produces a more reliable estimator of the number of clusters than looking at the marginal distribution on this number. In her discussion, Clara Grazian also pointed out the ambivalent use of clustering, where the intended meaning somehow diverges from the meaning induced by the mixture model.
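To put numbers on the explosion of the partition space, the number of partitions of n observations is the Bell number B_n. The sketch below computes it with the Bell triangle; the function name is mine, and the uniform distribution over partitions is only a reference point, not a model from the paper.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max], the Bell numbers, via the Bell triangle."""
    bells = [1]   # B_0 = 1 (one partition of the empty set)
    row = [1]     # first row of the Bell triangle
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row,
        # and each further entry adds the entry above-left of it.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


bells = bell_numbers(30)
for n in (5, 10, 20, 30):
    # Under a uniform distribution, each partition has probability 1 / B_n.
    print(f"n={n}: B_n={bells[n]}, uniform mass per partition={1 / bells[n]:.3e}")
```

Already for a few dozen observations, any single partition carries a vanishing share of a diffuse distribution, which is the difficulty with a "most probable" clustering.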

a resolution of the Jeffreys-Lindley paradox

Posted in Books, Statistics, University life on April 24, 2019 by xi'an

“…it is possible to have the best of both worlds. If one allows the significance level to decrease as the sample size gets larger (…) there will be a finite number of errors made with probability one. By allowing the critical values to diverge slowly, one may catch almost all the errors.” (p.1527)

When commenting on another post, Michael Naaman pointed out to me his 2016 Electronic Journal of Statistics paper where he resolves the Jeffreys-Lindley paradox. The argument there is to consider a Type I error going to zero with the sample size n going to infinity, but slowly enough for both Type I and Type II errors to go to zero, and to guarantee a finite number of errors as the sample size n grows to infinity. For the Jeffreys-Lindley paradox, this translates into a pivotal quantity within the posterior probability of the null that diverges with n, so that this posterior probability converges to zero with n going to infinity. Which makes it (most) agreeable with the Type I error going to zero. Except that there is little reason to assume this pivotal quantity goes to infinity with n, when its distribution remains constant in n. Being constant is less unrealistic, by comparison! That there exists a hypothetical sequence of observations such that the p-value and the posterior probability agree, even exactly, does not “solve” the paradox in my opinion.
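For reference, the paradox itself is easy to reproduce numerically. The sketch below uses the textbook normal setting, which is my choice and not necessarily the one of the paper: a point null θ=0 against a N(0, τ²) prior under the alternative, with known σ, in which case the Bayes factor in favour of the null only depends on the data through z=√n x̄/σ. Function names and default values are hypothetical.

```python
from math import exp, sqrt


def bayes_factor_01(z, n, sigma=1.0, tau=1.0):
    """Bayes factor of H0: theta = 0 against H1: theta ~ N(0, tau^2).

    Assumes xbar ~ N(theta, sigma^2 / n) and z = sqrt(n) * xbar / sigma, so
    B01 = sqrt(1 + r) * exp(-z^2 / 2 * r / (1 + r)) with r = n tau^2 / sigma^2.
    """
    r = n * tau**2 / sigma**2
    return sqrt(1.0 + r) * exp(-0.5 * z**2 * r / (1.0 + r))


def posterior_null(z, n, prior_null=0.5, sigma=1.0, tau=1.0):
    """Posterior probability of the point null for a given prior weight."""
    odds = prior_null / (1.0 - prior_null) * bayes_factor_01(z, n, sigma, tau)
    return odds / (1.0 + odds)


# Keep the pivotal quantity fixed at the 5% two-sided boundary, let n grow.
for n in (10, 100, 10_000, 1_000_000):
    print(n, round(posterior_null(1.96, n), 4))
```

With z held fixed, the p-value stays put while the posterior probability of the null creeps towards one; getting that probability to zero instead requires z to grow with n.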

statistics with improper posteriors [or not]

Posted in Statistics on March 6, 2019 by xi'an

Last December, Gunnar Taraldsen, Jarle Tufto, and Bo H. Lindqvist arXived a paper on using priors that lead to improper posteriors and [trying to] getting away with it! The central concept in their approach is Rényi’s generalisation of Kolmogorov’s version to define conditional probability distributions from infinite mass measures by conditioning on finite mass measurable sets. A position adopted by Dennis Lindley in his 1964 book. And already discussed in a few ‘Og’s posts. While the theory thus developed indeed allows for the manipulation of improper posteriors, I have difficulties with the inferential aspects of the construct, since one cannot condition on an arbitrary finite measurable set without prior information. Things get a wee bit more outwardly when considering “data” with infinite mass, in Section 4.2, since they cannot be properly normalised (although I find the example of the degenerate multivariate Gaussian distribution puzzling as it is not a matter of improperness, since the degenerate Gaussian has a well-defined density against the right dominating measure). The paper also discusses marginalisation paradoxes, by acknowledging that marginalisation is no longer feasible with improper quantities. And the Jeffreys-Lindley paradox, with a resolution that uses the sum of the Dirac mass at the null, δ⁰, and of the Lebesgue measure on the real line, λ, as the dominating measure. This indeed solves the issue of the arbitrary constant in the Bayes factor, since it is “the same” on the null hypothesis and elsewhere, but I do not buy the argument, as I see no reason to favour δ⁰+λ over 3.141516 δ⁰+λ or δ⁰+1.61718 λ… (This section 4.5 also illustrates that the choice of the sequence of conditioning sets has an impact on the limiting measure, in the Rényi sense.) In conclusion, after reading the paper, I remain uncertain as to how to exploit this generalisation from an inferential (Bayesian?) viewpoint, since improper posteriors do not clearly lead to well-defined inferential procedures…
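The dependence on the dominating measure can be checked on a toy case. This is my own sketch and not an example from the paper: a normal mean with known σ, and a constant prior density equal to one with respect to c δ⁰+λ, where c is a hypothetical weight on the Dirac mass. The posterior mass of the null is then c f(x̄|0) / (c f(x̄|0) + 1), since the likelihood integrates to one in θ against λ.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, sd):
    """Density of the N(mean, sd^2) distribution at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2.0 * pi))


def posterior_null_flat(xbar, n, c=1.0, sigma=1.0):
    """Posterior mass of theta = 0 under a unit density wrt c*delta_0 + lambda.

    Assumes xbar ~ N(theta, sigma^2 / n). The integral of the likelihood
    over theta against Lebesgue measure equals one in this location model.
    """
    like_null = normal_pdf(xbar, 0.0, sigma / sqrt(n))
    return c * like_null / (c * like_null + 1.0)


# Same data, three equally arbitrary weights on the Dirac mass.
for c in (1.0, 3.141516, 1.0 / 1.61718):
    print(f"c={c:.5f}: P(theta=0 | x)={posterior_null_flat(0.2, 100, c=c):.4f}")
```

The answer moves with c, and nothing in the data or the model singles out c=1.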