Archive for Jeffreys-Lindley paradox

Bayes Factors for Forensic Decision Analyses with R [book review]

Posted in Books, R, Statistics on November 28, 2022 by xi'an

My friend EJ Wagenmakers pointed me towards an entire book on the BF by Bozza (from Ca’Foscari, Venezia), Taroni and Biedermann. It provides a sort of blueprint for using Bayes factors in forensics for both investigative and evaluative purposes. With R code and free access. I am of course unable to judge the relevance of the approach for forensic science (I was under the impression that Bayesian arguments were usually not well-received in the courtroom) but find that overall the approach is rather one of repositioning the standard Bayesian tools within a forensic framework.

“The [evaluative] purpose is to assign a value to the result of a comparison between an item of unknown source and an item from a known source.”

And thus I found nothing shocking or striking from this standard presentation of Bayes factors, including the call to loss functions, if a bit overly expansive in its exposition. The style is also classical, with a choice of grey background vignettes for R coding parts that we also picked in our R books! If anything, I would have expected more realistic discussions and illustrations of prior specification across the hypotheses (see e.g. page 34), while the authors are mostly centering on conjugate priors and the (de Finetti) trick of the equivalent prior sample size. Bayes factors are mostly assessed using a conservative version of Jeffreys’ “scale of evidence”. The computational section of the book introduces MCMC (briefly) and mentions importance sampling, harmonic mean (with a minimalist warning), and Chib’s formula (with no warning whatsoever).
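To make the conjugate setting concrete, here is a minimal sketch of a Bayes factor for a binomial proportion under two competing Beta priors, each specified through a prior mean and an equivalent prior sample size. The function names and the numerical values are illustrative assumptions, not taken from the book.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_from_mean_and_size(mean, size):
    """Beta(a, b) parameters with a given prior mean and equivalent sample size a + b."""
    return mean * size, (1.0 - mean) * size


def log_marginal(x, n, a, b):
    """Log marginal likelihood of x successes in n trials under a Beta(a, b) prior.

    The binomial coefficient is omitted since it cancels in the Bayes factor.
    """
    return log_beta(a + x, b + n - x) - log_beta(a, b)


def bayes_factor(x, n, prior1, prior2):
    """Bayes factor of hypothesis 1 (prior1) against hypothesis 2 (prior2)."""
    return exp(log_marginal(x, n, *prior1) - log_marginal(x, n, *prior2))


# Hypothetical priors: mean 0.2 versus mean 0.6, each worth 10 pseudo-observations.
prior_h1 = beta_from_mean_and_size(0.2, 10)
prior_h2 = beta_from_mean_and_size(0.6, 10)
print(bayes_factor(x=3, n=20, prior1=prior_h1, prior2=prior_h2))
```

Increasing the equivalent sample size makes each prior more concentrated around its mean, which is one way to see how strongly the resulting Bayes factor depends on this choice.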

“The [investigative] purpose is to provide information in investigative proceedings (…) The scientist (…) uses the findings to generate hypotheses and suggestions for explanations of observations, in order to give guidance to investigators or litigants.”

Chapter 2 is about standard models: inferring about a proportion, with some Monte Carlo illustration and the complication of background elements, then a normal mean, with an improper prior making an appearance [on p.69] with no mention being made of the general prohibition of such generalised priors when using Bayes factors or even of the Lindley-Jeffreys paradox. Again, the main difference with Bayesian textbooks lies in the chosen examples.
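For readers unfamiliar with the issue, the following sketch illustrates the Lindley-Jeffreys paradox in the simplest normal mean setting: a point null θ=0 against a N(0, τ²) prior under the alternative, with known sampling variance. The setting and the numbers are assumptions chosen for illustration, not an example from the book.

```python
from math import exp, log, pi


def log_normal_pdf(x, var):
    """Log density at x of a centred normal distribution with variance var."""
    return -0.5 * (log(2.0 * pi * var) + x * x / var)


def bf01(xbar, n, sigma2, tau2):
    """Bayes factor of H0: theta = 0 against H1: theta ~ N(0, tau2).

    Under H0 the sample mean is N(0, sigma2 / n);
    under H1 it is marginally N(0, tau2 + sigma2 / n).
    """
    return exp(log_normal_pdf(xbar, sigma2 / n) - log_normal_pdf(xbar, tau2 + sigma2 / n))


# A sample mean two standard errors away from zero (z = 2).
xbar, n, sigma2 = 0.2, 100, 1.0
for tau2 in (1.0, 1e2, 1e4, 1e6):
    print(tau2, bf01(xbar, n, sigma2, tau2))
```

With the data held fixed, the Bayes factor in favour of the null grows without bound as the prior variance τ² increases, which is why the flat (improper) limit cannot be used for this test.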

Chapter 3 focuses on evidence evaluation [not in the computational sense] but, again, the coverage is about standard models: processing the Binomial, multinomial, Poisson models, again through conjugates. (With the side remark that Fig 3.2 is rather unhelpful: when moving the prior probability of the null from zero to one, its posterior probability also moves from zero to one!) We are back to the Normal mean case with the model variance being known then unknown. (An unintentionally funny remark (p.96) about the dependence between mean and variance being seen as too restrictive and replaced with… independence!) At last (for me!), the book is pointing out [p.99] that the BF is highly sensitive to the choice of the prior variance (Lindley-Jeffreys, where art thou?!), but with a return of the improper prior (on said variance, p.102) with no debate on the ensuing validity of the BF. Multivariate Normals are also presented, with Wishart priors on the precision matrix, and more details about Chib’s estimate of the evidence. This chapter also contains illustrations of the so-called score-based BF which is simply (?) a Bayes factor using a distribution on a distance summary (between a hypothetical population and the data) and an approximation of the distributions of these summaries, provided enough data is available… I also spotted a potentially interesting foray into BF variability (Section 3.4.2), although not reaching all the way to a notion of BF posterior distributions.
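The side remark on Fig 3.2 follows directly from the relation between prior and posterior odds: for any fixed Bayes factor, the posterior probability of the null is an increasing function of its prior probability, running from zero to one. A minimal sketch (the function name and the grid are assumptions):

```python
def posterior_prob_null(prior_prob, bf01):
    """Posterior probability of the null given its prior probability and the Bayes factor B01."""
    return prior_prob * bf01 / (prior_prob * bf01 + 1.0 - prior_prob)


grid = [i / 10 for i in range(11)]
for bf in (0.1, 1.0, 10.0):
    print(bf, [round(posterior_prob_null(p, bf), 3) for p in grid])
```

Whatever the value of the Bayes factor, the curve starts at zero and ends at one, so only its curvature carries information about the data.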

Chapter 4 stands for Bayes factors for investigation, where alternative(s) is(are) less specified, as testing e.g. Basmati rice vs non-Basmati rice. But there is no non-parametric alternative considered in the book. Otherwise, it looks to me rather similar to Chapter 3, i.e. being back to binomial, multinomial models, with more discussions on prior specification, more normal, or non-normal model, where the prior distribution is puzzlingly estimated by a kernel density estimator, a portmanteau alternative (p.157), more multivariate Normals with Wishart priors and an entry on classification & discrimination.

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE. As appropriate for a book about Chance!]

false confidence, not fake news!

Posted in Books, Statistics on May 28, 2021 by xi'an

“…aerospace researchers have recognized a counterintuitive phenomenon in satellite conjunction analysis, known as probability dilution. That is, as uncertainty in the satellite trajectories increases, the epistemic probability of collision eventually decreases. Since trajectory uncertainty is driven by errors in the tracking data, the seemingly absurd implication of probability dilution is that lower quality data reduce the risk of collision.”

In 2019, Balch, Martin, and Ferson published a false confidence theorem [false confidence, not false theorem!] in the Proceedings of the Royal [astatistical] Society, motivated by satellite conjunction (i.e., fatal encounter) analysis. But discussing in fine the very meaning of a confidence statement. And returning to the century-old opposition between randomness and epistemic uncertainty, aleatory versus epistemic probabilities.

“…the counterintuitiveness of probability dilution calls this [use of epistemic probability] into question, especially considering [its] unsettled status in the statistics and uncertainty quantification communities.”

The practical aspect of the paper is unclear in that the opposition of aleatory versus epistemic probabilities does not really apply when the model connecting the observables with the position of the satellites is unknown. And replaced with a stylised parametric model. When ignoring this aspect of uncertainty, the debate is mostly moot.

“…the problem with probability dilution is not the mathematics (…) if (…)  inappropriate, that inappropriateness must be rooted in a mismatch between the mathematics of probability theory and the epistemic uncertainty to which they are applied in conjunction analysis.”

The probability dilution phenomenon as described in the paper is that, when (posterior) uncertainty increases, the posterior probability of collision eventually decreases, which makes sense since poor precision implies the observed distance is less trustworthy and the satellite could be anywhere. To conclude that increasing the prior or epistemic uncertainty makes the satellites safer from collision is thus fairly absurd as it only concerns the confidence in the statement that there will be a collision. But I agree with the conclusion that the statement of a low posterior probability is a misleading risk metric because, just like p-values, it is a.s. taken at face value. Bayes factors do relativise this statement [but are not mentioned in the paper]. But with the spectre of the Lindley-Jeffreys paradox looming in the background.
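Probability dilution can be reproduced in a one-dimensional toy version of the problem, which is an assumption made here for simplicity and not the model of the paper: the relative position of the two satellites is given a N(d, σ²) distribution centred at the observed miss distance d, and a collision means a relative position within a radius R of zero.

```python
from math import erf, sqrt


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def collision_probability(d, sigma, radius):
    """P(|X| < radius) when X ~ N(d, sigma^2), a one-dimensional toy model."""
    return norm_cdf((radius - d) / sigma) - norm_cdf((-radius - d) / sigma)


d, radius = 10.0, 1.0  # hypothetical observed miss distance and collision radius
sigmas = [2.0, 5.0, 10.0, 50.0, 100.0, 1000.0]
probs = [collision_probability(d, s, radius) for s in sigmas]
for s, p in zip(sigmas, probs):
    print(s, p)
```

With these values, the collision probability first rises with σ, peaks when σ is close to the miss distance, and then decays towards zero as the uncertainty keeps growing, which is the dilution effect.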

The authors’ notion of false confidence is formally a highly probable [in the sample space] report of a high belief in a subset A of the parameter set when the true parameter does not belong to A. Which holds for all epistemic probabilities in the sense that there always exists such a set A. A theorem that I see as related to the fact that integrating an epistemic probability statement [conditional on the data x] wrt the true sampling distribution [itself conditional on the parameter θ] is not coherent from a probabilistic standpoint. The resolution of the paradox follows a principle set by Ryan Martin and Chuanhai Liu, such that “it is almost a tautology that a statistical approach satisfying this criterion will not suffer from the severe false confidence phenomenon”, although it sounds to me that this is a weak patch on a highly perforated tyre, the erroneous interpretation of probabilistic statements as frequentist ones.

demystify Lindley’s paradox [or not]

Posted in Statistics on March 18, 2020 by xi'an

Another paper on Lindley’s paradox appeared on arXiv yesterday, by Guosheng Yin and Haolun Shi, interpreting posterior probabilities as p-values. The core of this resolution is to express a two-sided hypothesis as a combination of two one-sided hypotheses along the opposite direction, taking then advantage of the near equivalence of posterior probabilities under some non-informative prior and p-values in the latter case. As already noted by George Casella and Roger Berger (1987) and presumably earlier. The point is that one-sided hypotheses are quite friendly to improper priors, since they only require a single prior distribution. Rather than two when point nulls are under consideration. The p-value created by merging both one-sided hypotheses makes little sense to me as it means testing that both θ≥0 and θ≤0, resulting in the proposal of a p-value that is twice the minimum of the one-sided p-values, maybe due to a Bonferroni correction, although the true value should be zero… I thus see little support for this approach to resolving Lindley’s paradox in that it bypasses the toxic nature of point-null hypotheses that require a change of prior toward a mixture supporting one hypothesis and the other. Here the posterior of the point-null hypothesis is defined in exactly the same way the p-value is defined, hence making the outcome most favourable to the agreement but not truly addressing the issue.
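The equivalence in the one-sided case is easy to check numerically in the normal mean model with known variance and a flat prior on θ, an assumed setting that is not necessarily the one of the paper: the posterior probability of θ≤0 coincides with the p-value for testing θ≤0, and the merged quantity is twice the smaller of the two one-sided values.

```python
from math import erf, sqrt


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def one_sided_pvalues(xbar, n, sigma):
    """p-values for H0: theta <= 0 and for H0: theta >= 0, in that order."""
    z = xbar * sqrt(n) / sigma
    return 1.0 - norm_cdf(z), norm_cdf(z)


def flat_prior_posteriors(xbar, n, sigma):
    """Posterior probabilities of theta <= 0 and theta >= 0 under a flat prior.

    The posterior is theta | xbar ~ N(xbar, sigma^2 / n).
    """
    se = sigma / sqrt(n)
    post_nonpos = norm_cdf((0.0 - xbar) / se)
    return post_nonpos, 1.0 - post_nonpos


def merged_two_sided(xbar, n, sigma):
    """Twice the minimum of the two one-sided p-values."""
    return 2.0 * min(one_sided_pvalues(xbar, n, sigma))


xbar, n, sigma = 0.196, 100, 1.0  # z = 1.96
print(one_sided_pvalues(xbar, n, sigma))
print(flat_prior_posteriors(xbar, n, sigma))
print(merged_two_sided(xbar, n, sigma))
```

The agreement is thus built in by construction, and no point mass at θ=0 ever enters the computation.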

logic (not logistic!) regression

Posted in Books, Statistics, University life on February 12, 2020 by xi'an

A Bayesian Analysis paper by Aliaksandr Hubin, Geir Storvik, and Florian Frommlet on Bayesian logic regression was open for discussion. Here are some hasty notes I made during our group discussion in Paris Dauphine (and later turned into a discussion submitted to Bayesian Analysis):

“Originally logic regression was introduced together with likelihood based model selection, where simulated annealing served as a strategy to obtain one “best” model.”

Indeed, logic regression is not to be confused with logistic regression! Rejection of a true model in Bayesian model choice leads to Bayesian model choice and… apparently to Bayesian logic regression. The central object of interest is a generalised linear model based on a vector of binary covariates and using some if not all possible logical combinations (trees) of said covariates (leaves). The GLM is further using rather standard indicators to signify whether or not some trees are included in the regression (and hence the model). The prior modelling on the model indices sounds rather simple (simplistic?!) in that it is only a function of the number of active trees, leading to an automated penalisation of larger trees and not accounting for a possible specificity of some covariates. For instance when dealing with imbalanced covariates (many more 1s than 0s, say).

A first question is thus how much of a novel model this is when compared with say an analysis of variance since all covariates are dummy variables. Culling the number of trees away from the exponential of exponential number of possible covariates remains obscure but, without it, the model is nothing but variable selection in GLMs, except for “enjoying” a massive number of variables. Note that there could be a connection with variable length Markov chain models but it is not exploited there.
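To give a sense of scale for the "exponential of exponential" count, m binary covariates admit 2^(2^m) distinct Boolean functions, each of which is a potential derived covariate. The sketch below counts them and evaluates one made-up logic tree over all configurations of three covariates; the tree is purely hypothetical and not one from the paper.

```python
from itertools import product


def n_boolean_functions(m):
    """Number of distinct Boolean functions of m binary covariates."""
    return 2 ** (2 ** m)


def example_tree(x1, x2, x3):
    """Hypothetical logic tree: (x1 AND x2) OR (NOT x3), returned as 0 or 1."""
    return int((x1 and x2) or (not x3))


# The derived binary covariate evaluated on every configuration of the three leaves.
design = [(x, example_tree(*x)) for x in product((0, 1), repeat=3)]
for leaves, tree_value in design:
    print(leaves, tree_value)

for m in (2, 3, 4, 5):
    print(m, n_boolean_functions(m))
```

Each such tree enters the GLM as one more dummy variable, hence the comparison with plain variable selection.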

“…using Jeffrey’s prior for model selection has been widely criticized for not being consistent once the true model coincides with the null model.”

A second point that strongly puzzles me in the paper is its loose handling of improper priors. It is well-known that improper priors are at worst fishy in model choice settings and at best avoided altogether, to wit the Lindley-Jeffreys paradox and friends. Not only does the paper adopt the notion of a same, improper, prior on the GLM scale parameter, which is a position adopted in some of the Bayesian literature, but it also seems to be using an improper prior on each set of parameters (further undifferentiated between models). Because the priors operate on different (sub)sets of parameters, I think this jeopardises the later discourse on the posterior probabilities of the different models since they are not meaningful from a probabilistic viewpoint, with no joint distribution as a reference, nor marginal density. In some cases, p(y|M) may become infinite. Referring to a “simple Jeffrey’s” prior in this setting is therefore anything but simple as Jeffreys (1939) himself shied away from using improper priors on the parameter of interest. I find it surprising that this fundamental and well-known difficulty with improper priors in hypothesis testing is not even alluded to in the paper. Its core setting thus seems to be flawed. Now, the numerical comparison between Jeffrey’s [sic] prior and a regular g-prior exhibits close proximity and I thus wonder at the reason. Could it be that the culling and selection processes end up having the same number of variables and thus eliminate the impact of the prior? Or is it due to the recourse to a Laplace approximation of the marginal likelihood that completely escapes the lack of definition of the said marginal? Computing the normalising constant and repeating this computation while the algorithm is running ignores the central issue.

“…hereby, all states, including all possible models of maximum sized, will eventually be visited.”

Further, I found some confusion between principles and numerics. And as usual bemoan the acronym inflation with the appearance of a GMJMCMC! Where G stands for genetic (algorithm), MJ for mode jumping, and MCMC for…, well no surprise there! I was not aware of the mode jumping algorithm of Hubin and Storvik (2018), so cannot comment on the very starting point of the paper. A fundamental issue with Markov chains on discrete spaces is that the notion of neighbourhood becomes quite fishy and is highly dependent on the nature of the covariates. And the Markovian aspects are unclear because of the self-avoiding aspect of the algorithm. The novel algorithm is intricate and as such seems to require a superlative amount of calibration. Are all modes truly visited, really? (What are memetic algorithms?!)

O’Bayes 19/1 [snapshots]

Posted in Books, pictures, Statistics, University life on June 30, 2019 by xi'an

The tutorials of O’Bayes 2019 yesterday were poorly attended, despite being great entries into objective Bayesian model choice, recent advances in MCMC methodology, and the multiple layers of BART. I have to blame myself for this, for sticking the beginning of O’Bayes too closely to the end of BNP, as only the most dedicated could manage the commute from Oxford to Coventry to reach Warwick in time. The first day of talks was however well attended, despite weekend commitments, conference fatigue, and perfect summer weather! Here are some snapshots from my bench (and apologies for not covering better the more theoretical talks I had trouble following, due to an early and intense morning swimming lesson! Like Steve Walker’s utility-based derivation of priors that generalise maximum entropy priors. But being entirely independent from the model does not sound to me like such a desirable feature… And Natalia Bochkina’s Bernstein-von Mises theorem for a location-scale semi-parametric model, including a clever construct of a mixture of two Dirichlet priors to achieve proper convergence.)

Jim Berger started the day with a talk on imprecise probabilities, involving the society for imprecise probability, which I discovered while reading Keynes’ book. The talk offered a neat resolution of the Jeffreys-Lindley paradox: when re-expressing the null as an imprecise null, the posterior probability of the null no longer converges to one, with a limit depending on the prior modelling, if involving a prior on the bias as well. Chris discussed the talk and mentioned a recent work with Edwin Fong on reinterpreting marginal likelihood as exhaustive X validation, summing over all possible subsets of the data [using log marginal predictive].

Håvard Rue did a follow-up talk from his Valencià O’Bayes 2015 talk on PC-priors. With a pretty hilarious introduction on his difficulties with constructing priors and counseling students about their Bayesian modelling. With a list of principles and desiderata to define a reference prior. However, I somewhat disagree with his argument that the Kullback-Leibler distance from the simpler (base) model cannot be scaled, as it is essentially a log-likelihood. And it feels like multivariate parameters need some sort of separability to define distance(s) to the base model since the distance somewhat summarises the whole departure from the simpler model. (Håvard also joined my achievement of putting an ostrich in a slide!) In his discussion, Robin Ryder made a very pragmatic recap on the difficulties with constructing priors. And pointing out a natural link with ABC (which brings us back to Don Rubin’s motivation for introducing the algorithm as a formal thought experiment).
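The reinterpretation of the marginal likelihood as exhaustive cross-validation mentioned in the discussion of Jim Berger's talk can be checked by brute force on a toy Beta-Bernoulli model, which is an assumption made here for illustration. As I understand the identity, the log marginal likelihood equals the sum over p of the leave-p-out scores, each score averaging the log posterior predictive of the held-out points over all test sets of size p.

```python
from itertools import combinations
from math import comb, lgamma, log


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(y, a=1.0, b=1.0):
    """Log marginal likelihood of a binary sequence under a Beta(a, b) prior."""
    s, n = sum(y), len(y)
    return log_beta(a + s, b + n - s) - log_beta(a, b)


def log_predictive(y_new, train, a=1.0, b=1.0):
    """Log posterior predictive of one binary observation given the training points."""
    p_one = (a + sum(train)) / (a + b + len(train))
    return log(p_one if y_new == 1 else 1.0 - p_one)


def leave_p_out_score(y, p, a=1.0, b=1.0):
    """Average over all test sets of size p of the mean held-out log predictive."""
    n = len(y)
    total = 0.0
    for test_idx in combinations(range(n), p):
        held_out = set(test_idx)
        train = [y[i] for i in range(n) if i not in held_out]
        total += sum(log_predictive(y[j], train, a, b) for j in test_idx) / p
    return total / comb(n, p)


y = [1, 0, 1, 1, 0, 1]  # made-up data
cv_sum = sum(leave_p_out_score(y, p) for p in range(1, len(y) + 1))
print(cv_sum, log_marginal(y))
```

The enumeration is exponential in the sample size, so this is only a sanity check on a handful of observations, not a computational proposal.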

Sara Wade gave the final talk of the day about her work on Bayesian cluster analysis. Which discussion in Bayesian Analysis I alas missed. Cluster estimation, as mentioned frequently on this blog, is a rather frustrating challenge despite the simple formulation of the problem. (And I will not mention Larry’s tequila analogy!) The current approach is based on loss functions directly addressing the clustering aspect, integrating out the parameters. Which produces the interesting notion of neighbourhoods of partitions and hence credible balls in the space of partitions. It still remains unclear to me that cluster estimation is at all achievable, since the partition space explodes with the sample size and hence makes the most probable cluster more and more unlikely in that space. Somewhat paradoxically, the paper concludes that estimating the cluster produces a more reliable estimator of the number of clusters than looking at the marginal distribution on this number. In her discussion, Clara Grazian also pointed out the ambivalent use of clustering, where the intended meaning somehow diverges from the meaning induced by the mixture model.
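The explosion of the partition space is quantified by the Bell numbers, which count the partitions of a set of n items. A short sketch computing them with the Bell triangle (the function name is an assumption):

```python
def bell_numbers(n_max):
    """Return the list [B_0, ..., B_n_max] of Bell numbers, via the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row,
        # and every further entry adds the entry above-left in the triangle.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


for n, b in enumerate(bell_numbers(20)):
    print(n, b)
```

Already for a few dozen observations the number of partitions dwarfs any feasible number of MCMC iterations, so the posterior mass of any single partition is bound to be minute.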
