## Bayes Factors for Forensic Decision Analyses with R [book review]

Posted in Books, R, Statistics with tags , , , , , , , , , , , , , on November 28, 2022 by xi'an

My friend EJ Wagenmaker pointed me towards an entire book on the BF by Bozza (from Ca’Foscari, Venezia), Taroni and Biederman. It is providing a sort of blueprint for using Bayes factors in forensics for both investigative and evaluative purposes. With R code and free access. I am of course unable to judge of the relevance of the approach for forensic science (I was under the impression that Bayesian arguments were usually not well-received in the courtroom) but find that overall the approach is rather one of repositioning the standard Bayesian tools within a forensic framework.

“The [evaluative] purpose is to assign a value to the result of a comparison between an item of unknown source and an item from a known source.”

And thus I found nothing shocking or striking from this standard presentation of Bayes factors, including the call to loss functions, if a bit overly expansive in its exposition. The style is also classical, with a choice of grey background vignettes for R coding parts that we also picked in our R books! If anything, I would have expected more realistic discussions and illustrations of prior specification across the hypotheses (see e.g. page 34), while the authors are mostly centering on conjugate priors and the (de Finetti) trick of the equivalent prior sample size. Bayes factors are mostly assessed using a conservative version of Jeffreys’ “scale of evidence”. The computational section of the book introduces MCMC (briefly) and mentions importance sampling, harmonic mean (with a minimalist warning), and Chib’s formula (with no warning whatsoever).

“The [investigative] purpose is to provide information in investigative proceedings (…) The scientist (…) uses the findings to generate hypotheses and suggestions for explanations of observations, in order to give guidance to investigators or litigants.”

Chapter 2 is about standard models: inferring about a proportion, with some Monte Carlo illustration,  and the complication of background elements, normal mean, with an improper prior making an appearance [on p.69] with no mention being made of the general prohibition of such generalised priors when using Bayes factors or even of the Lindley-Jeffreys paradox. Again, the main difference with Bayesian textbooks stands with the chosen examples.

Chapter 3 focus on evidence evaluation [not in the computational sense] but, again, the coverage is about standard models: processing the Binomial, multinomial, Poisson models, again though conjugates. (With the side remark that Fig 3.2 is rather unhelpful: when moving the prior probability of the null from zero to one, its posterior probability also moves from zero to one!) We are back to the Normal mean case with the model variance being known then unknown. (An unintentionally funny remark (p.96) about the dependence between mean and variance being seen as too restrictive and replaced with… independence!). At last (for me!), the book is pointing [p.99] out that the BF is highly sensitive to the choice of the prior variance (Lindley-Jeffreys, where art thou?!), but with a return of the improper prior (on said variance, p.102) with no debate on the ensuing validity of the BF. Multivariate Normals are also presented, with Wishart priors on the precision matrix, and more details about Chib’s estimate of the evidence. This chapter also contains illustrations of the so-called score-based BF which is simply (?) a Bayes factor using a distribution on a distance summary (between an hypothetical population and the data) and an approximation of the distributions of these summaries, provided enough data is available… I also spotted a potentially interesting foray into BF variability (Section 3.4.2), although not reaching all the way to a notion of BF posterior distributions.

Chapter 4 stands for Bayes factors for investigation, where alternative(s) is(are) less specified, as testing eg Basmati rice vs non-Basmati rice. But there is no non-parametric alternative considered in the book. Otherwise, it looks to me rather similar to Chapter 3, i.e. being back to binomial, multinomial models, with more discussions onm prior specification, more normal, or non-normal model, where the prior distribution is puzzingly estimated by a kernel density estimator, a portmanteau alternative (p.157), more multivariate Normals with Wishart priors and an entry on classification & discrimination.

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE. As appropriate for a book about Chance!]

## 21w5107 [½day 3]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , on December 2, 2021 by xi'an

Day [or half-day] three started without firecrackers and with David Rossell (formerly Warwick) presenting an empirical Bayes approach to generalised linear model choice with a high degree of confounding, using approximate Laplace approximations. With considerable improvements in the experimental RMSE. Making feeling sorry there was no apparent fully (and objective?) Bayesian alternative! (Two more papers on my reading list that I should have read way earlier!) Then Veronika Rockova discussed her work on approximate Metropolis-Hastings by classification. (With only a slight overlap with her One World ABC seminar.) Making me once more think of Geyer’s n⁰564 technical report, namely the estimation of a marginal likelihood by a logistic discrimination representation. Her ABC resolution replaces the tolerance step by an exponential of minus the estimated Kullback-Leibler divergence between the data density and the density associated with the current value of the parameter. (I wonder if there is a residual multiplicative constant there… Presumably not. Great idea!) The classification step need be run at every iteration, which could be sped up by subsampling.

On the always fascinating theme of loss based posteriors, à la Bissiri et al., Jack Jewson (formerly Warwick) exposed his work generalised Bayesian and improper models (from Birmingham!). Using data to decide between model and loss, which sounds highly unorthodox! First difficulty is that losses are unscaled. Or even not integrable after an exponential transform. Hence the notion of improper models. As in the case of robust Tukey’s loss, which is bounded by an arbitrary κ. Immediately I wonder if the fact that the pseudo-likelihood does not integrate is important beyond the (obvious) absence of a normalising constant. And the fact that this is not a generative model. And the answer came a few slides later with the use of the Hyvärinen score. Rather than the likelihood score. Which can itself be turned into a H-posterior, very cool indeed! Although I wonder at the feasibility of finding an [objective] prior on κ.

Rajesh Ranganath completed the morning session with a talk on [the difficulty of] connecting Bayesian models and complex prediction models. Using instead a game theoretic approach with Brier scores under censoring. While there was a connection with Veronika’s use of a discriminator as a likelihood approximation, I had trouble catching the overall message…

## Metropolis-Hastings via Classification [One World ABC seminar]

Posted in Statistics, University life with tags , , , , , , , , , , , , , , , on May 27, 2021 by xi'an

Today, Veronika Rockova is giving a webinar on her paper with Tetsuya Kaji Metropolis-Hastings via classification. at the One World ABC seminar, at 11.30am UK time. (Which was also presented at the Oxford Stats seminar last Feb.) Please register if not already a member of the 1W ABC mailing list.

## frontier of simulation-based inference

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on June 11, 2020 by xi'an

“This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, `The Science of Deep Learning,’ held March 13–14, 2019, at the National Academy of Sciences in Washington, DC.”

A paper by Kyle Cranmer, Johann Brehmer, and Gilles Louppe just appeared in PNAS on the frontier of simulation-based inference. Sounding more like a tribune than a research paper producing new input. Or at least like a review. Providing a quick introduction to simulators, inference, ABC. Stating the shortcomings of simulation-based inference as three-folded:

1. costly, since required a large number of simulated samples
2. loosing information through the use of insufficient summary statistics or poor non-parametric approximations of the sampling density.
3. wasteful as requiring new computational efforts for new datasets, primarily for ABC as learning the likelihood function (as a function of both the parameter θ and the data x) is only done once.

And the difficulties increase with the dimension of the data. While the points made above are correct, I want to note that ideally ABC (and Bayesian inference as a whole) only depends on a single dimension observation, which is the likelihood value. Or more practically that it only depends on the distance from the observed data to the simulated data. (Possibly the Wasserstein distance between the cdfs.) And that, somewhat unrealistically, that ABC could store the reference table once for all. Point 3 can also be debated in that the effort of learning an approximation can only be amortized when exactly the same model is re-employed with new data, which is likely in industrial applications but less in scientific investigations, I would think. About point 2, the paper misses part of the ABC literature on selecting summary statistics, e.g., the culling afforded by random forests ABC, or the earlier use of the score function in Martin et al. (2019).

The paper then makes a case for using machine-, active-, and deep-learning advances to overcome those blocks. Recouping other recent publications and talks (like Dennis on One World ABC’minar!). Once again presenting machine-learning techniques such as normalizing flows as more efficient than traditional non-parametric estimators. Of which I remain unconvinced without deeper arguments [than the repeated mention of powerful machine-learning techniques] on the convergence rates of these estimators (rather than extolling the super-powers of neural nets).

“A classifier is trained using supervised learning to discriminate two sets of data, although in this case both sets come from the simulator and are generated for different parameter points θ⁰ and θ¹. The classifier output function can be converted into an approximation of the likelihood ratio between θ⁰ and θ¹ (…) learning the likelihood or posterior is an unsupervised learning problem, whereas estimating the likelihood ratio through a classifier is an example of supervised learning and often a simpler task.”

The above comment is highly connected to the approach set by Geyer in 1994 and expanded in Gutmann and Hyvärinen in 2012. Interestingly, at least from my narrow statistician viewpoint!, the discussion about using these different types of approximation to the likelihood and hence to the resulting Bayesian inference never engages into a quantification of the approximation or even broaches upon the potential for inconsistent inference unlocked by using fake likelihoods. While insisting on the information loss brought by using summary statistics.

“Can the outcome be trusted in the presence of imperfections such as limited sample size, insufficient network capacity, or inefficient optimization?”

Interestingly [the more because the paper is classified as statistics] the above shows that the statistical question is set instead in terms of numerical error(s). With proposals to address it ranging from (unrealistic) parametric bootstrap to some forms of GANs.

## from here to infinity

Posted in Books, Statistics, Travel with tags , , , , , , , , , , , , , on September 30, 2019 by xi'an

“Introducing a sparsity prior avoids overfitting the number of clusters not only for finite mixtures, but also (somewhat unexpectedly) for Dirichlet process mixtures which are known to overfit the number of clusters.”

On my way back from Clermont-Ferrand, in an old train that reminded me of my previous ride on that line that took place in… 1975!, I read a fairly interesting paper published in Advances in Data Analysis and Classification by [my Viennese friends] Sylvia Früwirth-Schnatter and Gertrud Malsiner-Walli, where they describe how sparse finite mixtures and Dirichlet process mixtures can achieve similar results when clustering a given dataset. Provided the hyperparameters in both approaches are calibrated accordingly. In both cases these hyperparameters (scale of the Dirichlet process mixture versus scale of the Dirichlet prior on the weights) are endowed with Gamma priors, both depending on the number of components in the finite mixture. Another interesting feature of the paper is to witness how close the related MCMC algorithms are when exploiting the stick-breaking representation of the Dirichlet process mixture. With a resolution of the label switching difficulties via a point process representation and k-mean clustering in the parameter space. [The title of the paper is inspired from Ian Stewart’s book.]