## a generalized representation of Bayesian inference

Posted in Books on July 5, 2019 by xi'an

Jeremias Knoblauch, Jack Jewson and Theodoros Damoulas, all affiliated with Warwick (hence a potentially biased reading!), arXived a paper on loss-based Bayesian inference that Jack discussed with me on my last visit to Warwick. I was somewhat scared by the 61 pages, of which the first 8 are in NeurIPS style. The authors argue for a decision-theoretic approach to Bayesian inference that involves a loss over distributions and a divergence from the prior. For instance, when using the log-score as the loss and the Kullback-Leibler divergence, the regular posterior emerges, as shown by Arnold Zellner. Variational inference also falls under this hat. The argument for this generalization is that any form of loss can be used and still return a distribution that can be used to assess uncertainty about the parameter (of interest). Among the axioms they produce for justifying the derivation of the optimal procedure, including cases where the posterior is restricted to a certain class, one [Axiom 4] generalizes the likelihood principle. Given the freedom brought by this general framework, plenty of fringe Bayes methods like standard variational Bayes can be seen as solutions to such a decision problem. Others, like EP, cannot. Of interest to me is the potential for this formal framework to encompass misspecification and likelihood-free settings, as well as to assess priors, which is always a fishy issue. (The authors mention in addition the capacity to build related, specifically designed Bayesian deep networks, of which I know nothing.) My obvious reaction is one of facing an abundance of wealth (!), but encompassing approximate Bayesian solutions within a Bayesian framework remains an exciting prospect.
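The Zellner result mentioned above can be checked numerically. What follows is a minimal sketch, not taken from the paper: it assumes a finite grid of parameter values, a toy N(θ,1) model and a standard Normal prior (all hypothetical choices), and relies on the standard fact that the minimiser of "expected loss plus Kullback-Leibler divergence from the prior" is proportional to prior × exp(−loss). With the log-score as loss, this minimiser should coincide with the usual posterior.

```python
import numpy as np

def objective(q, prior, losses):
    """Expected loss under q plus KL(q || prior), on a finite grid."""
    q = np.asarray(q, dtype=float)
    mask = q > 0  # 0 * log(0) is taken as 0
    kl = np.sum(q[mask] * np.log(q[mask] / prior[mask]))
    return float(np.sum(q * losses) + kl)

def gibbs_minimiser(prior, losses):
    """Closed-form minimiser of the objective: q proportional to prior * exp(-loss)."""
    w = prior * np.exp(-(losses - losses.min()))  # shift for numerical stability
    return w / w.sum()

rng = np.random.default_rng(0)
theta = np.linspace(-3.0, 3.0, 61)   # hypothetical parameter grid
prior = np.exp(-0.5 * theta**2)      # discretised N(0,1) prior (assumption)
prior /= prior.sum()
x = rng.normal(1.0, 1.0, size=20)    # toy data from N(1,1)

# Log-score loss: minus the N(theta,1) log-likelihood of the sample,
# up to an additive constant that does not affect the minimiser.
losses = np.array([0.5 * np.sum((x - t) ** 2) for t in theta])
q_star = gibbs_minimiser(prior, losses)

# The usual posterior on the same grid, computed from the likelihood itself.
likelihood = np.array(
    [np.prod(np.exp(-0.5 * (x - t) ** 2) / np.sqrt(2 * np.pi)) for t in theta]
)
posterior = prior * likelihood
posterior /= posterior.sum()
```

Swapping `losses` for another loss (or restricting the candidate distributions to a smaller family) gives other members of the generalised family, which is where the variational connection comes from.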

## O’Bayes 19/1 [snapshots]

Posted in Books, pictures, Statistics, University life on June 30, 2019 by xi'an

Although yesterday's tutorials of O’Bayes 2019 were poorly attended, despite being great entries into objective Bayesian model choice, recent advances in MCMC methodology, and the multiple layers of BART (for which I have to blame myself, having stuck the beginning of O’Bayes too closely to the end of BNP, as only the most dedicated could manage the commute from Oxford to Coventry to reach Warwick in time), the first day of talks was well attended, despite weekend commitments, conference fatigue, and perfect summer weather! Here are some snapshots from my bench (with apologies for not covering better the more theoretical talks I had trouble following, due to an early and intense morning swimming lesson! Like Steve Walker’s utility-based derivation of priors that generalise maximum entropy priors. But being entirely independent from the model does not sound to me like such a desirable feature… And Natalia Bochkina’s Bernstein-von Mises theorem for a location-scale semi-parametric model, including a clever construct of a mixture of two Dirichlet priors to achieve proper convergence.)

Jim Berger started the day with a talk on imprecise probabilities, involving the society for imprecise probability, which I discovered while reading Keynes’ book. The talk offered a neat resolution of the Jeffreys-Lindley paradox: when re-expressing the null as an imprecise null, the posterior probability of the null no longer converges to one, its limit depending on the prior modelling, if involving a prior on the bias as well. Chris discussed the talk, mentioning a recent work with Edwin Fong on reinterpreting the marginal likelihood as exhaustive cross-validation, summing over all possible subsets of the data [using the log marginal predictive].

Håvard Rue gave a follow-up to his Valencià O’Bayes 2015 talk on PC-priors. With a pretty hilarious introduction on his difficulties with constructing priors and counseling students about their Bayesian modelling. And with a list of principles and desiderata to define a reference prior. However, I somewhat disagree with his argument that the Kullback-Leibler distance from the simpler (base) model cannot be scaled, as it is essentially a log-likelihood. And it feels like multivariate parameters need some sort of separability to define distance(s) to the base model, since the distance somewhat summarises the whole departure from the simpler model. (Håvard also joined my achievement of putting an ostrich in a slide!) In his discussion, Robin Ryder made a very pragmatic recap of the difficulties with constructing priors. And pointed out a natural link with ABC (which brings us back to Don Rubin’s motivation for introducing the algorithm as a formal thought experiment).

Sara Wade gave the final talk of the day about her work on Bayesian cluster analysis. Whose discussion in Bayesian Analysis I alas missed. Cluster estimation, as mentioned frequently on this blog, is a rather frustrating challenge despite the simple formulation of the problem. (And I will not mention Larry’s tequila analogy!) The current approach is based on loss functions directly addressing the clustering aspect, integrating out the parameters. Which produces the interesting notion of neighbourhoods of partitions and hence credible balls in the space of partitions. It still remains unclear to me that cluster estimation is at all achievable, since the partition space explodes with the sample size and hence makes the most probable cluster more and more unlikely in that space. Somewhat paradoxically, the paper concludes that estimating the cluster produces a more reliable estimator of the number of clusters than looking at the marginal distribution of this number. In her discussion, Clara Grazian also pointed out the ambivalent use of clustering, where the intended meaning somehow diverges from the meaning induced by the mixture model.

## leave Bayes factors where they once belonged

Posted in Statistics on February 19, 2019 by xi'an

In the past weeks I have received and read several papers (and X validated entries) where the Bayes factor is used to compare priors. Which does not look right to me, not on the basis of my general dislike of Bayes factors!, but simply because this seems to clash with the (my?) concept of Bayesian model choice and also because data should not play a role in that situation. The objections range from the data being used to select a prior, hence at least twice to run the inference, to resorting to a single parameter value (namely the one behind the data) to decide between two distributions, to having no asymptotic justification, to eventually favouring the prior concentrated on the maximum likelihood estimator. And more. But I fear that this reticence to test for prior adequacy also extends to the prior predictive, or Box’s p-value, namely the probability under this prior predictive to observe something “more extreme” than the current observation, to quote from David Spiegelhalter.
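The last objection can be illustrated on a toy case. This is only a sketch under assumptions made here for illustration: a single observation x from N(θ,1), Normal priors N(m,τ²) on θ, and a hypothetical observed value. The marginal likelihood is then the N(m,1+τ²) density at x, so comparing priors by their marginal likelihoods rewards priors that pile up on the maximum likelihood estimator, which is x itself.

```python
import math

def log_marginal(x, m, tau):
    """Log marginal density of x when x | theta ~ N(theta, 1) and theta ~ N(m, tau^2)."""
    var = 1.0 + tau**2  # marginal variance of x
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - m) ** 2 / var

x_obs = 1.7  # hypothetical observation; the MLE of theta is x_obs

# Candidate priors as (mean, standard deviation); all hypothetical.
candidates = {
    "vague N(0, 10^2)": (0.0, 10.0),
    "standard N(0, 1)": (0.0, 1.0),
    "centred at the MLE, N(x, 0.1^2)": (x_obs, 0.1),
    "near point mass at the MLE": (x_obs, 1e-6),
}
scores = {name: log_marginal(x_obs, m, tau) for name, (m, tau) in candidates.items()}
best = max(scores, key=scores.get)  # prior with the largest marginal likelihood
```

In this setting the winner is the prior that is almost a point mass at the observed value, a prior that could only have been chosen after seeing the data.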

## let the evidence speak [book review]

Posted in Books, Kids, Statistics on December 17, 2018 by xi'an

This book by Alan Jessop, professor at the Durham University Business School, aims at presenting Bayesian ideas and methods towards decision making “without formula because they are not necessary; the ability to add and multiply is all that is needed.” The trick is in using a Bayes grid, in other words a two by two table. (There are a few formulas that survived the slaughter, see e.g. on p. 91 the formula for the entropy. Contained in the chapter on information that I find definitely unclear.) When leaving the 2×2 world, things become more complicated and the construction of a prior belief as a probability density gets heroic without the availability of maths formulas. The first part of the book is about Likelihood, albeit not the likelihood function, despite having the general rule that (p.73)

belief is proportional to base rate x likelihood

which is the book’s version of Bayes’ (base?!) theorem. It then goes on to discuss the less structured nature of the prior (or prior beliefs) against the likelihood by describing Tony O’Hagan’s way of scaling experts’ beliefs in terms of a Beta distribution. And mentioning Jaynes’ maximum entropy prior without a single formula. What is hard to fathom from the text is how one can derive the likelihood outside surveys. (Using the illustration of the 1963 murder of Oswald by Ruby in the likelihood chapter does not particularly help!) A bit of nitpicking at this stage: the sentence

“The ancient Greeks, and before them the Chinese and the Aztecs…”

is historically incorrect since, while the Chinese empire dates back before the Greek dark ages, the Aztecs only ruled Mexico from the 14th century (AD) until the Spanish invasion. While most of the book sticks with unidimensional parameters, it also discusses more complex structures, for which it relies on Monte Carlo, although the description is rather cryptic (use your spreadsheet!, p.133). The book at this stage turns into a more story-telling mode, by considering for instance the Federalist papers analysis by Mosteller and Wallace. The reader can only follow the process of assessing a document's authorship for a single word, as multidimensional cases (for either data or parameters) are out of reach. The same comment applies to the ecology, archeology, and psychology chapters that follow. The intermediary chapter on the “grossly misleading” [Court wording] use of the statistical evidence in the Sally Clark prosecution is more accessible in that (again) it relies on a single number. Returning to the ban of Bayes rule in British courts:

In the light of the strong criticism by this court in the 1990s of using Bayes theorem before the jury in cases where there was no reliable statistical evidence, the practice of using a Bayesian approach and likelihood ratios to formulate opinions placed before a jury without that process being disclosed and debated in court is contrary to principles of open justice.

the discussion found in the book is quite moderate and inclusive, in that a Bayesian analysis helps in gathering evidence about a case, but may be misunderstood or misused at the [non-Bayesian] decision level.

In conclusion, Let the Evidence Speak is an interesting introduction to Bayesian thinking, through a simplifying device, the Bayes grid, which seems to come from management, with a large number of examples, if not necessarily all realistic, and some side-stories. I doubt this exposure can produce expert practitioners, but it makes for a worthwhile awakening for someone “likely to have read this book because [one] had heard of Bayes but were uncertain what is was” (p.222). With commendable caution and warnings along the way.

## back to the Bayesian Choice

Posted in Books, Kids, Statistics, University life on October 17, 2018 by xi'an

Surprisingly (or not?!), I received two requests about some exercises from The Bayesian Choice, one from a group of students from McGill having difficulties solving the above, wondering about the properness of the posterior (but missing the integration of x), to whom I sent back this correction. And another one from the Czech Republic about a difficulty with the term “evaluation” by which I meant (pardon my French!) estimation.

## Bayesian decision riddle

Posted in Books, Kids, Statistics on June 15, 2017 by xi'an

The current puzzle on The Riddler is a version of the secretary problem with an interesting (?) Bayesian solution.

Given four positive numbers x¹, x², x³, x⁴, observed sequentially, the associated utility is the value of x at the stopping time. What is the optimal stopping rule?

While nothing is mentioned about the distribution of the x’s, I made the assumption that they were iid and uniformly distributed over (0,M), with M unknown, and tried a Bayesian resolution with the non-informative prior π(M)=1/M. And failed. The reason for this failure is that the expected utility is infinite at the first step: while the posterior expected utility is finite with three and two observations, meaning I can compare stopping and continuing at the second and third steps, the predicted expected reward for continuing after observing x¹ does not exist because the expected value of max(x¹,x²) given x¹ does not exist. Indeed, the predictive density of x² is proportional to max(x¹,x²)⁻²… Several alternatives are possible to bypass this impossible resolution, from changing the utility function to picking another reference prior.
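To spell out this divergence, here is a short derivation under the assumptions stated above (iid U(0,M) observations and π(M)=1/M), writing x_1 and x_2 for x¹ and x² to avoid confusion with squares. The posterior on M given x_1 is

$\pi(M|x_1)=\dfrac{x_1}{M^2}\,\mathbb{I}(M>x_1)$

so that the predictive density of x_2 is

$f(x_2|x_1)=\int_{\max(x_1,x_2)}^\infty \dfrac{1}{M}\,\dfrac{x_1}{M^2}\,\text{d}M=\dfrac{x_1}{2\max(x_1,x_2)^2}$

and hence

$\mathbb{E}[x_2|x_1]\ge\int_{x_1}^\infty x_2\,\dfrac{x_1}{2x_2^2}\,\text{d}x_2=\dfrac{x_1}{2}\int_{x_1}^\infty\dfrac{\text{d}x_2}{x_2}=\infty$

Since max(x_1,x_2) is at least x_2, its expectation given x_1 is infinite as well.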

For instance, using a prior like π(M)=1/M² (and the same monetary return utility) leads to a proper optimal solution, namely

1. always wait for the second observation x²
2. stop at x² if x²>11x¹/12, else wait for x³
3. stop at x³ if x³>23 max(x¹,x²)/24, else observe x⁴

obtained analytically on a bar table in Rouen (and checked numerically later).

Another approach is to try to optimise the probability of picking the largest of the four x’s, but this does not lead to an interesting solution, since it corresponds to picking the first maximum after x¹, while picking the largest among the remaining ones leads to a somewhat convoluted solution I have no patience to produce here! Plus this is not a really pertinent loss function as it does not discriminate enough against waiting…

## going to war [a riddle]

Posted in Books, Kids, Statistics on December 16, 2016 by xi'an

On the Riddler this week, a seemingly obvious riddle:

A game consists of Alice and Bob, each with a $1 bill, receiving a U(0,1) strength each, unknown to the other, and deciding whether or not to bet on this strength being larger than the opponent’s. If no player bets, they both keep their $1 bill. Else, the winner leaves with both bills. Find the optimal strategy.

As often when “optimality” is mentioned, the riddle is unclear because, when looking at the problem from a decision-theoretic perspective, the loss function of each player is not defined in the question. But the St. Petersburg paradox shows the type of loss clearly matters and the utility of money is anything but linear for large values, as explained by Daniel Bernoulli in 1738 (and later analysed by Laplace in his Essai Philosophique). Let us assume therefore that both players live in circumstances where losing or winning \$1 makes little difference, hence where the utility is linear. A loss function attached to the experiment for Alice [and a corresponding utility function for Bob] could then be a function of (a,b), the result of both Uniform draws, and of the decisions δ¹ and δ² of both players, namely

$L(a,b,\delta^1,\delta^2)=\begin{cases}0&\text{if }\delta^1=\delta^2=0\\\mathbb{I}(a<b)-\mathbb{I}(a>b)&\text{else}\\\end{cases}$

Considering this loss function, Alice aims at minimising the expected loss by her choice of δ¹, equal to zero or one, an expected loss that hence depends on the unknown and simultaneous decision of Bob. If for instance Alice assumes Bob takes the decision to compete when observing an outcome b larger than a certain bound α, her decision is based on the comparison of (when B is Uniform(0,1))

$\mathbb{P}(a<B,B>\alpha)-\mathbb{P}(a>B,B>\alpha)=2(1-a\vee\alpha)-(1-\alpha)$

(if δ¹=0) and of 1-2a (if δ¹=1). Comparing both expected losses leads to Alice competing (δ¹=1) when a>α/2.
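The comparison above can be checked by simulation. The sketch below is only an illustration: the function names are made up, and the loss is taken to be +1 when Alice loses a played game, −1 when she wins it, and 0 when no one bets, with Bob betting whenever his draw exceeds α.

```python
import numpy as np

def loss_if_alice_passes(a, alpha):
    """Alice's expected loss when she does not bet and Bob bets iff B > alpha."""
    return 2 * (1 - max(a, alpha)) - (1 - alpha)

def loss_if_alice_bets(a):
    """Alice's expected loss when she bets: the game is played whatever Bob does."""
    return 1 - 2 * a

def alice_should_bet(a, alpha):
    """True when betting has the smaller expected loss, i.e. when a > alpha / 2."""
    return loss_if_alice_bets(a) < loss_if_alice_passes(a, alpha)

def simulate(a, alpha, alice_bets, n=200_000, seed=0):
    """Monte Carlo estimate of Alice's expected loss with B ~ U(0,1)."""
    rng = np.random.default_rng(seed)
    b = rng.uniform(size=n)
    # The game takes place if at least one player bets.
    played = np.ones(n, dtype=bool) if alice_bets else (b > alpha)
    loss = np.where(played, np.sign(b - a), 0.0)  # +1 if a < b, -1 if a > b
    return float(loss.mean())
```

For instance, with α=0.6, `alice_should_bet(0.31, 0.6)` holds while `alice_should_bet(0.29, 0.6)` does not, in agreement with the α/2 bound.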

However, there is no reason Alice should know the value of α when playing the (single) game and so she may think that Bob will follow the same reasoning, leading him to choose a new bound of α/4, and, by iterating the thought process, down all the way to α=0! So this modelling leads to always playing the game, with each player having a ½ probability of winning… Alternatively, Alice may set a prior on α, which leads to another bound on a for playing the game or not. Which in itself is not satisfactory either. (The published solution follows the above argument. Except for posting the maths expressions.)
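The iterated reasoning can be written down in a few lines. This is a minimal sketch with hypothetical function names, assuming each player's best response to an opponent's bound α is the bound α/2 derived above.

```python
def best_response_bound(alpha):
    """Bound above which a player bets when the opponent bets above alpha."""
    return alpha / 2

def iterate_bounds(alpha0, steps):
    """Successive bounds obtained when each player best-responds to the previous one."""
    bounds = [alpha0]
    for _ in range(steps):
        bounds.append(best_response_bound(bounds[-1]))
    return bounds

# Starting from any bound in (0,1], the sequence shrinks geometrically towards zero.
bounds = iterate_bounds(0.8, 30)
```

Whatever the starting bound, the sequence halves at each step, hence the limiting strategy of always betting.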