Conditional love [guest post]
[When Dan Simpson told me he was reading Terenin’s and Draper’s latest arXival in a nice Bath pub—and not a nice bath tub!—, I asked him for a blog entry and he agreed. Here is his piece, read at your own risk! If you remember to skip the part about Céline Dion, you should enjoy it very much!!!]
Probability has traditionally been described, as per Kolmogorov and his ardent follower Katy Perry, unconditionally. This is, of course, excellent for those of us who really like measure theory, as the maths is identical. Unfortunately mathematical convenience is not necessarily enough and a large part of the applied statistical community is working with Bayesian methods. These are unavoidably conditional and, as such, it is natural to ask if there is a fundamentally conditional basis for probability.
Bruno de Finetti, and later Richard Cox and Edwin Jaynes, considered conditional bases for Bayesian probability that are, unfortunately, incomplete. The critical problem is that they mainly consider finite state spaces and construct finitely additive systems of conditional probability. For a variety of reasons, neither of these restrictions holds much truck in the modern world of statistics.
In a recently arXiv’d paper, Alexander Terenin and David Draper devise a set of axioms that make the Cox-Jaynes system of conditional probability rigorous. Furthermore, they show that the complete set of Kolmogorov axioms (including countable additivity) can be derived as theorems from their axioms by conditioning on the entire sample space.
This is a deep and fundamental paper, which unfortunately means that I most probably do not grasp its complexities (especially as, for some reason, I keep reading it in pubs!). However, I’m going to have a shot at having some thoughts on it, because I feel like it’s the sort of paper one should have thoughts on.
The paper begins with a solid introduction to the systems of probability introduced by both Kolmogorov (presented here as inherently frequentist) and de Finetti (later considered in Remark 2.6 to be essentially useless due to its focus on finite additivity on infinite state spaces). The bulk of the paper is devoted to the Cox-Jaynes system of probability, and the remainder of the “review” section focusses on this.
Cox-Jaynes consider two sets of propositions. The first, denoted A, is of unknown truth status, while the second, B, is of known truth status. They aim to construct “plausibility functions” pl(A|B) that can be used like (? as?) conditional probabilities to describe how the unknown truth status of A is updated by our knowledge of B. This is the point where I need to admit to two egregious sins: I know absolutely nothing about logic and Boolean algebras, and I have never read Jaynes’ book. (At least one of these will disgust Professor Draper.) As such, I am massively relieved when it is pointed out that Stone (of Stone-Weierstrass fame!) showed that every Boolean system is isomorphic to a set-theoretic space. So plausibility is replaced by conditional probability and we can all let out that breath we’ve been holding.
With this in mind, Terenin and Draper (henceforth TD) propose a set of set-theoretic axioms based on conditional probability. The first four are uncontroversial, albeit rather abstract. (There is a slightly awkward notational shift here from plausibility to conditional probability.)
A1: Probability is a real number. [This is needless generality:
A2: (Continuity) For all non-empty B and sets A_n strictly increasing to A, P(A_n | B) → P(A | B), and similarly for decreasing sets.
A3: (Product rule) P(A ∩ B | C) = f(P(U | V), P(X | Y)) for some function f(.,.) and some sets (U,V,X,Y) “involving” (A,B,C)
A4: (Sum rule) There exists a function h such that P(A^c | B) = h(P(A | B)), where A^c is the complement of A
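For readers who, like me, find the product and sum rules easier to believe once they have seen them hold somewhere, here is a minimal sketch on a toy example. Everything in it is an assumption made for illustration and not taken from TD: the four-point sample space, the counting definition of conditional probability, and the specific choices f(x, y) = xy and h(x) = 1 - x. The sum rule is read here as a statement about the complement of A.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite sample space with four equally likely outcomes.
OMEGA = frozenset({1, 2, 3, 4})


def cond_prob(a, b):
    """P(A | B) by counting; B must be non-empty."""
    if not b:
        raise ValueError("conditioning set must be non-empty")
    return Fraction(len(a & b), len(b))


def f(x, y):
    # Assumed product-rule function: plain multiplication.
    return x * y


def h(x):
    # Assumed sum-rule function: complement to one.
    return 1 - x


def all_subsets(s):
    """Every subset of s, including the empty set and s itself."""
    items = sorted(s)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def check_rules():
    """Return True if both rules hold for every admissible choice of sets."""
    subsets = all_subsets(OMEGA)
    for c in subsets:
        if not c:
            continue  # cannot condition on the empty set
        for a in subsets:
            # Sum rule: P(A^c | C) = h(P(A | C))
            if cond_prob(OMEGA - a, c) != h(cond_prob(a, c)):
                return False
            for b in subsets:
                if not (b & c):
                    continue  # P(A | B ∩ C) is undefined here
                # Product rule: P(A ∩ B | C) = f(P(B | C), P(A | B ∩ C))
                lhs = cond_prob(a & b, c)
                rhs = f(cond_prob(b, c), cond_prob(a, b & c))
                if lhs != rhs:
                    return False
    return True
```

Calling `check_rules()` loops over every triple of subsets using exact rational arithmetic, so there is no floating-point fuzz to worry about. It says nothing about infinite state spaces, which is of course where the fifth axiom earns its keep.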
The interesting axiom is the fifth one, which is required to obtain countable additivity and avoid paradoxes. A previous “Density” axiom, stated informally as “P(. | B) is dense in [0,1]” also had this function, but as stated required the state space to be at least countably infinite, which is not massively useful.
TD propose an axiom (A5) of “Comparative Extendibility”, which, loosely, says that either the domain is already dense in [0,1] or it’s OK to extend the range to the interval [0,1] and add some independent randomness in a way that preserves the product rule. To my mind, this axiom is not crazy, but there’s an odd statement that it’s as strong as the Density axiom, which is not obvious to me (especially given that this axiom allows for finite state spaces).
This is enough to be able to derive a system of (Kolmogorov) probability using conditional probability as a primitive. In fact, it is shown in Theorem 4.14 that these concepts are isomorphic (i.e. conditional probability in the Kolmogorov sense begets a plausibility function in the “Cox-Jaynes + Extendibility” sense). This actually makes me query TD’s first “future work” question in Remark 5.7. If Comparative Extendability leads to a system isomorphic to the Kolmogorov system, I am at a loss to see what gain there is in relaxing, or weakening, or finding an alternative to this assumption. Surely the isomorphism result pinpoints this Axiom as the one we want!
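As a rough sketch of what “conditioning on the entire sample space” amounts to, in notation that is assumed here and not quoted from the paper: write Ω for the sample space and suppose the product-rule function has already been reduced to plain multiplication (the usual outcome of Cox-style arguments). Unconditional probability is then a definition, the familiar ratio formula follows from the product rule, and countable additivity is the statement that TD obtain as a theorem.

```latex
\begin{align*}
P(A) &:= P(A \mid \Omega), \\
P(A \cap B \mid \Omega) &= P(B \mid \Omega)\, P(A \mid B \cap \Omega)
  \quad\Longrightarrow\quad
  P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{whenever } P(B) > 0, \\
P\Big(\bigcup_{n} A_n\Big) &= \sum_{n} P(A_n)
  \quad \text{for pairwise disjoint } A_1, A_2, \dots
\end{align*}
```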
TD also take an interesting detour towards Bayesian Non-Parametrics. Unfortunately, this is rendered in eye-shattering italics by the unfortunate Series B (at a guess) style file. JRSS B is an excellent journal, but it is an exceedingly ugly one and I would really really appreciate a change to the house style! (Compare it with Annals of Statistics, or the single column version of Statistical Science, which are both absolutely lovely).
Leaving aside the aesthetics (and I have been reliably informed that no one who listens to as much Céline Dion as I do can style themselves an aesthete. As I write, My Heart Will Go On is blasting. Were this my own blog, I would now expound upon the obvious parallels between Céline and Bayesian statistics, from obscure French concern to world-beating megastar, but I suspect I’m wearing out my host’s good will already [yes you are, Dan!]), the BNP example is fascinating. The authors argue that the Cox-Jaynes-Terenin-Draper axioms (A1-A5) + our old friend Exchangeability lead to a complete, optimal inference for a large scale A/B testing scenario. The idea is that, by de Finetti’s representation theorem, Exchangeability implies that
$y_i \mid F \stackrel{\text{iid}}{\sim} F, \quad i = 1, 2, \ldots,$
where the CDF F is drawn from some set that is (weak¹-) dense in the set of all CDFs. They then argue that a Dirichlet process with vanishing “sample size” parameter α0 is such a dense process on the space of CDFs. The Cox-Jaynes-Terenin-Draper axioms then lead to a unique way of updating the information based on the data (namely through Bayes’ rule). Hence this is a completely justified Bayesian non-parametric analysis of this type of data that does not inject any new information into the problem. Modulo the footnote below, this is an amazing thing! Somewhat tangential to this laudable aim, they also remind us of Draper’s other 2015 paper, in which he shows that this type of Dirichlet process analysis can be computed almost exactly using the standard, embarrassingly parallel, Bootstrap. I think we all understand at this point that, at least for simple Bayesian problems, MCMC is (in a computer science sense) not a great way to solve large-scale problems. So this sort of computational discovery is of utmost importance!
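To give a feel for why this is computationally so cheap, here is a minimal sketch of the limiting analysis. It rests on a standard fact: as α0 goes to zero, the Dirichlet process posterior puts Dirichlet(1, ..., 1) weights on the observed data points, which is Rubin's Bayesian bootstrap. This is not Draper's code, and it is not the frequentist bootstrap approximation his paper discusses; the function names, the choice of the mean as the quantity of interest, and the two-arm setup are all assumptions made for illustration.

```python
import numpy as np


def bayesian_bootstrap_mean(y, n_draws=2000, rng=None):
    """Posterior draws of the mean of F in the alpha_0 -> 0 Dirichlet process limit."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    # Each row is one Dirichlet(1, ..., 1) weight vector over the observed points.
    weights = rng.dirichlet(np.ones(y.size), size=n_draws)
    # Each draw of F is a reweighting of the data; its mean is a weighted average.
    return weights @ y


def ab_lift_posterior(control, treatment, n_draws=2000, seed=0):
    """Posterior draws of mean(treatment) - mean(control), arms treated independently."""
    rng = np.random.default_rng(seed)
    mu_control = bayesian_bootstrap_mean(control, n_draws, rng)
    mu_treatment = bayesian_bootstrap_mean(treatment, n_draws, rng)
    return mu_treatment - mu_control


def demo():
    """Run the sketch on synthetic (made-up) data and print a 95% interval."""
    rng = np.random.default_rng(1)
    control = rng.exponential(1.0, size=500)
    treatment = rng.exponential(1.1, size=500)
    draws = ab_lift_posterior(control, treatment)
    print(np.quantile(draws, [0.025, 0.5, 0.975]))
```

Every posterior draw is independent of every other, so the rows of `weights` can be generated on as many machines as you like with no communication between them. That is the “embarrassingly parallel” part, and it is exactly what MCMC cannot offer.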
In the end, I love and hate (obviously more on the side of love) this paper. I’m fascinated by the content and I think this is a serious foundational contribution to Bayesian statistics. I just really wish the formatting was less terrible and that some of the (endless) remarks were converted and extended in text. Parts of the paper are extremely difficult to read due to formatting, or the authors’ occasional terse style. I would’ve appreciated a stronger link between plausibility functions and conditional probabilities (this stumped me the first time I read the paper), as well as a better discussion of exactly how “prior probabilities” exist in this Cox-Jaynes world. I assume I would have this by reading Jaynes’ book, but I unfortunately haven’t. I would’ve also liked the fascinating BNP example to have been its own section.
These minor criticisms are all leading towards one big question that I have upon reading this paper. (Not necessarily a question that this paper could answer, but certainly one that I’d love to hear TD’s thoughts on). What does a lack of information mean in the Cox-Jaynes universe? The example showed that one specific form of weak information (namely exchangeability) is incorporated nicely, but what does Cox-Jaynes say we should do when we have no opinion on the truth status of a set of propositions B? To my eye, this violates the spirit of A1, which explicitly forbids the σ-finite measures that would be necessary to answer this sort of question. Hence, I would much rather have the future work focus on relaxing A1 than relaxing the less exciting (and less practically limiting) A5.
In the end, this is an excellent paper on Cox-Jaynes-style probability, but I really feel that the extensions needed to encompass (or justify or negate) the current practice of applied statistics are more than just the three fairly academic Future Work suggestions contained in Remark 5.7. This is a great paper and it deserves the effort it takes to digest it.
1. Arguably, this is the “hand waving” bit of this argument—there is no infinite dimensional equivalent of the Lebesgue measure, so even an extremely small value of $\alpha_0$ gives an “informative” distribution. I assume this is a thing that BNP specialists can get around. Ideally this result would be (at least almost) independent of the choice of any “nice” net of probability measures on the space of CDFs converging to the non-existent vague measure. I’m going to guess that we will need some non-Gaussian version of an abstract Wiener Space [i.e. a nice embedding into a sensible larger space] to get through this measure theoretical nightmare. Maybe this is why people like Boolean systems!
[Warning: due to vacationing activities of a back-country type, mileage on the ‘Og may vary till JSM next week. Leaving readers more leeway to enjoy or tear apart the above…]