Conditional love [guest post]

[When Dan Simpson told me he was reading Terenin’s and Draper’s latest arXival in a nice Bath pub—and not a nice bath tub!—, I asked him for a blog entry and he agreed. Here is his piece, read at your own risk! If you remember to skip the part about Céline Dion, you should enjoy it very much!!!]

Probability has traditionally been described, as per Kolmogorov and his ardent follower Katy Perry, unconditionally. This is, of course, excellent for those of us who really like measure theory, as the maths is identical. Unfortunately mathematical convenience is not necessarily enough and a large part of the applied statistical community is working with Bayesian methods. These are unavoidably conditional and, as such, it is natural to ask if there is a fundamentally conditional basis for probability.

Bruno de Finetti, and later Richard Cox and Edwin Jaynes, considered conditional bases for Bayesian probability that are, unfortunately, incomplete. The critical problem is that they mainly consider finite state spaces and construct finitely additive systems of conditional probability. For a variety of reasons, neither of these restrictions holds much truck in the modern world of statistics.

In a recently arXiv’d paper, Alexander Terenin and David Draper devise a set of axioms that make the Cox-Jaynes system of conditional probability rigorous. Furthermore, they show that the complete set of Kolmogorov axioms (including countable additivity) can be derived as theorems from their axioms by conditioning on the entire sample space.

This is a deep and fundamental paper, which unfortunately means that I most probably do not grasp its complexities (especially as, for some reason, I keep reading it in pubs!). However, I’m going to have a shot at having some thoughts on it, because I feel like it’s the sort of paper one should have thoughts on.

The paper begins with a solid introduction to the systems of probability introduced by both Kolmogorov (presented here as inherently frequentist) and de Finetti (later considered in Remark 2.6 to be essentially useless due to its focus on finite additivity on infinite state spaces). The bulk of the paper is devoted to the Cox-Jaynes system of probability, and the remainder of the “review” section focusses on this.

Cox-Jaynes consider two sets of propositions. The first, denoted A, is of unknown truth status, while the second, B, is of known truth status, and they aim to construct “plausibility functions” pl(A|B) that can be used like (? as?) conditional probabilities to describe how the unknown truth status of A is updated by our knowledge of B. This is the point where I need to admit to two egregious sins: I know absolutely nothing about logic and Boolean algebras, and I have never read Jaynes’ book. (At least one of these will disgust Professor Draper.) As such, I am massively relieved when it is pointed out that Stone (of Stone-Weierstrass fame!) showed that every Boolean system is isomorphic to a set-theoretic space. So plausibility is replaced by conditional probability and we can all let out that breath we’ve been holding.

With this in mind, Terenin and Draper (henceforth TD) propose a set of set-theoretic axioms based on conditional probability. The first four are uncontroversial, albeit rather abstract. (There is a slightly awkward notational shift here from plausibility to conditional probability.)

A1: Probability is a real number. [This is needless generality:

P(A | B) \in [0,1]\,\forall (A,B) \in \mathcal{F} \times (\mathcal{F} \setminus \{\emptyset\})

is enough]

A2: (Continuity) For all non-empty B and sets \{ A_i\} strictly increasing to A, P(A_i |B) \nearrow P(A | B) and similarly for decreasing sets.

A3: (Product rule) P(A \cap B | C) = f(P(U | V), P(X|Y)) for some function f(.,.) and some sets (U,V,X,Y) “involving” (A,B,C)

A4: (Sum rule) There exists a function h such that P(A^c | B) = h(P(A | B))
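To make A3 and A4 a little less abstract, here is a small sanity check on a finite sample space. It is only a sketch under assumptions that are mine rather than TD's: I take the familiar choices f(x, y) = xy and h(x) = 1 - x, define conditional probability by counting outcomes of two fair dice, and pick three illustrative events A, B and C. It verifies one instance of each rule and says nothing about the general theorems.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair dice (an illustrative choice).
OMEGA = frozenset(product(range(1, 7), repeat=2))


def cond_prob(a, b):
    """P(A | B) by counting, defined only for a non-empty conditioning event B."""
    if not b:
        raise ValueError("conditioning event must be non-empty")
    return Fraction(len(a & b), len(b))


# Three hypothetical events used to exercise the rules.
A = frozenset(w for w in OMEGA if w[0] + w[1] == 8)  # total is 8
B = frozenset(w for w in OMEGA if w[0] % 2 == 0)     # first die is even
C = frozenset(w for w in OMEGA if w[1] >= 3)         # second die is at least 3

# Product rule with f(x, y) = x * y: P(A and B | C) = P(A | B and C) * P(B | C)
lhs = cond_prob(A & B, C)
rhs = cond_prob(A, B & C) * cond_prob(B, C)

# Sum rule with h(x) = 1 - x: P(A^c | B) = 1 - P(A | B)
complement = cond_prob(OMEGA - A, B)

print(lhs, rhs, complement)
```

Using exact fractions rather than floats means the two sides of each rule can be compared with plain equality.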

The interesting axiom is the fifth one, which is required to obtain countable additivity and avoid paradoxes. A previous “Density” axiom, stated informally as “P(. | B) is dense in [0,1]”, also served this purpose, but as stated it required the state space to be at least countably infinite, which is not massively useful.

TD propose an axiom (A5) of “Comparative Extendibility”, which, loosely, says that either the domain is already dense in [0,1] or it’s OK to extend the range to the interval [0,1] and add some independent randomness in a way that preserves the product rule. To my mind, this axiom is not crazy, but there’s an odd statement that it’s as strong as the Density axiom, which is not obvious to me (especially given that this axiom allows for finite state spaces).

This is enough to be able to derive a system of (Kolmogorov) probability using conditional probability as a primitive. In fact, it is shown in Theorem 4.14 that these concepts are isomorphic (i.e. conditional probability in the Kolmogorov sense begets a plausibility function in the “Cox-Jaynes + Extendibility” sense). This actually makes me query TD’s first “future work” question in Remark 5.7. If Comparative Extendability leads to a system isomorphic to the Kolmogorov system, I am at a loss to see what gain there is in relaxing, or weakening, or finding an alternative to this assumption. Surely the isomorphism result pinpoints this Axiom as the one we want!

TD also take an interesting detour towards Bayesian Non-Parametrics. Unfortunately, this is rendered in eye-shattering italics by the unfortunate Series B (at a guess) style file. JRSS B is an excellent journal, but it is an exceedingly ugly one and I would really really appreciate a change to the house style! (Compare it with Annals of Statistics, or the single column version of Statistical Science, which are both absolutely lovely).

Leaving aside the aesthetics (and I have been reliably informed that no one who listens to as much Céline Dion as I do can style themselves an aesthete. As I write, My Heart Will Go On is blasting. Were this my own blog, I would now expound upon the obvious parallels between Céline and Bayesian statistics, from obscure French concern to world-beating megastar, but I suspect I’m wearing out my host’s good will already [yes you are, Dan!]), the BNP example is fascinating. The authors argue that the Cox-Jaynes-Terenin-Draper axioms (A1-A5) + our old friend Exchangeability lead to a complete, optimal inference for a large scale A/B testing scenario. The idea is that, by de Finetti’s representation theorem, Exchangeability implies that

y_i | F \sim F,

where the CDF F is drawn from some set that is (weak¹-) dense in the set of all CDFs. They then argue that a Dirichlet process with vanishing “sample size” parameter \alpha_0 is such a dense process on the space of CDFs. The Cox-Jaynes-Terenin-Draper axioms then lead to a unique way of updating the information based on the data (namely through Bayes’ rule). Hence this is a completely justified Bayesian non-parametric analysis of this type of data that does not inject any new information into the problem. Modulo the footnote below, this is an amazing thing! Somewhat tangential to this laudable aim, they also remind us of Draper’s other 2015 paper, in which he shows that this type of Dirichlet process analysis can be computed almost exactly using the standard, embarrassingly parallel, Bootstrap. I think we all understand at this point that, at least for simple Bayesian problems, MCMC is (in a computer science sense) not a great way to solve large-scale problems. So this sort of computational discovery is of utmost importance!
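For readers who want to see what the vanishing-\alpha_0 limit looks like in practice, here is a minimal sketch. It relies on the standard conjugacy fact that, as \alpha_0 goes to zero, the Dirichlet process posterior puts Dirichlet(1, ..., 1) weights on the observed data points, which is Rubin's Bayesian bootstrap. To be clear about the assumptions: this is not necessarily the procedure in Draper's paper (which, as described above, uses the standard bootstrap as an approximation), and the function name, the data and the choice of the mean as the target functional are all hypothetical.

```python
import random


def bayesian_bootstrap_means(y, draws=2000, seed=0):
    """Posterior draws of the mean of F in the alpha_0 -> 0 limit (illustrative)."""
    rng = random.Random(seed)
    means = []
    for _ in range(draws):
        # Normalised Exp(1) variables give Dirichlet(1, ..., 1) weights.
        g = [rng.expovariate(1.0) for _ in y]
        total = sum(g)
        # The mean of the random CDF that puts weight g_i / total on y_i.
        means.append(sum(gi * yi for gi, yi in zip(g, y)) / total)
    return means


# Hypothetical data, for illustration only.
y = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
posterior_draws = bayesian_bootstrap_means(y)
posterior_mean = sum(posterior_draws) / len(posterior_draws)
print(round(posterior_mean, 2))
```

Each pass through the loop is independent of the others, which is the sense in which this kind of computation is embarrassingly parallel; nothing here needs a Markov chain.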

In the end, I love and hate (obviously more on the side of love) this paper. I’m fascinated by the content and I think this is a serious foundational contribution to Bayesian statistics. I just really wish the formatting was less terrible and that some of the (endless) remarks were converted and extended in text. Parts of the paper are extremely difficult to read due to formatting, or the authors’ occasional terse style. I would’ve appreciated a stronger link between plausibility functions and conditional probabilities (this stumped me the first time I read the paper), as well as a better discussion of exactly how “prior probabilities” exist in this Cox-Jaynes world. I assume I would have this by reading Jaynes’ book, but I unfortunately haven’t. I would’ve also liked the fascinating BNP example to have been its own section.

These minor criticisms are all leading towards one big question that I have upon reading this paper. (Not necessarily a question that this paper could answer, but certainly one that I’d love to hear TD’s thoughts on). What does a lack of information mean in the Cox-Jaynes universe? The example showed that one specific form of weak information (namely exchangeability) is incorporated nicely, but what does Cox-Jaynes say we should do when we have no opinion on the truth status of a set of propositions B? To my eye, this violates the spirit of A1, which explicitly forbids the σ-finite measures that would be necessary to answer this sort of question. Hence, I would much rather have the future work focus on relaxing A1 than relaxing the less exciting (and less practically limiting) A5.

In the end, this is an excellent paper on Cox-Jaynes-style probability, but I really feel that the extensions needed to encompass (or justify or negate) the current practice of applied statistics are more than just the three fairly academic Future Work suggestions contained in Remark 5.7. This is a great paper and it deserves the effort it takes to digest it.

1. Arguably, this is the “hand waving” bit of this argument—there is no infinite dimensional equivalent of the Lebesgue measure, so even an extremely small value of $\alpha_0$ gives an “informative” distribution. I assume this is a thing that BNP specialists can get around. Ideally this result would be (at least almost) independent of the choice of any “nice” net of probability measures on the space of CDFs converging to the non-existent vague measure. I’m going to guess that we will need some non-Gaussian version of an abstract Wiener Space [i.e. a nice embedding into a sensible larger space] to get through this measure theoretical nightmare. Maybe this is why people like Boolean systems!

[Warning: due to vacationing activities of a back-country type, mileage on the ‘Og may vary till JSM next week. Leaving readers more leeway to enjoy or tear apart the above…]

3 Responses to “Conditional love [guest post]”

  1. It’s strange to claim Jaynes restricted himself to finite spaces since most of his applications and examples involved infinite sets. What he did do is only consider well specified limits of finite sets, since those are the only infinities he or seemingly anyone else ever needs for real applications.

    Doing so instantly avoids any problems of the form: (1) assume an infinite limit already accomplished, but don’t specify how the limit was approached, (2) ask a question whose answer depends on how the limit was approached, (3) proclaim to the whole world you’ve discovered a paradox in statistics and/or proved Bayesian statistics is nonsense.

    • Let me say at the outset that I’m a tremendous fan of all of Jaynes’s work. However, it’s not obvious to me that his ‘finite sets policy’ induces complete rigor in his probability system in all cases in which he wants to be able to quantify uncertainty about uncountably infinitely many propositions, simultaneously, in a logically-internally-consistent manner. His beautiful book is filled with examples of this type; for instance, in his section 4.5 he builds a continuous CDF G ( . ) on the unit interval and invites us to evaluate G ( f ) for any 0 < f < 1, at which point we are making uncountably infinitely many probability assertions without having ‘snuck up on infinity’ in the usual Jaynesian ‘evaluate A_n and then gently let n get big’ manner.

      • (continuation) I’m about to see y = ( y_1, \dots, y_n ), each a real number, and (from problem context) my uncertainty about the y_i is exchangeable. de Finetti proved that one logically-internally-consistent way to express my predictive distribution is

        F ~ p ( F )
        ( y_i | F ) ~IID F

        in which p ( F ) is a prior on CDFs on \Re. Can I compute
        p ( F | y ) using only Jaynes’s ‘sneaking up on infinity’ approach, with any contextually-suitably-rich p ( F )? I’d like to know the answer to that.
