Frequency vs. probability
“Probabilities obtained by maximum entropy cannot be relevant to physical predictions because they have nothing to do with frequencies.” E.T. Jaynes, PT, p.366
“A frequency is a factual property of the real world that we measure or estimate. The phrase ‘estimating a probability’ is just as much an incongruity as ‘assigning a frequency’. The fundamental, inescapable distinction between probability and frequency lies in this relativity principle: probabilities change when we change our state of knowledge, frequencies do not.” E.T. Jaynes, PT, p.292
A few days ago, I had the following email exchange with Jelle Wybe de Jong from The Netherlands:
Q. I have a question regarding your slides of your presentation of Jaynes’ Probability Theory. You used the [above second] quote: Do you agree with this statement? It seems to me that a lot of ‘Bayesians’ still refer to ‘estimating’ probabilities. Does it make sense for example for a bank to estimate a probability of default for their loan portfolio? Or does it only make sense to estimate a default frequency and summarize the uncertainty (state of knowledge) through the posterior?
Indeed, I actually stressed this quote in my slides because the distinction did not make much sense to me. In my opinion, a frequency is a statistic, hence derived from the data. If the data is not yet observed, we (as Bayesians) can construct a predictive distribution for the incoming frequency. If the data is already observed, the frequency is also observed. Jaynes does not necessarily mean the same thing. In the above he mentions the ‘real world’, so this would mean, I think, that there exists a real parameter p (called factual frequency) that governs the probability distribution in a Bernoulli experiment. In what may be an over-interpretation of this quote, the notion of probability is restricted (by Jaynes) to the prior distribution on p and there is indeed no way one can “estimate the prior”. If, however, p is also called a probability (driving the distribution behind the data), then I see no problem with estimating p. Things are not completely clear from Jaynes’ writing either, because on the very same page he relates “observed frequency” and “Laplace’s succession rule”, which would then point to my earlier interpretation of the frequency as a statistic. Chapter 10 of Probability Theory (Physics of ‘random experiments’) seems to delve deeper into this distinction; however, I have not read this chapter deeply enough to draw a conclusion. Another possible interpretation is the one of a finite static world in which the ultimate frequency would be the frequency of the trait or pattern within the whole (static and immutable) population (as maybe with the set of all potential customers of the portfolio business), all binary experiments being then of the hypergeometric type… Just to summarise: I see no problem with estimating a parameter driving the probability distribution assumed on the data as long as point estimates are not the final answers. Estimation is one incomplete, if useful, summary of the posterior distribution.
Re. Just to be clear: I do not have problems with estimating the parameter p, but I would not call it a probability. To me, following Jaynes, it is a physical property of the ‘urn’ that contains all clients of the bank (not necessarily a static picture: it could also include a time dimension, although the question of course arises whether your observations are representative of this underlying population). Given the data, you want to infer something about the contents of the urn. And your final knowledge is expressed through the posterior distribution for p (which to me means proportion). Maybe it is just semantics (not unimportant though).
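To make the two readings of "frequency" concrete, here is a minimal sketch contrasting the observed frequency (a statistic) with the posterior on the Bernoulli parameter p. It assumes a conjugate Beta prior, with the uniform Beta(1, 1) as default, in which case the posterior mean is Laplace's succession rule (s + 1)/(n + 2). The loan counts and the function names are hypothetical and only serve the illustration.

```python
from fractions import Fraction

def observed_frequency(successes, n):
    """The frequency as a statistic: a function of the observed data only."""
    return Fraction(successes, n)

def beta_posterior(successes, n, a=1, b=1):
    """Parameters of the Beta posterior on p under a Beta(a, b) prior."""
    return a + successes, b + n - successes

def posterior_mean(successes, n, a=1, b=1):
    """Posterior mean of p; with a = b = 1 this is Laplace's succession rule."""
    a_post, b_post = beta_posterior(successes, n, a, b)
    return Fraction(a_post, a_post + b_post)

# Hypothetical data: 3 defaults observed among 50 loans.
freq = observed_frequency(3, 50)   # 3/50
laplace = posterior_mean(3, 50)    # (3 + 1)/(50 + 2) = 1/13
print(freq, laplace)
```

The point estimate is only one summary: the full answer in this sketch is the Beta(4, 48) posterior returned by `beta_posterior(3, 50)`.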
So this is indeed a question of semantics. To me, there is only one kind of probability (theory), the one defined according to Kolmogorov’s and Lebesgue’s principles: whether it applies to past or to future data, or to parameters, it is all the same.
Q. A related question: Suppose that a bank has some knowledge on the future relative default frequency of their loan portfolio through a posterior distribution for this default frequency. What practical decisions could the bank make on the basis of this information, as the probabilities of the default frequencies have no frequentist (i.e. observable) interpretation anymore (as in: in the ‘long run’ we expect that the actual default frequencies will equal the probability of default), but only summarize a state of knowledge? In other words, how can a connection be made between observable facts and the uncertainty of the probability of default as given by the probability distribution?
Again, there is a similar degree of uncertainty in the use of frequency in this question (or in the way I understand it). If the “future relative default frequency” is a future observation, a Bayesian approach would derive a predictive on this future realisation. Then, handling some side or prior information about the predictive distribution sounds like an inverse problem where the prior distribution on the parameter(s) of the “future relative default frequency” has to be constrained through the properties of the predictive. This may be a computational challenge (solving an integral functional equation) or even a mathematical impossibility (if the side information clashes with the shape of the predictive), but this is still conceivable. I am not sure, however, that this is the true meaning of the question.
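As an illustration of a predictive on a future default frequency, the following sketch assumes the current knowledge about p is summarised by a Beta(a, b) posterior and that m future loans default independently given p, so that the number of future defaults is beta-binomial. The values a = 4, b = 48, m = 20 and the function names are hypothetical.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def predictive_pmf(k, m, a, b):
    """P(k defaults among m future loans) when p ~ Beta(a, b) (beta-binomial)."""
    return comb(m, k) * exp(log_beta(a + k, b + m - k) - log_beta(a, b))

# Hypothetical current posterior Beta(4, 48) and m = 20 future loans.
a, b, m = 4, 48, 20
pmf = [predictive_pmf(k, m, a, b) for k in range(m + 1)]

# Predictive mean of the future frequency k/m; it equals a / (a + b).
mean_future_frequency = sum(k / m * w for k, w in enumerate(pmf))

# Predictive probability of seeing 3 or more defaults among the 20 loans.
prob_three_or_more = sum(w for k, w in enumerate(pmf) if k >= 3)
print(mean_future_frequency, prob_three_or_more)
```

Statements such as `prob_three_or_more` concern an observable quantity, which is one way a posterior on p connects back to facts that a bank can later check.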
Re. I was essentially asking how to evaluate the future performance of the estimator of p. In a frequentist setting you would have the sampling distribution, which would give you some idea about the error in repetitively applying the estimator. The posterior distribution doesn’t have such an interpretation, I think, and can only serve to compare competing hypotheses? The Bayesian probability assignments are the result of optimally processing the information you have, but to see the impact of using the estimator or back-testing it, you should probably make a connection to a frequency. (You could of course just calculate the sampling distribution for the Bayesian estimator?) Maybe this also connects to my lack of understanding of decision theory in a Bayesian setting: In the frequentist setting you minimize expected loss over all possible datasets: to me a clear, although maybe not that useful, concept. In the Bayesian setting you minimize expected loss wrt the posterior: I’m not sure what this means, as I’m not sure how the weighting with the posterior probabilities is to be interpreted. Probably, I have to think about this some more too; maybe I should buy your book, The Bayesian Choice!
This somehow reminds me of Samaniego’s book on the (frequentist?) comparison between Bayesian and frequentist procedures I reviewed for ISR. Using the posterior distribution, you can obviously describe the predictive distribution of future estimators of p, because they are functions of the current data (known and fixed) and of the yet-to-come data (described by the predictive). This is used in Bayesian design, for one thing. You can therefore construct a predictive distribution about the error of the (optimal) Bayesian estimator of p with a further m observations. Again, I am not sure I understand the meaning of “the error in repetitively applying the estimator”: a frequentist analysis would be useful only if you had to repeat (over and over) the estimation experiment with another portfolio corresponding to another p, but if you aim at the same portfolio with the same population and an increasing sample size, it is not appropriate because you consider the same p… Maybe indeed you should take a look at the second chapter of The Bayesian Choice to see motivations for running a posterior decision theoretic analysis.
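The predictive assessment of a future estimator can be sketched as follows, under assumptions that are not in the exchange itself: a Beta(a, b) current posterior, squared error loss (so that the Bayes estimator is the posterior mean), and m further conditionally independent Bernoulli observations. The numbers and function names are hypothetical.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def predictive_pmf(k, m, a, b):
    """P(k successes among m further observations) when p ~ Beta(a, b)."""
    return comb(m, k) * exp(log_beta(a + k, b + m - k) - log_beta(a, b))

def future_estimate(k, m, a, b):
    """Posterior mean of p after k successes among m further observations."""
    return (a + k) / (a + b + m)

def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

def expected_squared_error(m, a, b):
    """Predictive expectation of (future estimate - p)^2, averaging over the
    yet-to-come data (via the predictive) and over p (via the posterior)."""
    return sum(
        predictive_pmf(k, m, a, b) * beta_variance(a + k, b + m - k)
        for k in range(m + 1)
    )

# Hypothetical current posterior Beta(4, 48).
a, b = 4, 48
for m in (20, 100):
    mean_est = sum(predictive_pmf(k, m, a, b) * future_estimate(k, m, a, b)
                   for k in range(m + 1))
    print(m, mean_est, expected_squared_error(m, a, b))
```

In this conjugate setting the expected squared error has the closed form ab / ((a + b)(a + b + 1)(a + b + m)), and the predictive mean of the future estimate stays at the current posterior mean a / (a + b); the sketch computes both by direct summation over the predictive.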