Archive for Laplace succession rule

a response by Ly, Verhagen, and Wagenmakers

Posted in Statistics on March 9, 2017 by xi'an

Following my demise [of the Bayes factor], Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers wrote a very detailed response. Which I just saw the other day while in Banff. (If not in Schiphol, which would have been more appropriate!)

“In this rejoinder we argue that Robert’s (2016) alternative view on testing has more in common with Jeffreys’s Bayes factor than he suggests, as they share the same ‘‘shortcomings’’.”

Rather unsurprisingly (!), the authors agree with my position on the dangers of ignoring decisional aspects when using the Bayes factor. A point of dissension is the resolution of the Jeffreys[-Lindley-Bartlett] paradox. One consequence derived by Alexander and co-authors is that priors should change between testing and estimating. Because the parameters have a different meaning under the null and under the alternative, a point I agree with in that these parameters are indexed by the model [index!]. But with which I disagree when arguing that the same parameter (e.g., a mean under model M¹) should have two priors when moving from testing to estimation. To state that the priors within the marginal likelihoods “are not designed to yield posteriors that are good for estimation” (p.45) amounts to wishful thinking. I also do not find a strong justification within the paper or the response about choosing an improper prior on the nuisance parameter, e.g. σ, with the same constant. Another a posteriori validation in my opinion. However, I agree with the conclusion that the Jeffreys paradox prohibits the use of an improper prior on the parameter being tested (or of the test itself). A second point made by the authors is that Jeffreys’ Bayes factor is information consistent, which is correct but does not solve my quandary with the lack of precise calibration of the object, namely that alternatives abound in a non-informative situation.

“…the work by Kamary et al. (2014) impressively introduces an alternative view on testing, an algorithmic resolution, and a theoretical justification.”

The second part of the comments is highly supportive of our mixture approach and I obviously appreciate very much this support! Especially if we ever manage to turn the paper into a discussion paper! The authors also draw a connection with Harold Jeffreys’ distinction between testing and estimation, based upon Laplace’s succession rule. Unbearably slow succession law. Which is well-taken if somewhat specious since this is a testing framework where a single observation can send the Bayes factor to zero or +∞. (I further enjoyed the connection of the Poisson-versus-Negative Binomial test with Jeffreys’ call for common parameters. And the supportive comments on our recent mixture reparameterisation paper with Kaniav Kamary and Kate Lee.) The other point that the Bayes factor is more sensitive to the choice of the prior (beware the tails!) can be viewed as a plus for mixture estimation, as acknowledged there. (The final paragraph about the faster convergence of the weight α is not strongly

same data – different models – different answers

Posted in Books, Kids, Statistics, University life on June 1, 2016 by xi'an

An interesting question from a reader of the Bayesian Choice came out on X validated last week. It was about Laplace’s succession rule, which I found somewhat over-used, but it was nonetheless interesting because the question was about the discrepancy of the “non-informative” answers derived from two models applied to the data: a Hypergeometric distribution in the Bayesian Choice and a Binomial on Wikipedia. The originator of the question had trouble with the difference between those two “non-informative” answers as she or he believed that there was a single non-informative principle that should lead to a unique answer. This does not hold, even when following a reference prior principle like Jeffreys’ invariant rule or Jaynes’ maximum entropy tenets. For instance, the Jeffreys priors associated with the Binomial and the Negative Binomial distributions differ. And even less when considering that there is no unity in reaching those reference priors. (Not even mentioning the issue of the reference dominating measure for the definition of the entropy.) This led to an informative debate, which is the point of X validated.
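To make the lack of a unique “non-informative” answer concrete, here is a minimal sketch comparing three posterior means for a success probability after x successes in n trials. It is not taken from the X validated thread: the function names and the example values x=3, n=10 are illustrative assumptions, and the Hypergeometric model of the Bayesian Choice is not reproduced. The Negative Binomial case assumes sampling until x successes, for which the Jeffreys prior is proportional to p^(-1)(1-p)^(-1/2).

```python
from fractions import Fraction


def posterior_mean_beta(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def success_probability_estimates(x, n):
    """Posterior means of the success probability after x successes in n trials.

    Each entry pairs a sampling model with a 'non-informative' prior:
    - Binomial with a uniform Beta(1, 1) prior (Laplace's succession rule),
    - Binomial with the Jeffreys prior Beta(1/2, 1/2),
    - Negative Binomial (sampling until x successes, so x >= 1) with the
      Jeffreys prior proportional to p^(-1) (1-p)^(-1/2).
    """
    x, n = Fraction(x), Fraction(n)
    half = Fraction(1, 2)
    return {
        "laplace": posterior_mean_beta(x + 1, n - x + 1),
        "jeffreys_binomial": posterior_mean_beta(x + half, n - x + half),
        "jeffreys_negative_binomial": posterior_mean_beta(x, n - x + half),
    }


# Hypothetical data: 3 successes in 10 trials.
estimates = success_probability_estimates(x=3, n=10)
for name, value in estimates.items():
    print(f"{name}: {value} = {float(value):.4f}")
```

The same data thus produce 1/3, 7/22 and 2/7, depending on the sampling model and on the rule used to derive the prior.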

On a completely unrelated topic, the survey ship looking for the black boxes of the crashed EgyptAir plane is called the Laplace.

Mathematical underpinnings of Analytics (theory and applications)

Posted in Books, Statistics, University life on September 25, 2015 by xi'an

“Today, a week or two spent reading Jaynes’ book can be a life-changing experience.” (p.8)

I received this book by Peter Grindrod, Mathematical underpinnings of Analytics (theory and applications), from Oxford University Press, quite a while ago. (Not that long ago since the book got published in 2015.) As a book for review for CHANCE. And let it sit on my desk and in my travel bag for the same while as it was unclear to me that it was connected with Statistics and CHANCE. What is [are?!] analytics?! I did not find much of a definition of analytics when I at last opened the book, and even fewer mentions of statistics or machine-learning, but Wikipedia told me the following:

“Analytics is a multidimensional discipline. There is extensive use of mathematics and statistics, the use of descriptive techniques and predictive models to gain valuable knowledge from data—data analysis. The insights from data are used to recommend action or to guide decision making rooted in business context. Thus, analytics is not so much concerned with individual analyses or analysis steps, but with the entire methodology.”

Barring the absurdity of speaking of a “multidimensional discipline” [and even worse of linking with the mathematical notion of dimension!], this tells me analytics is a mix of data analysis and decision making. Hence relying on (some) statistics. Fine.

“Perhaps in ten years, time, the mathematics of behavioural analytics will be common place: every mathematics department will be doing some of it.”(p.10)

First, and to start with some positive words (!), a book that quotes both Friedrich Nietzsche and Patti Smith cannot get everything wrong! (Of course, including a most likely apocryphal quote from the now late Yogi Berra does not partake of this category!) Second, from a general perspective, I feel the book meanders its way through chapters towards a higher level of statistical consciousness, from graphs to clustering, to hidden Markov models, without precisely mentioning statistics or statistical model, while insisting very much upon Bayesian procedures and Bayesian thinking. Overall, I can relate to most items mentioned in Peter Grindrod’s book, but mostly by first reconstructing the notions behind. While I personally appreciate the distanced and often ironic tone of the book, reflecting upon the author’s experience in retail modelling, I am thus wondering at which audience Mathematical underpinnings of Analytics aims, for a practitioner would have a hard time jumping the gap between the concepts exposed therein and one’s practice, while a theoretician would require more formal and deeper entries on the topics broached by the book. I just doubt this entry will be enough to lead maths departments to adopt behavioural analytics as part of their curriculum…

an easy pun on conditional risk[cd]

Posted in Books, Kids, Statistics on September 22, 2013 by xi'an

Who’s #1?

Posted in Books, Kids, Statistics, University life on May 2, 2012 by xi'an

First, apologies for this teaser of a title! This post is not about who is #1 in whatever category you can think of, from statisticians to climbs [the Eiger Nordwand, to be sure!], to runners (Gebrselassie?), to books… (My daughter simply said “c’est moi!” when she saw the cover of this book on my desk.) So this is in fact a book review of… a book with this catchy title I received a month or so ago!

“We decided to forgo purely statistical methodology, which is probably a disappointment to the hardcore statisticians.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 225)

This book may be one of the most boring ones I have had to review so far! The reason for this disgruntled introduction to “Who’s #1? The Science of Rating and Ranking” by Langville and Meyer is that it has very little, if anything, to do with statistics and modelling. (And also that it is mostly about American football, a sport I am not even remotely interested in.) The purpose of the book is to present ways of building ratings and rankings within a population, based on pairwise numerical connections between some members of this population. The methods abound, at least eight are covered by the book, but they all suffer from the same drawback that they are connected to no grand truth, to no parameter from an underlying probabilistic model, to no loss function that would measure the impact of a “wrong” rating. (The closest it comes to this is when discussing spread betting in Chapter 9.) It is thus a collection of transformation rules, from matrices to ratings. I find this all the more disappointing in that there exists a branch of statistics called ranking and selection that specializes in this kind of problem and that statistics in sports is a quite active branch of our profession, witness the numerous books by Jim Albert. (Not to mention Efron’s analysis of baseball data in the 70’s.)

“First suppose that in some absolutely perfect universe there is a perfect rating vector.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 117)

The style of the book is disconcerting at first, and then some, as it sounds written partly from Internet excerpts (at least for most of the pictures) and partly from local student dissertations… The mathematical level is highly varying, in that the authors take pains to define what a matrix is (page 33), only to jump to the Perron-Frobenius theorem a few pages later (page 36). It also mentions Laplace’s succession rule (only justified as a shrinkage towards the center, i.e. away from 0 and 1), the Sinkhorn-Knopp theorem, the traveling salesman problem, Arrow and Condorcet, relaxation and evolutionary optimization, and even Kendall’s and Spearman’s rank tests (Chapter 16), even though no statistical model is involved. (Nothing as terrible as the completely inappropriate use of Spearman’s rho coefficient in one of Belfiglio’s studies…)

“Since it is hard to say which ranking is better, our point here is simply that different methods can produce vastly different rankings.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 78)

I also find irritating the association of “science” with “rating”, because the techniques presented in this book are simply tricks to turn pairwise comparisons into a general ordering of a population, nothing to do with uncovering ruling principles explaining the difference between the individuals. Since there is no validation for one ordering against another, we can see no rationality in proposing any of those, except to set a convention. The fascination of the authors for the Markov chain approach to the ranking problem is difficult to fathom as the underlying structure is not dynamical (there is no evolving ranking along games in this book) and the Markov transition matrix is just constructed to derive a stationary distribution, inducing a particular “Markov” ranking.

“The Elo rating system is the epitome of simple elegance.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 64)

An interesting input of the book is its description of the Elo ranking system used in chess, of which I did not know anything apart from its existence. Once again, there is a high degree of arbitrariness in the construction of the ranking, whose sole goal is to provide a convention upon which most people agree. A convention, mind, not a representation of truth! (This chapter contains a section on the Social Network movie, where a character writes a logistic transform on a window, missing the exponent. This should remind Andrew of someone he often refers to in his blog!)
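For readers who, like me, knew little of Elo beyond its existence, here is a minimal sketch of the logistic transform and rating update as commonly used in chess. It is not necessarily the version presented in the book: the scale of 400 and the K-factor of 32 are conventional values assumed here, and the function names are hypothetical. Those constants are exactly the kind of arbitrary convention discussed above.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of player A against player B (base-10 logistic curve)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a, rating_b, score_a, k=32.0, scale=400.0):
    """Return both updated ratings after one game.

    score_a is 1.0 for a win of A, 0.5 for a draw and 0.0 for a loss.
    The points gained by one player are lost by the other.
    """
    expected_a = elo_expected_score(rating_a, rating_b, scale)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Two equally rated players (assumed ratings), A wins.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```

Changing `k` or `scale` changes every rating without any data being able to contradict the choice, which is the sense in which the ranking is a convention rather than an estimate.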

“Perhaps the largest lesson is not to put an undue amount of faith in anyone’s rating.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 125)

In conclusion, I see little point in suggesting reading this book, unless one is interested in matrix optimization problems and/or illustrations in American football… Or unless one wishes to write a statistics book on the topic!

Frequency vs. probability

Posted in Statistics on May 6, 2011 by xi'an

“Probabilities obtained by maximum entropy cannot be relevant to physical predictions because they have nothing to do with frequencies.” E.T. Jaynes, PT, p.366

“A frequency is a factual property of the real world that we measure or estimate. The phrase `estimating a probability’ is just as much an incongruity as `assigning a frequency’. The fundamental, inescapable distinction between probability and frequency lies in this relativity principle: probabilities change when we change our state of knowledge, frequencies do not.” E.T. Jaynes, PT, p.292

A few days ago, I got the following email exchange with Jelle Wybe de Jong from The Netherlands:

Q. I have a question regarding your slides of your presentation of Jaynes’ Probability Theory. You used the [above second] quote: Do you agree with this statement? It seems to me that a lot of ‘Bayesians’ still refer to ‘estimating’ probabilities. Does it make sense for example for a bank to estimate a probability of default for their loan portfolio? Or does it only make sense to estimate a default frequency and summarize the uncertainty (state of knowledge) through the posterior?

Re-reading An Essay towards solving a Problem in the Doctrine of Chances

Posted in Books, Statistics on May 2, 2011 by xi'an

“We ought to estimate the chance that the probability for the happening of an event perfectly unknown, should lie between any two named degrees of probability, antecedently to any experiment made about it.” Letter of R. Price to J. Canton, Nov. 10, 1763

On a lazy and sunny Sunday afternoon, I re-read Thomas Bayes’ 1763 Essay. (It is available in LaTeX, courtesy of Peter Lee.) The major part of the Essay is actually written by Richard Price, Bayes’ contribution being from page 376 to page 399. Most of the introduction by Price (in the form of a letter to John Canton) rephrases Bayes’ findings, but he stresses that Bayes set a “sure foundation for all our reasonings concerning past facts”. In the spirit of the time, he cannot refrain from relating the uncovering of “fixt laws according to which events happened” to the “existence of the Deity”. He also perceives Bayes’ rule as “solving the converse problem” from De Moivre’s Laws of Chances. At last, he stresses that, although chance should relate to past events, while probability relates to future events, the distinction should not impact conditional probability.

“Given the number of times in which an unknown event has happened and failed; Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named.” Th. Bayes

The Essay itself consists in (a) a “brief demonstration of the general laws of chance”, (b) the derivation of Bayes’ posterior distribution for the uniform-binomial problem, (c) the computation of the posterior probability of an arbitrary interval. The first part is a rewording of De Moivre’s Laws of Chance, in particular recalling the definition of a conditional probability. Maybe the definition of the probability is worth quoting

5. The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed and the value of the thing expected upon it’s happening.

because it actually defines a probability as a by-product of the expected number of occurrences within a binomial experiment. (There is therefore nothing frequentist in this definition!) The main part (and the huge novelty) in the Essay is the derivation of the Beta posterior. Surprisingly, the setup is introduced very abruptly (in that nowhere before were those balls mentioned!):

Postulate. 1. Suppose the square table or plane ABCD to be so made and levelled, that if either of the balls o or W be thrown upon it, there shall be the same probability that it rests upon any one equal part of the plane as another, and that it must necessarily rest somewhere upon it.

and then the derivation starts with a two-page argument that the prior (uniform) cdf is the uniform cdf. The next result is Prop. 8 [388] that gives the joint probability that the binomial probability is between f and b and that the binomial experiment gives x=p:

the probability the point o should fall between f and b, any two points named in the line AB, and withall that the event M should happen p times and fail q in p+q trials, is the ratio of fghikmb, the part of the figure BghikmA intercepted between the perpendiculars fg, bm raised upon the line AB, to CA the square upon AB.

where the curve is y=x^p(1-x)^q.

The next proposition is then Bayes’ rule, still expressed in terms of surface ratio as above,

The same things supposed, I guess that the probability of the event M lies somewhere between 0 and the ratio of Ab to AB, my chance to be in the right is the ratio of Abm to AiB.

but clearly set within the Beta(p+1,q+1) distribution [in modern terms]. Bayes then inserts a scholium where he tries to justify the use of the uniform prior. However, I do not see the validity of the reasoning since he seems to argue in favour of a uniform distribution on the marginal distribution of the binomial experiment:

I have no reason to think that, in a certain number of trials, it should rather happen any one possible number of times than another.
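As a worked check in modern notation (it does not appear in the Essay, and n=p+q is my shorthand), a uniform prior on the binomial probability θ does produce the uniform marginal on the number of successes that the scholium describes:

```latex
P(M \text{ happens } p \text{ times in } n \text{ trials})
  = \int_0^1 {n\choose p}\,\theta^{p}(1-\theta)^{q}\,\mathrm{d}\theta
  = {n\choose p}\,\frac{p!\,q!}{(n+1)!}
  = \frac{1}{n+1}, \qquad n=p+q,\quad p=0,1,\ldots,n.
```

This only shows that the uniform prior implies the uniform marginal; whether a uniform marginal is a sound reason to adopt that prior is the point under discussion.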

The last part of the Essay per se is about deriving a closed form formula for the Beta integral, a feat achieved in Rule I. [399]

P(x\le \theta\le X|p,q) = (p+q+1){p+q\choose p}\left\{\frac{X^{p+1}}{p+1}-q\frac{X^{p+2}}{p+2}+\cdots\right.

\left.\cdots-\frac{x^{p+1}}{p+1}+q\frac{x^{p+2}}{p+2}-\cdots\right\}

in slightly more modern notations. The 18 remaining pages are written by Richard Price, who first reproduces Bayes’ approximations to the above integral with improvements of his own, then illustrates the performances of such approximations in specific cases, with the astounding fact that the probability covered by the approximation is centred at the MLE:

P(|\theta-p/(p+q)|\le z)

and not at the Bayes posterior mean. This could be extrapolated as one of the earliest confidence sets, except of course that the probability is over the parameter space. I note that Price also derives [409-410] as a consequence of Bayes’ calculations what is now known as Laplace’s succession rule…! Besides the derivation of the posterior distribution itself, which must be a considerable feat for the time, the attention to computational issues is highly commendable, as it would become a constant theme of Bayesian studies for centuries!!!
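As a small numerical companion to Rule I and to the succession rule, here is a sketch in exact rational arithmetic. It assumes integer counts p and q, expands (1-θ)^q by the binomial theorem to get the finite series, and normalises by the Beta(p+1,q+1) constant (p+q+1){p+q\choose p}. The function names are mine, not Bayes’ or Price’s.

```python
from fractions import Fraction
from math import comb


def beta_series(u, p, q):
    """Integral of t^p (1-t)^q over [0, u] for integers p, q >= 0.

    Term k is (-1)^k C(q, k) u^(p+k+1) / (p+k+1), i.e. the series
    u^(p+1)/(p+1) - q u^(p+2)/(p+2) + ...
    """
    u = Fraction(u)
    return sum(
        Fraction((-1) ** k * comb(q, k), p + k + 1) * u ** (p + k + 1)
        for k in range(q + 1)
    )


def normalising_constant(p, q):
    """Inverse of the Beta function B(p+1, q+1) for integer p and q."""
    return (p + q + 1) * comb(p + q, p)


def posterior_interval_probability(lower, upper, p, q):
    """Posterior probability that lower <= theta <= upper under a uniform prior."""
    return normalising_constant(p, q) * (
        beta_series(upper, p, q) - beta_series(lower, p, q)
    )


def posterior_mean(p, q):
    """Posterior mean of theta, computed from the same series."""
    return normalising_constant(p, q) * beta_series(1, p + 1, q)


# Hypothetical counts: p = 3 successes and q = 7 failures.
print(posterior_interval_probability(0, 1, 3, 7))
print(posterior_mean(3, 7), Fraction(3 + 1, 3 + 7 + 2))
```

The posterior mean coincides with the succession rule (p+1)/(p+q+2), while Price’s intervals are centred at p/(p+q), which is the contrast noted above.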