## down with Galton (and Pearson and Fisher…)

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , on July 22, 2019 by xi'an

In the last issue of Significance, which I read in Warwick prior to the conference, there is a most interesting article on Galton’s eugenics, his heritage at University College London (UCL), and the overall trouble with honouring prominent figures of the past with memorials like named building or lectures… The starting point of this debate is a protest from some UCL students and faculty about UCL having a lecture room named after the late Francis Galton who was a professor there. Who further donated at his death most of his fortune to the university towards creating a professorship in eugenics. The protests are about Galton’s involvement in the eugenics movement of the late 18th and early 19th century. As well as professing racist opinions.

My first reaction after reading about these protests was why not?! Named places or lectures, as well as statues and other memorials, have a limited utility, especially when the named person is long dead and they certainly do not contribute in making a scientific theory [associated with the said individual] more appealing or more valid. And since “humans are [only] humans”, to quote Stephen Stigler speaking in this article, it is unrealistic to expect great scientists to be perfect, the more if one multiplies the codes for ethical or acceptable behaviours across ages and cultures. It is also more rational to use amphitheater MS.02 and lecture room AC.18 rather than associate them with one name chosen out of many alumni’s or former professors’.

Predictably, another reaction of mine was why bother?!, as removing Galton’s name from the items it is attached to is highly unlikely to change current views on eugenism or racism. On the opposite, it seems to detract from opposing the present versions of these ideologies. As some recent proposals linking genes and some form of academic success. Another of my (multiple) reactions was that as stated in the article these views of Galton’s reflected upon the views and prejudices of the time, when the notions of races and inequalities between races (as well as genders and social classes) were almost universally accepted, including in scientific publications like the proceedings of the Royal Society and Nature. When Karl Pearson launched the Annals of Eugenics in 1925 (after he started Biometrika) with the very purpose of establishing a scientific basis for eugenics. (An editorship that Ronald Fisher would later take over, along with his views on the differences between races, believing that “human groups differ profoundly in their innate capacity for intellectual and emotional development”.) Starting from these prejudiced views, Galton set up a scientific and statistical approach to support them, by accumulating data and possibly modifying some of these views. But without much empathy for the consequences, as shown in this terrible quote I found when looking for more material:

“I should feel but little compassion if I saw all the Damaras in the hand of a slave-owner, for they could hardly become more wretched than they are now…”

As it happens, my first exposure to Galton was in my first probability course at ENSAE when a terrific professor was peppering his lectures with historical anecdotes and used to mention Galton’s data-gathering trip to Namibia, literally measure local inhabitants towards his physiognomical views , also reflected in the above attempt of his to superpose photographs to achieve the “ideal” thief…

## The Seven Pillars of Statistical Wisdom [book review]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , , , , on June 10, 2017 by xi'an

I remember quite well attending the ASA Presidential address of Stephen Stigler at JSM 2014, Boston, on the seven pillars of statistical wisdom. In connection with T.E. Lawrence’s 1926 book. Itself in connection with Proverbs IX:1. Unfortunately wrongly translated as seven pillars rather than seven sages.

As pointed out in the Acknowledgements section, the book came prior to the address by several years. I found it immensely enjoyable, first for putting the field in a (historical and) coherent perspective through those seven pillars, second for exposing new facts and curios about the history of statistics, third because of a literary style one would wish to see more often in scholarly texts and of a most pleasant design (and the list of reasons could go on for quite a while, one being the several references to Jorge Luis Borges!). But the main reason is to highlight the unified nature of Statistics and the reasons why it does not constitute a subfield of either Mathematics or Computer Science. In these days where centrifugal forces threaten to split the field into seven or more disciplines, the message is welcome and urgent.

Here are Stephen’s pillars (some comments being already there in the post I wrote after the address):

1. aggregation, which leads to gain information by throwing away information, aka the sufficiency principle. One (of several) remarkable story in this section is the attempt by Francis Galton, never lacking in imagination, to visualise the average man or woman by superimposing the pictures of several people of a given group. In 1870!
2. information accumulating at the √n rate, aka precision of statistical estimates, aka CLT confidence [quoting  de Moivre at the core of this discovery]. Another nice story is Newton’s wardenship of the English Mint, with musing about [his] potential exploiting this concentration to cheat the Mint and remain undetected!
3. likelihood as the right calibration of the amount of information brought by a dataset [including Bayes’ essay as an answer to Hume and Laplace’s tests] and by Fisher in possible the most impressive single-handed advance in our field;
4. intercomparison [i.e. scaling procedures from variability within the data, sample variation], from Student’s [a.k.a., Gosset‘s] t-test, better understood and advertised by Fisher than by the author, and eventually leading to the bootstrap;
5. regression [linked with Darwin’s evolution of species, albeit paradoxically, as Darwin claimed to have faith in nothing but the irrelevant Rule of Three, a challenging consequence of this theory being an unobserved increase in trait variability across generations] exposed by Darwin’s cousin Galton [with a detailed and exhilarating entry on the quincunx!] as conditional expectation, hence as a true Bayesian tool, the Bayesian approach being more specifically addressed in (on?) this pillar;
6. design of experiments [re-enters Fisher, with his revolutionary vision of changing all factors in Latin square designs], with an fascinating insert on the 18th Century French Loterie,  which by 1811, i.e., during the Napoleonic wars, provided 4% of the national budget!;
7. residuals which again relate to Darwin, Laplace, but also Yule’s first multiple regression (in 1899), Fisher’s introduction of parametric models, and Pearson’s χ² test. Plus Nightingale’s diagrams that never cease to impress me.

The conclusion of the book revisits the seven pillars to ascertain the nature and potential need for an eight pillar.  It is somewhat pessimistic, at least my reading of it was, as it cannot (and presumably does not want to) produce any direction about this new pillar and hence about the capacity of the field of statistics to handle in-coming challenges and competition. With some amount of exaggeration (!) I do hope the analogy of the seven pillars that raises in me the image of the beautiful ruins of a Greek temple atop a Sicilian hill, in the setting sun, with little known about its original purpose, remains a mere analogy and does not extend to predict the future of the field! By its very nature, this wonderful book is about foundations of Statistics and therefore much more set in the past and on past advances than on the present, but those foundations need to move, grow, and be nurtured if the field is not to become a field of ruins, a methodology of the past!

## random walk on a torus [riddle]

Posted in Books, Kids, pictures with tags , , , , , , , , , on September 16, 2016 by xi'an

The Riddler of this week(-end) has a simple riddle to propose, namely given a random walk on the {1,2,…,N} torus with a ⅓ probability of death, what is the probability of death occurring at the starting point?

The question is close to William Feller’s famous Chapter III on random walks. With his equally famous reflection principle. Conditioning on the time n of death, which as we all know is definitely absorbing (!), the event of interest is a passage at zero, or any multiple of N (omitting the torus cancellation), at time n-1 (since death occurs the next time). For a passage in zero, this does not happen if n is even (since n-1 is odd) and else it is a Binomial event with probability

${n \choose \frac{n-1}{2}} 2^{-n}$

For a passage in kN, with k different from zero, kN+n must be odd and the probability is then

${n \choose \frac{n-1+kN}{2}} 2^{-n}$

which leads to a global probability of

$\sum_{n=0}^\infty \dfrac{2^n}{3^{n+1}} \sum_{k=-\lfloor (n-1)/N \rfloor}^{\lfloor (n+1)/N \rfloor} {n \choose \frac{n-1+kN}{2}} 2^{-n}$

i.e.

$\sum_{n=0}^\infty \dfrac{1}{3^{n+1}} \sum_{k=-\lfloor (n-1)/N \rfloor}^{\lfloor (n+1)/N \rfloor} {n \choose \frac{n-1+kN}{2}}$

Since this formula is rather unwieldy I looked for another approach in a métro ride [to downtown Paris to enjoy a drink with Stephen Stiegler]. An easier one is to allocate to each point on the torus a probability p[i] to die at position 1 and to solve the system of equations that is associated with it. For instance, when N=3, the system of equations is reduced to

$p_0=1/3+2/3 p_1, \quad p_1=1/3 p_0 + 1/3 p_1$

which leads to a probability of ½ to die at position 0 when leaving from 0. When letting N grows to infinity, the torus structure no longer matters and the probability of dying at position 0 implies returning in position 0, which is a special case of the above combinatoric formula, namely

$\sum_{m=0}^\infty \dfrac{1}{3^{2m+1}} {2m \choose m}$

which happens to be equal to

$\dfrac{1}{3}\,\dfrac{1}{\sqrt{1-4/9}}=\dfrac{1}{\sqrt{5}}\approx 0.4472$

as can be [unnecessarily] checked by a direct R simulation. This √5 is actually the most surprising part of the exercise!

## the philosophical importance of Stein’s paradox [a reply from the authors]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , on January 15, 2016 by xi'an

[In the wake of my comment on this paper written by three philosophers of Science, I received this reply from Olav Vassend.]

Thank you for reading our paper and discussing it on your blog! Our purpose with the paper was to give an introduction to Stein’s phenomenon for a philosophical audience; it was not meant to — and probably will not — offer a new and interesting perspective for a statistician who is already very familiar with Stein’s phenomenon and its extensive literature.

I have a few more specific comments:

1. We don’t rechristen Stein’s phenomenon as “holistic pragmatism.” Rather, holistic pragmatism is the attitude to frequentist estimation that we think is underwritten by Stein’s phenomenon. Since MLE is sometimes admissible and sometimes not, depending on the number of parameters estimated, the researcher has to take into account his or her goals (whether total accuracy or individual-parameter accuracy is more important) when picking an estimator. To a statistician, this might sound obvious, but to philosophers it’s a pretty radical idea.

2. “The part connecting Stein with Bayes again starts on the wrong foot, since it is untrue that any shrinkage estimator can be expressed as a Bayes posterior mean. This is not even true for the original James-Stein estimator, i.e., it is not a Bayes estimator and cannot be a Bayes posterior mean.”

That seems to depend on what you mean by a “Bayes estimator.” It is possible to have an empirical Bayes prior (constructed from the sample) whose posterior mean is identical to the original James-Stein estimator. But if you don’t count empirical Bayes priors as Bayesian, then you are right.

3. “And to state that improper priors “integrate to a number larger than 1” and that “it’s not possible to be more than 100% confident in anything”… And to confuse the Likelihood Principle with the prohibition of data dependent priors. And to consider that the MLE and any shrinkage estimator have the same expected utility under a flat prior (since, if they had, there would be no Bayes estimator!).”

I’m not sure I completely understand your criticisms here. First, as for the relation between the LP and data-dependent priors — it does seem to me that the LP precludes the use of data-dependent priors.  If you use data from an experiment to construct your prior, then — contrary to the LP — it will not be true that all the information provided by the experiment regarding which parameter is true is contained in the likelihood function, since some of the information provided by the experiment will also be in your prior.

Second, as to our claim that the ML estimator has the same expected utility (under the flat prior) as a shrinkage prior that it is dominated by—we incorporated this claim into our paper because it was an objection made by a statistician who read and commented on our paper. Are you saying the claim is false? If so, we would certainly like to know so that we can revise the paper to make it more accurate.

4. I was aware of Rubin’s idea that priors and utility functions (supposedly) are non-separable, but I didn’t (and don’t) quite see the relevance of that idea to Stein estimation.

5. “Similarly, very little of substance can be found about empirical Bayes estimation and its philosophical foundations.”

What we say about empirical Bayes priors is that they cannot be interpreted as degrees of belief; they are just tools. It will be surprising to many philosophers that priors are sometimes used in such an instrumentalist fashion in statistics.

6. The reason why we made a comparison between Stein estimation and AIC was two-fold: (a) for sociological reasons, philosophers are much more familiar with model selection than they are with, say, the LASSO or other regularized regression methods. (b) To us, it’s precisely because model selection and estimation are such different enterprises that it’s interesting that they have such a deep connection: despite being very different, AIC and shrinkage both rely on a bias-variance trade-off.

7. “I also object to the envisioned possibility of a shrinkage estimator that would improve every component of the MLE (in a uniform sense) as it contradicts the admissibility of the single component MLE!”

I don’t think our suggestion here contradicts the admissibility of single component MLE. The idea is just that if we have data D and D’ about parameters φ and φ’, then the estimates of both φ and φ’ can sometimes be improved if the estimation problems are lumped together and a shrinkage estimator is used. This doesn’t contradict the admissibility of MLE, because MLE is still admissible on each of the data sets for each of the parameters.

Again, thanks for reading the paper and for the feedback—we really do want to make sure our paper is accurate, so your feedback is much appreciated. Lastly, I apologize for the length of this comment.

Olav Vassend

## the philosophical importance of Stein’s paradox

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , on November 30, 2015 by xi'an

I recently came across this paper written by three philosophers of Science, attempting to set the Stein paradox in a philosophical light. Given my past involvement, I was obviously interested about which new perspective could be proposed, close to sixty years after Stein (1956). Paper that we should actually celebrate next year! However, when reading the document, I did not find a significantly innovative approach to the phenomenon…

The paper does not start in the best possible light since it seems to justify the use of a sample mean through maximum likelihood estimation, which only is the case for a limited number of probability distributions (including the Normal distribution, which may be an implicit assumption). For instance, when the data is Student’s t, the MLE is not the sample mean, no matter how shocking that might sounds! (And while this is a minor issue, results about the Stein effect taking place in non-normal settings appear much earlier than 1998. And earlier than in my dissertation. See, e.g., Berger and Bock (1975). Or in Brandwein and Strawderman (1978).)

While the linear regression explanation for the Stein effect is already exposed in Steve Stigler’s Neyman Lecture, I still have difficulties with the argument in that for instance we do not know the value of the parameter, which makes the regression and the inverse regression of parameter means over Gaussian observations mere concepts and nothing practical. (Except for the interesting result that two observations make both regressions coincide.) And it does not seem at all intuitive (to me) that imposing a constraint should improve the efficiency of a maximisation program… Continue reading