paradoxes in scientific inference: a reply from the author
(I received the following set of comments from Mark Chang after publishing a review of his book on the ‘Og. Here they are, verbatim, except for a few editing and spelling changes. It’s a huge post as Chang reproduces all of my comments as well.)
Professor Christian Robert reviewed my book, “Paradoxes in Scientific Inference”. I found that the majority of his criticisms were unfounded, based on a truncated way of reading the text. I give point-by-point responses below. For clarity, I have kept his original comments.
Robert’s Comments: This CRC Press book was sent to me for review in CHANCE: Paradoxes in Scientific Inference is written by Mark Chang, vice-president of AMAG Pharmaceuticals. The topic of scientific paradoxes is one of my primary interests and I have learned a lot by looking at the Lindley-Jeffreys and Savage-Dickey paradoxes. However, I did not find a renewed sense of excitement when reading the book. The very first (and maybe the best!) paradox with Paradoxes in Scientific Inference is that it is a book from the future! Indeed, its copyright year is 2013 (!), although I got it a few months ago. (Not mentioning here the cover mimicking Escher’s “paradoxical” pictures with dice, a sculpture due to Shigeo Fukuda that is apparently not credited in the book. As I do not want to get into another dice cover polemic, I will abstain from further comments!)
Thank you, Robert, for reading and commenting on part of my book. I had the same question about the copyright year being 2013 when the book was actually published the previous year. I believe the same thing has happened with my other books too; the incorrect year causes confusion for future citations. The cover was designed by the publisher. They gave me a few options and I picked the one with dice. I was told that the publisher holds the copyright for the artwork; I am not aware of the original artist.
Robert’s Comments: Now, getting into a deeper level of criticism (!), I find the book very uneven and overall quite disappointing. (Even missing in its statistical foundations.) Esp. given my initial level of excitement about the topic!
The book is intended for a broad audience beyond statisticians, one that also includes general scientists. I noticed the unevenness of the presentation despite my huge and sincere effort; one of the manuscript reviewers, who had experience writing this type of book on paradoxes, shared the same view. The nature of paradoxes is that they cut across several subject fields and several layers of difficulty.
Re: your comment “Even missing in its statistical foundations”? Unfortunately, I have found that many of the comments are simple misinterpretations of what is said in the book. Though the comments appear to be very specific, I am afraid they come from a truncated way of reading and lack an in-depth understanding.
Robert’s Comments: First, there is a tendency to turn everything into a paradox: obviously, when writing a book about paradoxes, everything looks like a paradox! This means bringing into the picture every paradox known to man and then some, i.e., things that are either un-paradoxical (e.g., Gödel’s incompleteness result) or uninteresting in a scientific book (e.g., the birthday paradox, which may be surprising but is far from a paradox!). Fermat’s theorem is also quoted as a paradox, even though there is nothing in the text indicating in which sense it is a paradox. (Or is it because it is simple to express, hard to prove?!) Similarly, Brownian motion is considered a paradox, as “reconcil[ing] the paradox between two of the greatest theories of physics (…): thermodynamics and the kinetic theory of gases” (p.51) For instance, the author considers the MLE being biased to be a paradox (p.117), while omitting the much more substantial “paradox” of the non-existence of unbiased estimators of most parameters—which simply means unbiasedness is irrelevant. Or the other even more puzzling “paradox” that the secondary MLE derived from the likelihood associated with the distribution of a primary MLE may differ from the primary. (My favourite!)
As defined in the first paragraph of the preface, a paradox is… or something counter-intuitive. Some paradoxes are more attractive to certain readers than to others, and vice versa. The paradoxes were carefully selected to strike a balance between different readers, and a large number were screened out. For instance, the majority of paradoxes in mathematics, biology, physics, and chemistry were not included. More than 50% of the paradoxes in the book “Paradoxes from A to Z” (about 85 in total) were not included. For statistics, the majority of the paradoxes in “Paradoxes in Probability Theory and Mathematical Statistics”, which involve statistical details and formulations, were not included; for the same reason, I took only a few examples from “Counterexamples in Probability”. On the other hand, I included Fermat’s theorem because it was a conjecture so counter-intuitive that people tried for centuries in vain to prove or disprove it. Nevertheless, it is one of the most non-paradoxical ones in the book (especially after it became a theorem). Robert misread (again, a superficial interpretation due to cursory reading!!) why Brownian motion is a paradox. I did not say it is called a paradox because of “reconciling the paradox between…”. In fact, in the next paragraph on p.50-51, it is clearly stated: “Brownian motion has some fantastic paradoxical properties, and here are three:” (1) it is everywhere continuous but nowhere differentiable (I cannot imagine how to draw such a curve by hand), (2) it is a one-dimensional and also a two-dimensional motion, and (3) it has the fractal self-similarity property.
Everyone is familiar with the biasedness of some MLEs. But it is paradoxical (counter-intuitive) to me, even if there were only one biased MLE in the world: how could something real in the world occur most likely (maximum likelihood) in a biased way? How can a thing most likely occur not in its true way? How can a truth be biased from the truth, even if there is only one such thing? The question is not about how many MLEs are biased!
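Whatever side one takes in this debate, the underlying fact is easy to check numerically: for a normal sample, the MLE of the variance divides by n rather than n-1, so its expectation falls short of the true variance by a factor of (n-1)/n. A minimal Monte Carlo sketch (the sample size, seed, and trial count are illustrative choices, not taken from the book):

```python
import random

random.seed(0)

def mle_variance(xs):
    """MLE of the variance: divide by n, not n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# true variance is 1.0, but the average MLE over many samples of size 5
# settles near (n - 1) / n = 0.8
n, trials = 5, 200_000
avg = sum(mle_variance([random.gauss(0, 1) for _ in range(n)])
          for _ in range(trials)) / trials

print(round(avg, 3))   # close to 0.8, not 1.0
```

The bias vanishes as n grows, which is why the usual reply is that the MLE is still consistent; the dispute above is over whether the finite-sample bias deserves the name paradox.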
Robert’s Comments: “When the null hypothesis is rejected, the p-value is the probability of the type I error.” Paradoxes in Scientific Inference (p.105)
“The p-value is the conditional probability given H.” Paradoxes in Scientific Inference (p.106)
Second, the depth of the statistical analysis in the book is often missing. For instance, Simpson’s paradox is not analysed from a statistical perspective, only reported as a fact. Sticking to statistics, take for instance the discussion of Lindley’s paradox. The author seems to think that the problem is with the different conclusions produced by the frequentist, likelihood, and Bayesian analyses (p.122). This is completely wrong: Lindley’s (or Lindley-Jeffreys’s) paradox is about the lack of significance of Bayes factors based on improper priors. Similarly, when the likelihood ratio test is introduced, the reference threshold is given as equal to 1 and no mention is later made of compensating for different degrees of freedom/against over-fitting. The discussion about p-values is equally garbled, witness the above quote which (a) conditions upon the rejection and (b) ignores the dependence of the p-value on a realized random variable.
It is true that the mathematical and statistical details were kept to a minimum, as I stated in the preface: “Although I have tried to keep mathematical formulations minimal, I have not totally eliminated them, so as to avoid mathematical anxiety that might result from either approach”. Apparently I could not help but cause such anxiety!!
There are two different definitions of Lindley’s paradox in circulation. The first (the one I use) is about the conflicting conclusions reached by frequentists and Bayesians; the second is about the lack of significance of Bayes factors based on improper priors (Robert’s definition). The first definition is, in my opinion, the more popular of the two. Let me just list a few sources for the first definition: Glenn Shafer (professor from Stanford University), Journal of the American Statistical Association, Vol. 77, No. 378 (Jun., 1982), pp. 325-334; Green and Elgersma (2010), cited on p.124 of the book; and Wikipedia. I also searched Google for the definition: over 30 of the initial links used the first definition, the one I have used, while none of them used the second one (Robert’s definition).
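For readers who want to see the first definition in action, the frequentist/Bayesian disagreement can be reproduced in a few lines. The sketch below assumes the standard textbook setup (a point null against a N(0, 1) prior on the alternative, equal prior weights, data mean at the two-sided 5% boundary); these modelling choices are illustrative, not taken from the book:

```python
import math

def posterior_prob_h0(n, z=1.96, prior_h0=0.5):
    """P(H0 | data) when xbar sits exactly at the z = 1.96 boundary.

    Model: xbar ~ N(theta, 1/n); H0: theta = 0; H1: theta ~ N(0, 1).
    """
    xbar = z / math.sqrt(n)
    # marginal density of xbar under H0: N(0, 1/n)
    m0 = math.sqrt(n / (2 * math.pi)) * math.exp(-n * xbar**2 / 2)
    # marginal density of xbar under H1: N(0, 1 + 1/n)
    v1 = 1 + 1 / n
    m1 = math.exp(-xbar**2 / (2 * v1)) / math.sqrt(2 * math.pi * v1)
    odds = (prior_h0 / (1 - prior_h0)) * (m0 / m1)
    return odds / (1 + odds)

for n in (10, 1000, 100_000):
    print(n, round(posterior_prob_h0(n), 3))
# the p-value is 0.05 at every n, yet P(H0 | data) grows toward 1
```

The frequentist rejects H0 at the 5% level for every n, while the posterior probability of H0 approaches 1 as n grows: exactly the clash of conclusions the "first definition" describes.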
Regarding the p-value, I did mention (on p.105 of the book) the dependence on a realization of the random variable: “… p-value, which is defined as the (least upper bound of the) probability of getting data the same as or more extreme than the observed data when the null hypothesis Ho is true.” Again, Robert read this in a truncated way. The actual text reads: “…we can measure the strength of the evidence against Ho using the p-value, which is defined as the (least upper bound of the) probability of getting data the same as or more extreme than the observed data when the null hypothesis Ho is true. The p-value will be compared with a nominal threshold α (e.g., 0.05) to determine if the null hypothesis should be rejected. When the null hypothesis is rejected, the p-value is the probability of the type I error (or type I error rate).” My intention is to emphasize (1) that a “true” false positive (type-I) error can only happen when the null hypothesis is true and rejected; and (2) that the p-value can be associated with the “true” type-I error rate if we reject the null hypothesis based on the p-value calculated from the observed data x (the probability of getting data the same as or more extreme than this x when the null hypothesis is true and the experiment is repeated); that is, the false rejection rate (type-I error rate) is equal to the p-value. Such an interpretation is important in practice because when a clinical trial is done and the null hypothesis is rejected, we are often asked by physicians and the like: what does the p-value (e.g., 0.001) mean in terms of the type-I error rate? What is the difference, in terms of type-I error, between a p-value of 0.001 and one of 0.01? On the other hand, if the p-value is larger (e.g., 0.4) and the null hypothesis is not rejected, the type-I error is not a concern (it does not actually occur). I have tried hard in the book to make it easy for non-statistical readers and accurate enough for statistically sophisticated readers too.
Apparently, what I have done is not enough. In particular, some phrases, when taken out of context, can be misleading.
Robert’s Comments: “The peaks of the likelihood function indicate (on average) something other than the distribution associated with the drawn sample. As such, how can we say the likelihood is evidence supporting the distribution?” Paradoxes in Scientific Inference (p.119)
The chapter on statistical controversies actually focuses on the opposition between frequentist, likelihood, and Bayesian paradigms. The author seems to have studied Mayo and Spanos’ Error and Inference at great length. (As I did, as I did!) He spends around twenty pages in Chapter 3 on this opposition and on the conditionality, sufficiency, and likelihood principles that were reunited by Birnbaum and recently deconstructed by Mayo. In my opinion, Chang makes a mess of describing the issues at stake in this debate and leaves the reader more bemused at the end than at the beginning of the chapter. For instance, the conditionality principle is confused with the p-value being computed conditional on the null (hypothesis) model (p.110). Or the selected experiment being unknown (p.110). The likelihood function is considered as a sufficient statistic (p.137). The “paradox” of an absence of non-trivial sufficient statistics in all models but exponential families (the Pitman-Koopman lemma) is not mentioned. The fact that ancillary statistics bring information about the precision of a sufficient statistic is presented as a paradox (p.112). Having the same physical parameter θ is confused with having the same probability distribution indexed by θ, which is definitely not the same thing (p.115)! The likelihood principle is confused with the likelihood ratio test (p.117) and with the maximum likelihood estimation (witness the above quote). The dismissal of Mayo’s rejection of Birnbaum’s proof—a rejection I fail to understand—is not any clearer: “her statement about the sufficient statistic under a mixed distribution (a fixed distribution) is irrelevant” (p.138). This actually made me think of another interpretation of Mayo’s argument that could prove her right! More on that in another post.
The criticism about my confusing the likelihood principle with the p-value is again simply wrong. Consider, for example, the paradox of the conditionality principle on p.110: it is set in the mixed-experiment setting, as later described consistently and in detail on p.137 for the Birnbaum experiment (with which Robert is familiar). The different experiments E’ and E’’ can indeed correspond to Ho and Ha, respectively. This is a philosophical point and needs clarification: different values of a physical parameter are always associated with different experiments, at least for a frequentist (a parameter is considered a fixed value by a frequentist). For a Bayesian scientist, a parameter (indexed by θ) that is studied with an experiment can have a distribution. I would call such a parameter θ a representation of the physical parameter in the knowledge space, or simply the experimenter’s knowledge of the corresponding physical parameter. Because it is knowledge of a parameter, it incorporates, explicitly or implicitly, knowledge from other things (e.g., results from experiments with different populations/subjects), and in this sense it is sourced from different physical parameters. Such knowledge pooling is covered by what I call the causal space in the book: when we make any statement with probability, the statement can be viewed as a statement about an aggregative property of a group of similar things in a causal space.
The phrase “… since the likelihood itself is a sufficient statistic” on p.137 was obviously left there unintentionally. I apologize for the error and the inconvenience caused to the reader(s), and I thank Robert for pointing it out.
There is no confusion between having the same physical parameter θ and having the same probability distribution indexed by θ. In fact, the paradox is precisely a discussion of the controversies around “the same distribution indexed by θ” in the likelihood principle. Let me offer some further clarification. The meaning of “the same probability distribution indexed by θ” is not well defined, because we can consider virtually any two different distributions as the same distribution indexed by θ. For instance, suppose f(θ) and g(θ) are two distributions (hypothesis testing, Ho: θ=0 and Ha: θ=1); we can say f(θ) and g(θ) are the same distribution F(θ) indexed by θ, where F(θ) = θf(θ) + (1−θ)g(θ) + θ(1−θ)k(θ) and k(θ) is an appropriate random function. The example given on p.115-116 of the book involves the binomial distribution f(θ) and the negative binomial distribution g(θ). Furthermore, θ does need to be a related physical parameter when we use the likelihood principle; otherwise, we could subtract an arbitrary constant from θ in g(θ), so that in the new F(θ) = θf(θ) + (1−θ)g(θ+c) + θ(1−θ)k(θ) the index θ in f(θ) and g(θ+c) could have completely different meanings.
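The binomial versus negative binomial pairing mentioned here is the classic example, and the proportionality of the two likelihoods is easy to verify. A short sketch using the usual textbook numbers (3 successes in 12 Bernoulli trials; these specific values are the standard illustration, not necessarily the book's exact ones):

```python
from math import comb

def lik_binomial(theta, n=12, x=3):
    """Likelihood when n = 12 trials were fixed in advance."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def lik_negbinomial(theta, n=12, x=3):
    """Likelihood when sampling continued until the x-th success.

    The last trial must be the x-th success, so we choose the other
    x - 1 successes among the first n - 1 trials.
    """
    return comb(n - 1, x - 1) * theta**x * (1 - theta)**(n - x)

for theta in (0.1, 0.25, 0.5, 0.9):
    print(round(lik_binomial(theta) / lik_negbinomial(theta), 6))
# ratio is comb(12, 3) / comb(11, 2) = 4.0 at every theta
```

Since the ratio is a constant in θ, the likelihood principle says the two experiments carry the same evidence about θ, even though their frequentist p-values for a test on θ differ: this is what makes the example a battleground.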
I do not know where the reviewer got the impression that I confused the likelihood ratio test with the likelihood principle. I did not say or imply anywhere that the MLE or the likelihood ratio test is a consequence of the likelihood principle. I have explicitly written, under separate subsections, Section 3.2.3, Likelihood principle (p.113), and Section 3.2.4, Law of likelihood (p.117). The reviewer may have confused the likelihood principle with the law of likelihood, which are two totally distinct things. I used the law of likelihood (not the likelihood principle!) to introduce the likelihood ratio test. In the paradox of the likelihood principle (p.117) and the paradox of the law of likelihood (p.117), I challenged (at least in some cases) the popular interpretation that a likelihood function represents relative plausibility, and I used biased MLEs as the discussion point to further challenge the likelihood principle (not in the way most frequentists do, from the type-I error point of view!). I did not use the MLE directly to challenge the likelihood principle, since the two are in a sense irrelevant to each other, but used the biasedness of some MLEs to raise the controversy. Specifically, in the text on p.117 and p.119, the paradox of the likelihood principle and the paradox of the law of likelihood raise this controversy: the peak of a likelihood function (the corresponding MLE) is not associated with an unbiased estimate in the case of biased estimators; thus the likelihood function does not appear to be (at least in those cases) a measure of the relative plausibility of different values of the parameter, which in turn makes the likelihood principle (which concerns the ratio of likelihood functions, not the likelihood ratio test!) questionable.
It is really disappointing to learn that the reviewer thought I was confused about such basic and fundamental concepts as the MLE, the likelihood ratio test, and the likelihood principle.
Robert’s Comments: “From a single observation x from a normal distribution with unknown mean μ and standard deviation σ it is possible to create a confidence interval on μ with finite length.” Paradoxes in Scientific Inference (p.103)
One of the first paradoxes in the statistics chapter is the one endorsed by the above quote. I found it intriguing that this interval could be of the form x±η|x| with η only depending on the confidence coverage… Then I checked and saw that the confidence coverage was defined by default, i.e., the actual coverage is at least the nominal coverage, which is much less exciting (and much less paradoxical).
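The behaviour Robert checked can be reproduced numerically. In the sketch below the interval is x ± c|x| with c = 5 (an illustrative value, not one from the book); its coverage depends only on the ratio λ = μ/σ, and the minimum of that coverage over λ is the guaranteed (default) coverage described above:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def coverage(lam, c=5.0):
    """Coverage of x +/- c|x| for x ~ N(mu, sigma^2), lam = mu/sigma, c > 1.

    mu is covered iff |z| <= c|lam + z| with z = (x - mu)/sigma; for
    lam != 0 the failure region works out to
    [-c|lam|/(c - 1), -c|lam|/(c + 1)] (up to sign symmetry in lam).
    """
    if lam == 0:
        return 1.0              # |z| <= c|z| always holds when c >= 1
    lam = abs(lam)
    fail = phi(-c * lam / (c + 1)) - phi(-c * lam / (c - 1))
    return 1 - fail

# minimise coverage over a grid of lam = mu/sigma values
min_cov = min(coverage(l / 100) for l in range(500))
print(round(min_cov, 3))   # worst-case coverage stays above 0.9 for c = 5
```

So the interval has finite length, coverage 1 at μ = 0, and a strictly positive worst-case coverage everywhere else: picking c large enough makes the worst case meet any nominal level, which is the "at least the nominal coverage" reading Robert settles on.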
“One of the proudest accomplishments of my childhood was creating an electric bell, though later I found it was just a reinvention. Other reinventions I remember are discovering some of the interesting properties of the number 9 and the solution for a general quadratic equation.“ Paradoxes in Scientific Inference (p.24)
The book abounds in quotes like the above, where the author does not shy away from promoting himself. For instance, on page 2, he adds his own quotes to a list of aphorisms from major figures like Montaigne, Lao-Tzu, or Picasso. Take also the gem “I will feel so rewarded if this book can help a young reader in some way to become a thinker” (p.viii) The author further claims several times to bring a unification of the frequentist and Bayesian perspectives, even though I fail to see how he did it. E.g., “whether frequentist or Bayesian, concepts of probability are based on the collection of similar phenomena or experiments” (p.63) does not bring a particularly clear answer. Similarly, the murky discussion of the Monty Hall dilemma does not characterise the distinction between frequentist and Bayesian reasoning (if anything, this is a frequentist setting). A last illustration is the ‘paradox of posterior distributions’ (p.124) where Cheng got it plain wrong about the sequential update of a prior distribution not being equal to the final posterior (see, e.g., Section 1.4 in The Bayesian Choice). A nice quote is recycled from my book though (a completely irrelevant anecdote is that George Casella actually hated this quote!):
“If you believe anything happens (…) for a reason, then samples may never be independent, else there would be no randomness. Just as T. Hilberman [sic] put it (Robert 1994): “From where we stand, the rain seems random. If we could stand somewhere else, we would see the order in it.” Paradoxes in Scientific Inference (p.140)
I think this criticism is totally out of context and uncalled for, and it feels a little harsh since it is made at a personal level. I admit that I do some self-promoting, as many people do. Publishing a book is itself self-promoting; sometimes I announce my upcoming books at conferences, and that is self-promoting!! But I do not see how the text on p.24 (quoted by Robert above) can be self-promoting. During the ten-year “Cultural Revolution” in China, anyone who dared to do any research should be proud of himself or herself, because it went against the social norm and the authorities and could be punished (there was no law at that time). How can someone who reinvented something as simple as an electric bell, or a solution to quadratic equations that were invented or solved hundreds or thousands of years ago, be self-promoting? If you read the sentence that follows in the same paragraph, you would know what I am really proud of: “…However, I am still proud of myself even knowing they are just reinventions because I did them all in the ‘dark age’ of China: the 10 years of the Chinese Cultural Revolution.” I was just reliving a moment from some 40 years ago. The quote “Help… become thinker” may or may not be considered self-promoting in context; I fully understand where the criticism came from, and I should have phrased it differently. Working in industry, productivity is often over-emphasized, and as a result many young industry colleagues do not have enough opportunity or time to think in their work and be more creative. “A statistician should be a thinker, not a sample size calculator” was my quote to them. I hope you now know what I meant by the term “thinker”.
The quotes I put on page 2 are the ones I found interesting and most relevant. They are not all aphorisms from major figures, as you would know if you saw the name Shaw (Marvin C. Shaw; I found the quote on the Web). I still do not know who Shaw is, even after researching again a few days ago, soon after reading Robert’s comments; all I know so far is that he or she has published two books about religion on Amazon, with no affiliation provided. I put my full name (instead of just my last name) under my quotes to clearly identify myself. Readers are not so naïve as to think I am on a par with such great personalities and leaders just because my name appears alongside theirs in my own book; I would not expect my readers to be at that low a level of intelligence. Quoting oneself in one’s own book is a funny way of writing, but not necessarily self-promoting.
Regarding the criticism about the sequential update of a prior distribution not being equal to the final posterior, it is not specific enough for me to comment on (my name is misspelled here as Cheng instead of Chang). By the way, Robert cites his own book twice here, but that is by no means considered self-promoting. In fact, I read his book (1st ed., 1994); it is a nice book and it made me think (not that I am a thinker yet).
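For the record, the mathematical point Robert refers to is a standard identity of Bayesian updating: conditioning on observations one at a time yields exactly the same posterior as conditioning on all of them at once. A minimal conjugate Beta-Bernoulli sketch with hypothetical data:

```python
# hypothetical coin flips (1 = success); any 0/1 sequence works
data = [1, 0, 1, 1, 0, 1, 1, 1]

# batch update: Beta(a0, b0) -> Beta(a0 + successes, b0 + failures)
a0, b0 = 1, 1
batch = (a0 + sum(data), b0 + len(data) - sum(data))

# sequential update: absorb each observation's likelihood in turn
a, b = a0, b0
for x in data:
    a, b = a + x, b + (1 - x)
seq = (a, b)

print(batch, seq, batch == seq)   # (7, 3) (7, 3) True
```

The equality holds in general (the joint likelihood factorizes, so the order of conditioning is immaterial), not only in conjugate families; the conjugate case just makes the bookkeeping visible.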
Regarding the unification of the different statistical paradigms, it rests mainly on the concept of the causal space, briefly touched on earlier. Whether I have indeed contributed anything here, or whether it is an overstatement, I look forward to hearing more from other readers.
Robert’s Comments: Most surprisingly, the book contains exercises in every chapter, whose purpose is lost on me. What is the point in asking to students “Write an essay on the role of the Barber’s Paradox in developing modern set theory” or “How does the story of Achilles and the tortoise address the issues of the sum of infinite numbers of arbitrarily small numbers”..?! Not to mention the top one: “Can you think of any applications from what you have learned from this chapter?” Erm…frankly, no!
I quite understand this criticism. The exercises were added at the last minute, at the suggestion of one of the initial reviewers. Some exercises are more interesting than others, depending on the reader’s interests. Real-world applications of paradoxes in accounting, computer science, transportation, electronic network design, etc., are clearly presented in the book. Please be aware that this book is intended for a broader audience beyond just statisticians; the level of mathematics and statistics was considerably reduced from the initial draft to incorporate the comments of the manuscript reviewers.
Finally, I truly thank Professor Robert for reading the book and for his straightforward criticisms. I feel the review would have been more helpful had he read more carefully and not misinterpreted several central concepts of the book.