Archive for Simpson’s paradox

a Simpson paradox of sorts

Posted in Books, Kids, pictures, R with tags , , , , , , , , , on May 6, 2016 by xi'an

The riddle from The Riddler this week is about finding an undirected graph with N nodes and no isolated node such that the number of nodes with more connections than the average of their neighbours is maximal. A representation of a connected graph is through a matrix X of zeros and ones, on which one can spot the nodes satisfying the above condition as the positive entries of the vector (X1)^2-(X^21), if 1 denotes the vector of ones. I thus wrote an R code aiming at optimising this target

targe <- function(F){
  sum(F%*%F%*%rep(1,N)/(F%*%rep(1,N))^2<1)}

by mere simulated annealing:

rate <- function(N){ 
# generate matrix F
# 1. no single 
F=matrix(0,N,N) 
F[sample(2:N,1),1]=1 
F[1,]=F[,1] 
for (i in 2:(N-1)){ 
if (sum(F[,i])==0) 
F[sample((i+1):N,1),i]=1 
F[i,]=F[,i]} 
if (sum(F[,N])==0) 
F[sample(1:(N-1),1),N]=1 
F[N,]=F[,N] 
# 2. more connections 
F[lower.tri(F)]=F[lower.tri(F)]+
  sample(0:1,N*(N-1)/2,rep=TRUE,prob=c(N,1)) 
F[F>1]=1
F[upper.tri(F)]=t(F)[upper.tri(t(F))]
#simulated annealing
T=1e4
temp=N
targo=targe(F)
for (t in 1:T){
  #1. local proposal
  nod=sample(1:N,2)
  prop=F
  prop[nod[1],nod[2]]=prop[nod[2],nod[1]]=
     1-prop[nod[1],nod[2]]
  while (min(prop%*%rep(1,N))==0){
    nod=sample(1:N,2)
    prop=F
    prop[nod[1],nod[2]]=prop[nod[2],nod[1]]=
     1-prop[nod[1],nod[2]]}
  target=targe(prop)
  if (log(runif(1))*temp<target-targo){ 
    F=prop;targo=target} 
#2. global proposal 
  prop=F prop[lower.tri(prop)]=F[lower.tri(prop)]+
   sample(c(0,1),N*(N-1)/2,rep=TRUE,prob=c(N,1)) 
prop[prop>1]=1
  prop[upper.tri(prop)]=t(prop)[upper.tri(t(prop))]
  target=targe(prop)
  if (log(runif(1))*temp<target-targo){
      F=prop;targo=target}
   temp=temp*.999
   }
return(F)}

Eward SimpsonThis code returns quite consistently (modulo the simulated annealing uncertainty, which grows with N) the answer N-2 as the number of entries above average! Which is rather surprising in a Simpson-like manner since all entries but two are above average. (Incidentally, I found out that Edward Simpson recently wrote a paper in Significance about the Simpson-Yule paradox and him being a member of the Bletchley Park Enigma team. I must have missed out the connection with the Simpson paradox when reading the paper in the first place…)

paradoxes in scientific inference

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on November 23, 2012 by xi'an

This CRC Press book was sent to me for review in CHANCE: Paradoxes in Scientific Inference is written by Mark Chang, vice-president of AMAG Pharmaceuticals. The topic of scientific paradoxes is one of my primary interests and I have learned a lot by looking at Lindley-Jeffreys and Savage-Dickey paradoxes. However, I did not find a renewed sense of excitement when reading the book. The very first (and maybe the best!) paradox with Paradoxes in Scientific Inference is that it is a book from the future! Indeed, its copyright year is 2013 (!), although I got it a few months ago. (Not mentioning here the cover mimicking Escher’s “paradoxical” pictures with dices. A sculpture due to Shigeo Fukuda and apparently not quoted in the book. As I do not want to get into another dice cover polemic, I will abstain from further comments!)

Now, getting into a deeper level of criticism (!), I find the book very uneven and overall quite disappointing. (Even missing in its statistical foundations.) Esp. given my initial level of excitement about the topic!

First, there is a tendency to turn everything into a paradox: obviously, when writing a book about paradoxes, everything looks like a paradox! This means bringing into the picture every paradox known to man and then some, i.e., things that are either un-paradoxical (e.g., Gödel’s incompleteness result) or uninteresting in a scientific book (e.g., the birthday paradox, which may be surprising but is far from a paradox!). Fermat’s theorem is also quoted as a paradox, even though there is nothing in the text indicating in which sense it is a paradox. (Or is it because it is simple to express, hard to prove?!) Similarly, Brownian motion is considered a paradox, as “reconcil[ing] the paradox between two of the greatest theories of physics (…): thermodynamics and the kinetic theory of gases” (p.51) For instance, the author considers the MLE being biased to be a paradox (p.117), while omitting the much more substantial “paradox” of the non-existence of unbiased estimators of most parameters—which simply means unbiasedness is irrelevant. Or the other even more puzzling “paradox” that the secondary MLE derived from the likelihood associated with the distribution of a primary MLE may differ from the primary. (My favourite!)

When the null hypothesis is rejected, the p-value is the probability of the type I error.Paradoxes in Scientific Inference (p.105)

The p-value is the conditional probability given H0.” Paradoxes in Scientific Inference (p.106)

Second, the depth of the statistical analysis in the book is often found missing. For instance, Simpson’s paradox is not analysed from a statistical perspective, only reported as a fact. Sticking to statistics, take for instance the discussion of Lindley’s paradox. The author seems to think that the problem is with the different conclusions produced by the frequentist, likelihood, and Bayesian analyses (p.122). This is completely wrong: Lindley’s (or Lindley-Jeffreys‘s) paradox is about the lack of significance of Bayes factors based on improper priors. Similarly, when the likelihood ratio test is introduced, the reference threshold is given as equal to 1 and no mention is later made of compensating for different degrees of freedom/against over-fitting. The discussion about p-values is equally garbled, witness the above quote which (a) conditions upon the rejection and (b) ignores the dependence of the p-value on a realized random variable. Continue reading