chain event graphs [RSS Midlands seminar]

Posted in pictures, Statistics, University life with tags , , , , , , , , , , on October 16, 2013 by xi'an

Last evening, I attended the RSS Midlands seminar here in Warwick. The theme was chain event graphs (CEG), As I knew nothing about them, it was worth my time listening to both speakers and discussing with Jim Smith afterwards. CEGs are extensions of Bayes nets with originally many more nodes since they start with the probability tree involving all modalities of all variables. Intensive Bayesian model comparison is then used to reduce the number of nodes by merging modalities having the same children or removing variables with no impact on the variable of interest. So this is not exactly a new Bayes net based on modality dummies as nodes (my original question). This is quite interesting, esp. in the first talk illustration of using missing value indicators as a supplementary variable (to determine whether or not data is missing at random). I also wonder how much of a connection there is with variable length Markov chains (either as a model or as a way to prune the tree). A last vague idea is a potential connection with lumpable Markov chains, a concept I learned from Kemeny & Snell (1960): a finite Markov chain is lumpable if by merging two or more of its states it remains a Markov chain. I do not know if this has ever been studied from a statistical point of view, i.e. testing for lumpability, but this sounds related to the idea of merging modalities of some variables in the probability tree…

Medical illuminations [book review]

Posted in Books, pictures, Statistics with tags , , , , on September 27, 2013 by xi'an

Howard Wainer wrote another book, about to be published by Oxford University Press, called Medical Illuminations. (The book is announced for January 2 on amazon. A great New Year gift to be sure!) While I attended WSC 2013 in Hong Kong and then again at the RSS Annual Conference in Newcastle, I saw a preliminary copy of the book and asked the representative of OUP if I could get a copy for CHANCE (by any chance?!)… And they kindly sent me a copy the next day!

“This is an odd book (…) gallop[ing] off in all directions at once.” (p.152)

As can be seen from the cover, which reproduces the great da Vinci’s notebook page above (and seen also from the title where illuminations flirts with illuminated [manuscript]), the book focus on visualisation of medical data to “improve healthcare”. Its other themes are using evidence and statistical thinking towards the same goal. Since I was most impressed by the graphical part, I first thought of entitling the post as “Howard does his Tufte (before wondering at the appropriateness of such a title)!

“As hard as this may be to believe, this display is not notably worse than many of the others containd in this remarkable volume.” (p.78)

In fact, this first section is very much related with CHANCE in that a large sequence of graphics were submitted by CHANCE readers when Howard Wainer launched a competition in the magazine for improving upon a Nightingale-like representation by Burtin of antibiotics efficiency. It starts from a administrative ruling that the New York State Health Department had to publish cancer maps overlayed with potentially hazardous sites without any (interpretation) buffer. From there, Wainer shows how the best as well as the worst can be made of graphical representations of statistical data. It reproduces (with due mention) Tufte‘s selection of Minard‘s rendering of the Napoleonic Russian campaign as the best graph ever… The corresponding chapters of the book keep their focus on medical data, with some commentaries on the graphical quality of the 2008 National Healthcare Quality Report (ans.: could do better!). While this is well-done and with a significant message, I would still favour Tufte for teaching data users to present their findings in the most effective way. An interesting final chapter for the section is about “controlling creativity” where Howard Wainer follows in the steps of John Tukey about the Atlas of United States Mortality, And then shows a perfectly incomprehensible chart taken from Understanding USA, a not very premonitory title… (Besides Howard’s conclusion quoted above, you should also read the one-star comments on amazon!)

“Of course, it is impossible to underestimate the graphical skills of the mass media.” (p.164)

Section II is about a better use of statistics and of communicating those statistics towards improving healthcare, from fighting diabetes, to picking the right treatment for hip fractures (from an X-ray),  to re-evaluate detection tests (for breast and prostate cancers) as possibly very inefficient, and to briefly wonder about accelerated testing. And Section III tries to explain why progress (by applying the previous recommendations) has not been more steady. It starts with a story about the use of check-lists in intensive care and the dramatic impact on their effectiveness against infections. (The story hit home as I lost my thumb due to an infection while in intensive care! Maybe a check-list would have helped. Maybe.)  The next chapter contrasts the lack of progress in using check-lists with the adoption of the Korean alphabet in Korea, a wee unrelated example given the focus of the book. (Overall, I find most of the final chapters on the weak side of the book.)

This is indeed an odd book, with a lot of clever remarks and useful insights, but not so much with a driving line that would have made Wainer’s Medical Illuminations more than the sum of its components. Each section and most chapters (!) contain sensible recommendations for improving the presentation and exploitation of medical data towards practitioners and patients. I however wonder how much the book can impact the current state of affairs, like producing better tools for monitoring one’s own diabetes. So, in the end, I recommend the reading of Medical Illuminations as a very pleasant moment, from which examples and anecdotes can be borrowed for courses and friendly discussions. For non-statisticians, it is certainly a worthy entry on the relevance of statistical processing of (raw) data.

random sudokus

Posted in Books, R, Statistics with tags , , , on June 4, 2013 by xi'an

In a paper arXived on Friday, Roberto Fontana relates the generation of Sudoku grids to the one of Latin squares (which is unsurprising) and to maximum cliques of a graph (more surprising). The generation of a random Latin square proceeds in three steps:

1. generate a random Latin square L with identity permutation matrix on symbol 1 (in practice, this implies building the corresponding graph and picking one of the largest cliques at random);
2. modify L into L’ using a random permutation of the symbols 2,…,n in L’;
3. modify L’ into L” by a random permutation of the columns of L’.

A similar result holds for Sudokus (with the additional constraint on the regions). However, while the result is interesting in its own right, it only covers full Sudokus, rather than partially filled Sudokus with a unique solution, whose random production could be more relevant. (Or maybe not, given that the difficulty matters.) [The code uses some R packages, but then moves to SAS, rather surprisingly.]

ugly graph of the day

Posted in Books, Statistics with tags , , on February 25, 2012 by xi'an

Julien pointed out to me this terrible graph, where variations cannot be perceived and where the circles are meaningless! Two three-point curves would have been way more explicit!!!

Back from Philly

Posted in R, Statistics, Travel, University life with tags , , , , , , , , , on December 21, 2010 by xi'an

The conference in honour of Larry Brown was quite exciting, with lots of old friends gathered in Philadelphia and lots of great talks either recollecting major works of Larry and coauthors or presenting fairly interesting new works. Unsurprisingly, a large chunk of the talks was about admissibility and minimaxity, with John Hartigan starting the day re-reading Larry masterpiece 1971 paper linking admissibility and recurrence of associated processes, a paper I always had trouble studying because of both its depth and its breadth! Bill Strawderman presented a new if classical minimaxity result on matrix estimation and Anirban DasGupta some large dimension consistency results where the choice of the distance (total variation versus Kullback deviance) was irrelevant. Ed George and Susie Bayarri both presented their recent work on g-priors and their generalisation, which directly relate to our recent paper on that topic. On the afternoon, Holger Dette showed some impressive mathematics based on Elfving’s representation and used in building optimal designs. I particularly appreciated the results of a joint work with Larry presented by Robert Wolpert where they classified all Markov stationary infinitely divisible time-reversible integer-valued processes. It produced a surprisingly small list of four cases, two being trivial.. The final talk of the day was about homology, which sounded a priori rebutting, but Robert Adler made it extremely entertaining, so much that I even failed to resent the powerpoint tricks! The next morning, Mark Low gave a very emotional but also quite illuminating about the first results he got during his PhD thesis at Cornell (completing the thesis when I was using Larry’s office!). Brenda McGibbon went back to the three truncated Poisson papers she wrote with Ian Johnstone (via gruesome 13 hour bus rides from Montréal to Ithaca!) and produced an illuminating explanation of the maths at work for moving from the Gaussian to the Poisson case in a most pedagogical and enjoyable fashion. Larry Wasserman explained the concepts at work behind the lasso for graphs, entertaining us with witty acronyms on the side!, and leaving out about 3/4 of his slides! (The research group involved in this project produced an R package called huge.) Joe Eaton ended up the morning with a very interesting result showing that using the right Haar measure as a prior leads to a matching prior, then showing why the consequences of the result are limited by invariance itself. Unfortunately, it was then time for me to leave and I will miss (in both meanings of the term) the other half of the talks. Especially missing Steve Fienberg’s talk for the third time in three weeks! Again, what I appreciated most during those two days (besides the fact that we were all reunited on the very day of Larry’s birthday!) was the pain most speakers went to to expose older results in a most synthetic and intuitive manner… I also got new ideas about generalising our parallel computing paper for random walk Metropolis-Hastings algorithms and for optimising across permutation transforms.

Le Monde puzzle [48: resolution]

Posted in R, Statistics with tags , , , on December 4, 2010 by xi'an

The solution to puzzle 48 given in Le Monde this weekend is rather direct (which makes me wonder why the solution for 6 colours is still unavailable..) Here is a quick version of the solution: Continue reading

Le Monde puzzle [48]

Posted in R, Statistics with tags , , , , , , on December 2, 2010 by xi'an

This week(end), the Le Monde puzzle can be (re)written as follows (even though it is presented as a graph problem):

Given a square 327×327 symmetric matrix A, where each non-diagonal entry is in {1,2,3,4,5} and $\text{diag}(A)=0$, does there exist a triplet (i,j,k) such that

$a_{ij} = a_{jk} = a_{ki}$

Solving this problem in R is very easy. Or appears to be. We can indeed create a random matrix A and check whether or not any of the five triple indicator matrices

$(A==c)^3\,,\qquad c=1,\ldots,5,$

has a non-zero diagonal entry. Indeed, since $B=A^3$ satisfies

$b_{uu} = \sum_{v,w} a_{uv}a_{vw}a_{wu}$

there is a non-zero entry iff there exists a triplet (u,v,w) such that the product $a_{uv}a_{vw}a_{wu}$ is different from zero. Here is the R code:

chec=1
for (mc in 1:10^3){

#random filled matrix
A=matrix(sample(1:5,(357)^2,rep=TRUE),357)
A=A*upper.tri(A)+t(A*upper.tri(A))

agr=0
for (t in 1:5){
B=(A==t)
agr=max(agr,max(diag(B%*%B%*%B)>0))
if (agr==1){ break()}
}

chec=min(chec,agr)
if (chec==0){ break()}
}


I did run the above code and did not find any case where no triplet was sharing the same number. Neither did I for 326, 325,… But Robin Ryder told me that this is a well-known problem in graph theory that goes under the name of Ramsey’s problem and that 327 is an upper bound on the number of nodes for the existence of the triplet with 5 colours/numbers. So this is another illustration of a case when crude simulation cannot exhibit limiting cases in order to void a general property, because of the number of possible cases. To improve the chances of uncovering the boudary value, we would need a simulation that dis-favours triplets with the same number. I am also curious to see Le Monde solution tomorrow, since finding Ramsey’s numbers seems to be a hard problem, no solution being provided for 6 colours/numbers.

Incidentally, I wonder if there is a faster way to produce a random symmetric matrix than the cumbersome

A=matrix(sample(1:5,(357)^2,rep=TRUE,357)
A=A*upper.tri(A)+t(A*upper.tri(A))


Using the alternative saving on the lower triangular part

A=matrix(sample(1:5,(357)^2,rep=TRUE,357)
A[upper.tri(A)]=sample(1:5,357*178,rep=TRUE)
A=A*upper.tri(A)+t(A*upper.tri(A))


certainly takes longer…