## graphe, graphons, graphez !

Posted in Books, pictures, Statistics, University life with tags , , , , , , on December 3, 2018 by xi'an

## clustering dynamical networks

Posted in pictures, Statistics, University life with tags , , , , , , , , , , on June 5, 2018 by xi'an

Yesterday I attended a presentation by Catherine Matias on dynamic graph structures, as she was giving a plenary talk at the 50th French statistical meeting, conveniently located a few blocks away from my office at ENSAE-CREST. In the nicely futuristic buildings of the EDF campus, which are supposed to represent cogs according to the architect, but which remind me more of these gas holders so common in the UK, at least in the past! (The E of EDF stands for electricity, but the original public company handled both gas and electricity.) This was primarily a survey of the field, which is much more diverse and multifaceted than I realised, even though I saw some recent developments by Antonietta Mira and her co-authors, as well as refereed a thesis on temporal networks at Ca’Foscari by Matteo Iacopini, which defence I will attend in early July. The difficulty in the approaches covered by Catherine stands with the amount and complexity of the latent variables induced by the models superimposed on the data. In her paper with Christophe Ambroise, she followed a variational EM approach. From the spectator perspective that is mine, I wondered at using ABC instead, which is presumably costly when the data size grows in space or in time. And at using tensor structures as in Mateo’s thesis. This reminded me as well of Luke Bornn’s modelling of basketball games following each player in real time throughout the game. (Which does not prevent the existence of latent variables.) But more vaguely and speculatively I also wonder at the meaning of the chosen models, which try to represent “everything” in the observed process, which seems doomed from the start given the heterogeneity of the data. While reaching my Keynesian pessimistic low- point, which happens rather quickly!, one could hope for projection techniques, towards reducing the dimension of the data of interest and of the parameter required by the model.

## normalising constants of G-Wishart densities

Posted in Books, Statistics with tags , , , , , , on June 28, 2017 by xi'an

Abdolreza Mohammadi, Hélène Massam, and Gérard Letac arXived last week a paper on a new approximation of the ratio of two normalising constants associated with two G-Wishart densities associated with different graphs G. The G-Wishart is the generalisation of the Wishart distribution by Alberto Roverato to the case when some entries of the matrix are equal to zero, which locations are associated with the graph G. While enjoying the same shape as the Wishart density, this generalisation does not enjoy a closed form normalising constant. Which leads to an intractable ratio of normalising constants when doing Bayesian model selection across different graphs.

Atay-Kayis and Massam (2005) expressed the ratio as a ratio of two expectations, and the current paper shows that this leads to an approximation of the ratio of normalising constants for a graph G against the graph G augmented by the edge e, equal to

Γ(½{δ+d}) / 2 √π Γ(½{δ+d+1})

where δ is the degree of freedom of the G-Wishart and d is the number of minimal paths of length 2 linking the two end points of e. This is remarkably concise and provides a fast approximation. (The proof is quite involved, by comparison.) Which can then be used in reversible jump MCMC. The difficulty is obviously in evaluating the impact of the approximation on the target density, as there is no manageable available alternative to calibrate the approximation. In a simulation example where such an alternative is available, the error is negligible though.

## untangling riddle

Posted in Statistics with tags , , on December 23, 2016 by xi'an

This week riddle is rather anticlimactic [my rewording]:

N wires lead from the top of a tower down to the ground floor. But they have become hopelessly tangled. The wires on the top floor are all labelled, from 1 to N. The wires on the bottom, however, are not.

You need to figure out which ground-floor wire’s end corresponds to which top-floor wire’s end. On the ground floor, you can tie two wire ends together, and you can tie as many of these pairs as you like. You can then walk to the top floor and use a circuit detector to check whether any two of the wires at the top of the tower are connected or not.

What is the smallest number of trips to the top of the tower you need to take in order to correctly label all the wires on the bottom?

Indeed, first, if N=2, the wires cannot be identified, if N=3, they can be identified in 2 steps, and if N=4, they can also be identified in 2 steps (only pair two wires first, which gives their values and the values of the other two up to a permutation, then pair one of each group). Now, in general,  it seems that, by attaching the wires two by two, one can successively cut the number of wires by two at each trip. This is due to the identification of pairs: if wires a and b are connected in one step, identifying the other end of a will automatically determine the other end of b if one keeps track of all connected pairs at the top. Hence, if $N=2^k$, in (k-2) steps, we are down to 4 wires and it takes an extra 2 steps to identify those 4, hence all the other wires. Which leads to identification in N trips. If N is not a power of 2, things should go faster as “left over” wires in the binary division can be paired with set-aside wires and hence identify 3 wires on that trip, then further along the way… Which leads to a $2^{\lfloor \log_2(N) \rfloor}$ number of trips maybe (or even less). Continue reading

## chain event graphs [RSS Midlands seminar]

Posted in pictures, Statistics, University life with tags , , , , , , , , , , on October 16, 2013 by xi'an

Last evening, I attended the RSS Midlands seminar here in Warwick. The theme was chain event graphs (CEG), As I knew nothing about them, it was worth my time listening to both speakers and discussing with Jim Smith afterwards. CEGs are extensions of Bayes nets with originally many more nodes since they start with the probability tree involving all modalities of all variables. Intensive Bayesian model comparison is then used to reduce the number of nodes by merging modalities having the same children or removing variables with no impact on the variable of interest. So this is not exactly a new Bayes net based on modality dummies as nodes (my original question). This is quite interesting, esp. in the first talk illustration of using missing value indicators as a supplementary variable (to determine whether or not data is missing at random). I also wonder how much of a connection there is with variable length Markov chains (either as a model or as a way to prune the tree). A last vague idea is a potential connection with lumpable Markov chains, a concept I learned from Kemeny & Snell (1960): a finite Markov chain is lumpable if by merging two or more of its states it remains a Markov chain. I do not know if this has ever been studied from a statistical point of view, i.e. testing for lumpability, but this sounds related to the idea of merging modalities of some variables in the probability tree…

## Medical illuminations [book review]

Posted in Books, pictures, Statistics with tags , , , , on September 27, 2013 by xi'an

Howard Wainer wrote another book, about to be published by Oxford University Press, called Medical Illuminations. (The book is announced for January 2 on amazon. A great New Year gift to be sure!) While I attended WSC 2013 in Hong Kong and then again at the RSS Annual Conference in Newcastle, I saw a preliminary copy of the book and asked the representative of OUP if I could get a copy for CHANCE (by any chance?!)… And they kindly sent me a copy the next day!

“This is an odd book (…) gallop[ing] off in all directions at once.” (p.152)

As can be seen from the cover, which reproduces the great da Vinci’s notebook page above (and seen also from the title where illuminations flirts with illuminated [manuscript]), the book focus on visualisation of medical data to “improve healthcare”. Its other themes are using evidence and statistical thinking towards the same goal. Since I was most impressed by the graphical part, I first thought of entitling the post as “Howard does his Tufte (before wondering at the appropriateness of such a title)!

“As hard as this may be to believe, this display is not notably worse than many of the others containd in this remarkable volume.” (p.78)

In fact, this first section is very much related with CHANCE in that a large sequence of graphics were submitted by CHANCE readers when Howard Wainer launched a competition in the magazine for improving upon a Nightingale-like representation by Burtin of antibiotics efficiency. It starts from a administrative ruling that the New York State Health Department had to publish cancer maps overlayed with potentially hazardous sites without any (interpretation) buffer. From there, Wainer shows how the best as well as the worst can be made of graphical representations of statistical data. It reproduces (with due mention) Tufte‘s selection of Minard‘s rendering of the Napoleonic Russian campaign as the best graph ever… The corresponding chapters of the book keep their focus on medical data, with some commentaries on the graphical quality of the 2008 National Healthcare Quality Report (ans.: could do better!). While this is well-done and with a significant message, I would still favour Tufte for teaching data users to present their findings in the most effective way. An interesting final chapter for the section is about “controlling creativity” where Howard Wainer follows in the steps of John Tukey about the Atlas of United States Mortality, And then shows a perfectly incomprehensible chart taken from Understanding USA, a not very premonitory title… (Besides Howard’s conclusion quoted above, you should also read the one-star comments on amazon!)

“Of course, it is impossible to underestimate the graphical skills of the mass media.” (p.164)

Section II is about a better use of statistics and of communicating those statistics towards improving healthcare, from fighting diabetes, to picking the right treatment for hip fractures (from an X-ray),  to re-evaluate detection tests (for breast and prostate cancers) as possibly very inefficient, and to briefly wonder about accelerated testing. And Section III tries to explain why progress (by applying the previous recommendations) has not been more steady. It starts with a story about the use of check-lists in intensive care and the dramatic impact on their effectiveness against infections. (The story hit home as I lost my thumb due to an infection while in intensive care! Maybe a check-list would have helped. Maybe.)  The next chapter contrasts the lack of progress in using check-lists with the adoption of the Korean alphabet in Korea, a wee unrelated example given the focus of the book. (Overall, I find most of the final chapters on the weak side of the book.)

This is indeed an odd book, with a lot of clever remarks and useful insights, but not so much with a driving line that would have made Wainer’s Medical Illuminations more than the sum of its components. Each section and most chapters (!) contain sensible recommendations for improving the presentation and exploitation of medical data towards practitioners and patients. I however wonder how much the book can impact the current state of affairs, like producing better tools for monitoring one’s own diabetes. So, in the end, I recommend the reading of Medical Illuminations as a very pleasant moment, from which examples and anecdotes can be borrowed for courses and friendly discussions. For non-statisticians, it is certainly a worthy entry on the relevance of statistical processing of (raw) data.

## random sudokus

Posted in Books, R, Statistics with tags , , , on June 4, 2013 by xi'an

In a paper arXived on Friday, Roberto Fontana relates the generation of Sudoku grids to the one of Latin squares (which is unsurprising) and to maximum cliques of a graph (more surprising). The generation of a random Latin square proceeds in three steps:

1. generate a random Latin square L with identity permutation matrix on symbol 1 (in practice, this implies building the corresponding graph and picking one of the largest cliques at random);
2. modify L into L’ using a random permutation of the symbols 2,…,n in L’;
3. modify L’ into L” by a random permutation of the columns of L’.

A similar result holds for Sudokus (with the additional constraint on the regions). However, while the result is interesting in its own right, it only covers full Sudokus, rather than partially filled Sudokus with a unique solution, whose random production could be more relevant. (Or maybe not, given that the difficulty matters.) [The code uses some R packages, but then moves to SAS, rather surprisingly.]