Archive for topology

the DeepMind debacle

Posted in Books, Statistics, Travel on August 19, 2017 by xi'an

“I hope for a world where data is at the heart of understanding and decision making. To achieve this we need better public dialogue.” Hetan Shah

As I was reading one of the Nature issues I brought on vacation, while the rain was falling on an aborted hiking day on the fringes of Monte Rosa, I came across a 20 July tribune by Hetan Shah, executive director of the RSS. A rare occurrence of a statistician's perspective in Nature. The event prompting this column is the ruling against the Royal Free London hospital group for providing patient data to DeepMind for predicting acute kidney injury. Without the patients' agreement. And with enough information to identify the patients. The issues raised by Hetan Shah are that data transfers should become open, and that they should be commensurate in volume and detail with the intended goals. And that public approval should be sought. While I know nothing about this specific case, I find the article overly critical of DeepMind, whose interest in health-related problems is certainly not pure and disinterested but can nonetheless contribute advances in (personalised) care and prevention through its expertise in machine learning. (Disclaimer: I have neither connection nor conflict with the company!) And I do not see exactly how public approval or dialogue can help in making progress in handling data, unless I am mistaken in my understanding of “the public”. The article mentions the launch of a UK project on data ethics, involving several [public] institutions like the RSS: this is certainly commendable and may improve how personal data is handled by companies, but I would not call this conglomerate representative of the public, which most likely does not really trust these institutions either…

Topological sensitivity analysis for systems biology

Posted in Books, Statistics, Travel, University life on December 17, 2014 by xi'an

Michael Stumpf sent me Topological sensitivity analysis for systems biology, written by Ann Babtie and Paul Kirk, as a preview before it came out in PNAS, and I read it during the trip to NIPS in Montréal. (The paper is published in open access, so everyone can read it now!) The topic is quite central to a lot of debates about climate change, economics, ecology, finance, &tc., namely assessing the impact of using the wrong model to draw conclusions and make decisions about a real phenomenon. (Which reminded me of the distinction between mechanical and phenomenological models stressed by Michael Blum in his NIPS talk.) And it is of much interest from a Bayesian point of view, since assessing the worth of a model requires modelling the “outside” of that model, using for instance Gaussian processes as in the talk Tony O’Hagan gave in Warwick earlier this term. I would even go as far as saying that the issue of assessing [and compensating for] how wrong a model is, given the available data, may be the (single) most under-assessed issue in statistics. We (statisticians) have yet to reach our Boxian era.
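To fix ideas on what “modelling the outside of a model” with Gaussian processes can mean, here is the standard discrepancy formulation in the spirit of Kennedy and O'Hagan (a generic textbook construction, not a formula taken from Babtie et al.): observations are the simulator output plus a GP-distributed model-error term plus noise,

```latex
% simulator \eta(x,\theta) with calibration parameter \theta,
% Gaussian-process discrepancy \delta(x) absorbing model error
y_i = \eta(x_i,\theta) + \delta(x_i) + \varepsilon_i,
\qquad \delta(\cdot) \sim \mathrm{GP}\bigl(0, k(\cdot,\cdot)\bigr),
\qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2),
```

so that the posterior on δ quantifies, at least where data are available, by how much the simulator misses.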

In Babtie et al., the space or universe of models is represented by network topologies, each defining the set of “parents” in a semi-Markov representation of the (dynamic) model. At which stage Gaussian processes are also called in for help. Alternative models are ranked in terms of fit according to a distance to data simulated from the original model (which sounds like a form of ABC?!). Obviously, there is a limitation in the number and variety of models considered this way: assumptions are still made on the possible models, and the number of such models grows quickly with the number of nodes. As pointed out in the paper (see, e.g., Fig. 4), the method has a parametric bootstrap flavour, to some extent.
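As a toy illustration of the general idea (and only that: the function names, the linear dynamics and the distance are mine, not the authors'), one can enumerate alternative parent sets for a tiny dynamical system and rank them by their distance to data simulated from a reference topology:

```python
import itertools
import numpy as np

def simulate(parents, x0, n_steps, rng, coupling=0.3, noise=0.01):
    """Toy linear discrete-time dynamics where node i is driven by its parents."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(n_steps):
        drive = np.array([coupling * sum(x[j] for j in parents[i])
                          for i in range(len(x))])
        x = 0.5 * x + drive + noise * rng.standard_normal(len(x))
        traj.append(x.copy())
    return np.array(traj)

def all_topologies(n_nodes):
    """Every assignment of at most one parent to each node."""
    choices = [()] + [(j,) for j in range(n_nodes)]
    return list(itertools.product(choices, repeat=n_nodes))

rng = np.random.default_rng(0)
reference = ((1,), (2,), ())          # node 0 <- node 1, node 1 <- node 2, node 2 free
data = simulate(reference, [1.0, 0.5, -0.5], 50, rng)

# rank alternative topologies by squared distance to the reference trajectory
scores = sorted(
    (np.sum((simulate(top, [1.0, 0.5, -0.5], 50, rng) - data) ** 2), top)
    for top in all_topologies(3)
)
for dist, top in scores[:5]:
    print(f"{top}  distance={dist:.3f}")
```

Since each topology is re-simulated with fresh noise, even the reference topology does not score exactly zero, which loosely echoes the parametric bootstrap flavour mentioned above.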

What is unclear is how one can conduct Bayesian inference with such a collection of models. Unless all models share the same “real” parameters, which sounds unlikely. The paper mentions using a uniform prior on all parameters, but this is difficult to advocate in a general setting. Another point concerns the quantification of how much one can trust a given model, since the models do not seem to be penalised by prior probabilities. Hence they are all treated identically. This is a limitation of the approach (or an indication that it is only a preliminary step in the evaluation of models), in that some models within a large enough collection will eventually produce estimates that differ from those of the other models, so the overall assessment may become altogether highly pessimistic for this very reason.
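For reference, the penalisation point can be phrased through the textbook posterior model probabilities (a standard identity, not a formula from the paper): with per-model evidences

```latex
p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\,\pi(\theta_k \mid M_k)\,\mathrm{d}\theta_k,
\qquad
\mathbb{P}(M_k \mid y)
  = \frac{p(y \mid M_k)\,\mathbb{P}(M_k)}{\sum_{j} p(y \mid M_j)\,\mathbb{P}(M_j)},
```

taking all the prior weights P(M_k) equal means every topology gets the same a priori credit and the comparison rests entirely on the evidences, which is precisely the “treated identically” issue above.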

“If our parameters have a real, biophysical interpretation, we therefore need to be very careful not to assert that we know the true values of these quantities in the underlying system, just because–for a given model–we can pin them down with relative certainty.”

In addition to its relevance for moving towards approximate models and approximate inference, and in continuation of yesterday’s theme, the paper calls for nested sampling to generate samples from the posterior(s) and to compute the evidence associated with each model. (I realised I had missed this earlier paper by Michael and co-authors on nested sampling for systems biology.) There is no discussion in the paper of why nested sampling was selected over, say, a random-walk Metropolis-Hastings algorithm, unless it is because it can be used in a fully automated way; the paper is rather terse on that issue… And running either approach on 10⁷ models for comparison sounds like an awful lot of work!!! Using importance [sampling] nested sampling as we proposed with Nicolas Chopin could be a way to speed up this exploration if all parameters are identical between all or most models.
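For readers unfamiliar with the evidence computation, here is a minimal sketch of plain (Skilling-style) nested sampling, not the implementation used by the authors; the brute-force rejection step drawing from the prior stands in for a proper constrained sampler, so it is only viable on toy problems:

```python
import numpy as np

def nested_sampling(log_likelihood, prior_sample, n_live=100, n_iter=500, rng=None):
    """Minimal nested-sampling sketch estimating log Z = log ∫ L(θ) π(θ) dθ."""
    rng = np.random.default_rng(rng)
    live = np.array([prior_sample(rng) for _ in range(n_live)])
    live_logl = np.array([log_likelihood(th) for th in live])
    log_z = -np.inf                                  # running log-evidence
    log_w = np.log(1.0 - np.exp(-1.0 / n_live))      # first shell of prior mass
    for _ in range(n_iter):
        worst = np.argmin(live_logl)                 # lowest-likelihood live point
        log_z = np.logaddexp(log_z, log_w + live_logl[worst])
        threshold = live_logl[worst]
        # replace it by a prior draw with higher likelihood (brute-force rejection)
        while True:
            cand = prior_sample(rng)
            cand_logl = log_likelihood(cand)
            if cand_logl > threshold:
                break
        live[worst], live_logl[worst] = cand, cand_logl
        log_w -= 1.0 / n_live                        # shrink the remaining prior mass
    # add the contribution of the remaining live points
    log_x = -n_iter / n_live
    for logl in live_logl:
        log_z = np.logaddexp(log_z, log_x - np.log(n_live) + logl)
    return log_z

# toy check: standard normal likelihood, Uniform(-5, 5) prior, so Z ≈ 0.1
log_lik = lambda th: -0.5 * th**2 - 0.5 * np.log(2 * np.pi)
prior_draw = lambda rng: rng.uniform(-5.0, 5.0)
print(nested_sampling(log_lik, prior_draw, rng=1))   # should be close to log(0.1) ≈ -2.3
```

The by-product weighted samples (the discarded points with weights proportional to w·L) are what provides posterior draws at the same time as the evidence, which is presumably part of the appeal for running it over many models.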