Archive for bioinformatics

ABC’ory in Banff [17w5025]

Posted in Books, Mountains, pictures, Statistics, Travel, University life on February 27, 2017 by xi'an

And another exciting and animated [last] day of ABC’ory [and practice]! Kyle Cranmer presented a density-ratio approach to density estimation I had not seen before [and will comment on here soon]. Patrick Muchmore talked about unbiased estimators of Gaussian and non-Gaussian densities in elliptically contoured distributions, which allow for running pseudo-MCMC rather than ABC. This reminded me of using the same tool [for those distributions can be expressed as mixtures of normals] in my PhD thesis, if for completely different purposes. In his talk, which included a presentation of an ABC blackbox platform called ELFI, Samuel Kaski did translate statistical inference as inverse reinforcement learning: I hope this does not catch on! In the afternoon, Dennis Prangle gave us the intuition behind his rare event ABC, which is not estimating rare events by ABC but rather using rare event simulation to improve ABC. [A paper I will a.s. comment on here soon as well!] And Scott Sisson concluded the day and the week with his views on ABC for high dimensions.

While being obviously biased as the organiser of the workshop, I nonetheless feel it was a wonderful meeting, with just the right number of participants to induce interactions and discussions during and around the talks, as well as preserve some time for pairwise interactions. Like all other workshops I contributed to at BIRS over the years

07w5079 2007-07-01 Bioinformatics, Genetics and Stochastic Computation: Bridging the Gap
10w2170 2010-09-10 Hierarchical Bayesian Methods in Ecology
14w5125 2014-03-02 Advances in Scalable Bayesian Computation

this is certainly a highly profitable one! For a [major] change, the next one [18w5023] will take place in Oaxaca, Mexico, and will see computational statistics meet molecular simulation. [As an aside, here are the first and last slides of Ewan Cameron’s talk, appropriately illustrating beginning and end, for both themes of his talk: epidemiology and astronomy!]




new version of abcrf

Posted in R, Statistics, University life on February 12, 2016 by xi'an
[Picture: fig-tree near Brisbane, Australia, Aug. 18, 2012]

Version 1.1 of our R library abcrf is now available on CRAN. Improvements over the earlier version are numerous and substantial. In particular, calculations of the random forests have been parallelised and, for machines with multiple cores, the computing gain can be enormous. (The package goes along with the random forest model choice paper published in Bioinformatics.)

ABC model choice via random forests accepted!

Posted in Books, pictures, Statistics, University life on October 21, 2015 by xi'an

“This revision represents a very nice response to the earlier round of reviews, including a significant extension in which the posterior probability of the selected model is now estimated (whereas previously this was not included). The extension is a very nice one, and I am happy to see it included.” Anonymous

Great news [at least for us], our paper on ABC model choice has been accepted by Bioinformatics! With the pleasant comment above from one anonymous referee. This occurs after quite a prolonged gestation, which actually contributed to a shift in our understanding and our implementation of the method. I am still a wee bit unhappy at the rejection by PNAS, but it paradoxically led to a more elaborate article. So all is well that ends well! Except the story is not finished and we are still exploring the multiple usages of random forests in ABC.

abcrf 0.9-3

Posted in R, Statistics, University life on August 27, 2015 by xi'an

[Picture: garden tree, Jan. 12, 2012]

In conjunction with our reliable ABC model choice via random forest paper, about to be resubmitted to Bioinformatics, we have contributed an R package called abcrf that produces a most likely model and its posterior probability out of an ABC reference table. And in conjunction with the realisation that we could devise an approximation to the (ABC) posterior probability using a secondary random forest. “We” meaning Jean-Michel Marin and Pierre Pudlo, as I only acted as a beta tester!

The package abcrf consists of three functions:

  • abcrf, which constructs a random forest from a reference table and returns an object of class ‘abc-rf’;
  • plot.abcrf, which gives both the variable importance plot of a model choice abc-rf object and the projection of the reference table on the LDA axes;
  • predict.abcrf, which predicts the model for new data and evaluates the posterior probability of the MAP.

An illustration from the manual:

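library(abcrf)  # assumes snp (reference table) and snp.obs (observed summaries) are already loaded
mc.rf <- abcrf(snp[1:1e3, 1], snp[1:1e3, -1])  # forest built from model indices (column 1) and summary statistics
predict(mc.rf, snp[1:1e3, -1], snp.obs)        # selected model and its posterior probability for snp.obs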
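
A sketch of the idea behind the secondary forest, in notation that is assumed here rather than taken from the manual ($M$ is the model index, $s$ the vector of summary statistics, $s_{\mathrm{obs}}$ the observed summaries, and $\hat{M}(s)$ the model selected by the first forest): the posterior probability of the selected model is one minus the posterior expectation of the misclassification indicator,

```latex
\mathbb{P}\left[ M = \hat{M}(s_{\mathrm{obs}}) \mid s_{\mathrm{obs}} \right]
  = 1 - \mathbb{E}\left[ \mathbb{1}\{ M \neq \hat{M}(s_{\mathrm{obs}}) \} \mid s_{\mathrm{obs}} \right]
```

As far as I understand it, the second forest is then a regression forest of the (out-of-bag) misclassification indicators on the summary statistics of the reference table, and its prediction at $s_{\mathrm{obs}}$ stands for the expectation on the right-hand side.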

nested sampling for systems biology

Posted in Books, Statistics, University life on January 14, 2015 by xi'an

In conjunction with the recent PNAS paper on massive model choice, Rob Johnson†, Paul Kirk and Michael Stumpf published in Bioinformatics an implementation of nested sampling that is designed for biological applications, called SYSBIONS. Hence the NS for nested sampling! The C software is available on-line. (I had planned to post this news next to my earlier comments but it went under the radar…)

bioinformatics workshop at Pasteur

Posted in Books, Statistics, University life on September 23, 2013 by xi'an

Once again, I (did) find myself attending lectures on a Monday! This time, it was at the Institut Pasteur (where I did not spot any mention of Alexandre Yersin), in the bioinformatics unit, around Bayesian methods in computational biology. The workshop was organised by Michael Nilges and the program started as follows:

9:10 AM Michael Habeck (MPI Göttingen) Bayesian methods for cryo-EM
9:50 AM John Chodera (Sloan-Kettering research institute) Toward Bayesian inference of conformational distributions, analysis of isothermal titration calorimetry experiments, and forcefield parameters
11:00 AM Jeff Hoch (University of Connecticut Health Center) Haldane, Bayes, and Reproducible Research: Bedrock Principles for the Era of Big Data
11:40 AM Martin Weigt (UPMC Paris) Direct-Coupling Analysis: From residue co-evolution to structure prediction
12:20 PM Riccardo Pellarin (UCSF) Modeling the structure of macromolecules using cross-linking data
2:20 PM Frederic Cazals (INRIA Sophia-Antipolis) Coarse-grain Modeling of Large Macro-Molecular Assemblies: Selected Challenges
3:00 PM Yannick Spill (Institut Pasteur) Bayesian Treatment of SAXS Data
3:30 PM Guillaume Bouvier (Institut Pasteur) Clustering protein conformations using Self-Organizing Maps

This is a highly interesting community, from which stemmed many of the MC and MCMC ideas, but I must admit I got lost (in translation) most of the time (and did not attend the workshop till its end), just like when I attended this workshop at the German synchrotron in Hamburg last Spring: some terms and concepts were familiar like Gibbs sampling, Hamiltonian MCMC, HMM modelling, EM steps, maximum entropy priors, reversible jump MCMC, &tc., but the talks were going too fast (for me) and focussed instead on the bio-chemical aspects, like protein folding, entropy-enthalpy, free energy, &tc. So the following comments mostly reflect my being alien to this community…

For instance, I found the talk by John Chodera quite interesting (in a fast-forward high-energy/content manner), but the probabilistic modelling was mostly absent from his slides (and seemed to reduce to a Gaussian likelihood) and the defence of Bayesian statistics sounded a bit like a mantra at times (something like “put a prior on everything you do not know and everything will end up fine with enough simulations”), a feature I once observed in the past with Bayesian ideas coming to a new field (although this hardly seems to be the case here).

All talks I attended mentioned maximum entropy as a way of modelling, apparently a common tool in this domain (as there were too few details for me). For instance, Jeff Hoch’s talk remained at a very general level, referring to a large literature (incl. David Donoho’s) for the advantages of using MaxEnt deconvolution to preserve sensitivity. (The “Haldane” part of his talk was about Haldane —who moved from UCL to the ISI in Calcutta— writing a parody on how to fake genetic data in a convincing manner. And showing the above picture.) Although he linked them with MaxEnt principles, Martin Weigt’s talk was about Markov random fields modelling contacts between amino acids in the protein, but I could not get how the selection among the huge number of possible models was handled: to me it seemed to amount to estimating a graphical model on the protein, as it also did for my neighbour. (No sign of any ABC processing in the picture.)
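
For reference, the Markov random field usually written down in the direct-coupling analysis literature is a Potts model over an aligned sequence $(a_1,\ldots,a_L)$. This is a sketch in assumed notation, not something taken from the slides. It is also the maximum entropy distribution matching the one- and two-site amino acid frequencies of the alignment, hence presumably the MaxEnt connection:

```latex
P(a_1,\ldots,a_L) = \frac{1}{Z}
  \exp\Big\{ \sum_{i=1}^{L} h_i(a_i) + \sum_{1 \le i < j \le L} e_{ij}(a_i, a_j) \Big\}
```

Here the $h_i$ are site-specific fields, the $e_{ij}$ are pairwise couplings between positions $i$ and $j$, and $Z$ is the normalising constant. Strong couplings are then read as candidate contacts, which is where the graphical model estimation comes in.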