Great poster sessions yesterday evening and at lunch today. Saw an ABC poster (by Dennis Prangle, following our random forest paper) and several MCMC posters (by Marco Banterle, who actually won one of the speed-meeting mini-project awards!, Michael Betancourt, Anne-Marie Lyne, Murray Pollock), and then a rather different poster on Mondrian forests, which generalise random forests to sequential data (by Balaji Lakshminarayanan). The talks all had interesting aspects or glimpses about big data and some of the unnecessary hype about it (them?!), along with exposing Amazon's nefarious plan to become the Earth's only seller!, but I particularly enjoyed the astronomy afternoon and even more particularly Steve Roberts' sweep through machine learning in astronomy. Steve characterised variational Bayes as picking your choice of sufficient statistics, which made me wonder why there were no stronger connections between variational Bayes and ABC. He also quoted the book The Fourth Paradigm: Data-Intensive Scientific Discovery by Tony Hey as putting forward interesting notions. (A book review for the next vacations?!) And he also mentioned zooniverse, a citizen-science website I was not aware of, with a Bayesian analysis of the learning curve of the annotating citizens (in the case of supernova classification). Big deal, indeed!!!
Archive for astronomy
“The facts that the thick-disc episode lasted for several billion years, that a contraction is observed during the collapse phase, and that the main thick disc has a constant scale height with no flare argue against the formation of the thick disc through radial migration. The most probable scenario for the thick disc is that it formed while the Galaxy was gravitationally collapsing from well-mixed gas-rich giant clumps that were sustained by high turbulence, which prevented a thin disc from forming for a time, as proposed previously.”
Following discussions with astronomers from Besançon on the use of ABC methods to approximate posteriors, I was associated with their paper on assessing a formation scenario of the Milky Way, which was accepted a few weeks ago in Astronomy & Astrophysics. The central problem (was there a thin-then-thick disk?) somewhat escapes me, but this collaboration started when some of the astronomers leading the study contacted me about convergence issues with their MCMC algorithms and I realised they were using ABC-MCMC without any idea that it was in fact called ABC-MCMC and had been studied previously in another corner of the literature… The scale in the kernel was chosen to achieve an average acceptance rate of 5%-10%. Models are then compared by combining a log-likelihood approximation resulting from the ABC modelling with a BIC ranking of the models. (Incidentally, I was impressed at the number of papers published in Astronomy & Astrophysics. The monthly issue contains dozens of papers!)
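For readers unfamiliar with the scheme, ABC-MCMC can be sketched on a toy problem, a normal mean standing in for the actual Milky Way simulator; everything here (model, summary, settings) is hypothetical and only illustrates the general recipe, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the astronomy simulator: data are N(theta, 1) draws.
def simulate(theta, n=100):
    return rng.normal(theta, 1.0, size=n)

def distance(x, y):
    # Distance between summary statistics (here, sample means).
    return abs(x.mean() - y.mean())

obs = simulate(2.0)

def abc_mcmc(obs, n_iter=5000, scale=0.5, eps=0.1):
    theta = obs.mean()          # start from a crude pilot estimate
    chain, accepts = [], 0
    for _ in range(n_iter):
        prop = theta + scale * rng.normal()
        # Flat prior and symmetric proposal, so the MH ratio is one:
        # the proposal is accepted iff pseudo-data simulated at prop
        # falls within tolerance eps of the observations.
        if distance(simulate(prop), obs) < eps:
            theta = prop
            accepts += 1
        chain.append(theta)
    return np.array(chain), accepts / n_iter

chain, rate = abc_mcmc(obs)
```

Tuning `scale` against the realised `rate` is how the 5%-10% average acceptance target mentioned above would be met.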
Following my earlier post about the young astronomer who feared he was running his MCMC for too long, here is an update from his visit to my office this morning, which proved quite instructive for both of us. (Disclaimer: the picture of an observatory seen from across Brunel’s suspension bridge in Bristol is, as earlier, completely unrelated to the young astronomer!)
First, the reason why he thought the MCMC was running for too long was that the acceptance rate was plummeting to zero, whatever the random walk scale. The reason for this behaviour is that he was actually running a standard simulated annealing algorithm, hence observing the stabilisation of the Markov chain in one of the (global) modes of the target function. In that sense, he was right that the MCMC had run for “too long”, as there was nothing to expect once the mode had been reached and the temperature turned down to zero. So the algorithm was in fact working correctly.
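A minimal sketch of this behaviour, on a toy bimodal target with an illustrative geometric cooling schedule (nothing here comes from the astronomer's actual code): as the temperature goes to zero, uphill moves are almost surely rejected, the acceptance rate collapses, and the chain freezes in a mode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bimodal target: global mode at x = 3, local mode at x = -2.
def energy(x):
    return -np.log(0.3 * np.exp(-0.5 * (x + 2) ** 2)
                   + 0.7 * np.exp(-0.5 * (x - 3) ** 2))

def annealed_walk(n_iter=20000, scale=1.0):
    x = 0.0
    accepts_late = 0
    for t in range(1, n_iter + 1):
        temp = 0.999 ** t              # temperature decreasing to zero
        prop = x + scale * rng.normal()
        # Metropolis step on the tempered target exp(-energy / temp):
        # as temp -> 0, uphill moves are almost surely rejected and the
        # chain freezes in whichever mode it last occupied.
        if np.log(rng.uniform()) < (energy(x) - energy(prop)) / temp:
            x = prop
            if t > n_iter // 2:
                accepts_late += 1
    return x, accepts_late / (n_iter // 2)

x_final, late_rate = annealed_walk()
```

The late-stage acceptance rate is essentially zero while the final state sits near one of the modes, which is exactly the "plummeting acceptance rate" the astronomer observed.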
Second, the astronomy problem he considers has a rather complex likelihood, for which he substituted a distance between the (discretised) observed data and (discretised) simulated data, simulated conditional on the current parameter value. Now… does this ring a bell? If not, here is a three-letter clue: ABC… Indeed, the trick he had found to get around this likelihood calculation issue was to re-invent a version of ABC-MCMC! Except that the distance was re-introduced into a regular MCMC scheme as a substitute for the log-likelihood, and compared with the distance at the previous MCMC iteration. This is quite clever, even though the substitution suffers from a normalisation issue (which I already mentioned in the post about Holmes’ and Walker’s idea to turn loss functions into pseudo-likelihoods). Regular ABC does not encounter this difficulty, obviously. I am still bemused by this reinvention of ABC from scratch!
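The reinvented scheme, with the distance plugged in where the log-likelihood would go, can be sketched as follows; again this is a toy normal-mean illustration under my own assumptions, not the astronomer's code:

```python
import numpy as np

rng = np.random.default_rng(2)

# Same toy setup: data are N(theta, 1), distance between sample means.
def simulate(theta, n=100):
    return rng.normal(theta, 1.0, size=n)

def distance(x, y):
    return abs(x.mean() - y.mean())

obs = simulate(1.0)

def distance_mcmc(obs, n_iter=5000, scale=0.5):
    theta = obs.mean()
    d = distance(simulate(theta), obs)
    chain = []
    for _ in range(n_iter):
        prop = theta + scale * rng.normal()
        d_prop = distance(simulate(prop), obs)
        # The distance replaces the log-likelihood: the move is accepted
        # with probability exp(d - d_prop), comparing the new distance
        # with the one at the previous iteration.  The implicit
        # likelihood exp(-d(theta)) carries an unknown, theta-dependent
        # normalising constant -- the flaw discussed above, which a hard
        # tolerance (regular ABC) avoids.
        if np.log(rng.uniform()) < d - d_prop:
            theta, d = prop, d_prop
        chain.append(theta)
    return np.array(chain)

chain = distance_mcmc(obs)
```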
So we are now at a stage where my young friend will experiment with (hopefully) correct ABC steps, trying to derive the tolerance value from warmup simulations and use some of the accelerating tricks suggested by Umberto Picchini and Julie Forman to avoid simulating the characteristics of millions of stars for nothing. And we agreed to meet soon for an update. Indeed, a fairly profitable morning for both of us!
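Deriving the tolerance from warmup simulations could, for instance, proceed by keeping a small quantile of pilot distances simulated from the prior; this is only a sketch under toy assumptions (uniform prior, normal-mean simulator), not what my young friend will necessarily implement:

```python
import numpy as np

rng = np.random.default_rng(3)

# Same toy setup as above: N(theta, 1) data, distance between means.
def simulate(theta, n=100):
    return rng.normal(theta, 1.0, size=n)

def distance(x, y):
    return abs(x.mean() - y.mean())

obs = simulate(0.5)

# Warmup: simulate parameters from the prior (here uniform on [-5, 5]),
# compute the distance of each pseudo-dataset to the observations, and
# keep a small quantile of those distances as the ABC tolerance.
pilot_thetas = rng.uniform(-5.0, 5.0, size=1000)
pilot_dists = np.array([distance(simulate(t), obs) for t in pilot_thetas])
eps = np.quantile(pilot_dists, 0.05)   # closest 5% of warmup runs
```

The resulting `eps` would then be fed to an ABC-MCMC run such as the one sketched earlier; lowering the quantile trades acceptance rate for approximation accuracy.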
Here is an excerpt from an email I just received from a young astronomer with whom I have had many email exchanges about the nature and implementation of MCMC algorithms, apparently without getting my point across:
The acceptance ratio turn to be good if I used (imposed by me) smaller numbers of iterations. What I think I am doing wrong is the convergence criteria. I am not stopping when I should stop.
To which I replied he should come (or Skype) and talk with me, as I cannot get into enough details by email to point out where his analysis is wrong… It may be the case that the MCMC algorithm finds a first mode, explores its neighbourhood (hence a good acceptance rate and green signals for convergence), then wanders away, attracted by other modes. It may also be the case that the code has mistakes. Anyway, you cannot stop a (basic) MCMC algorithm too late or let it run for too long! (Disclaimer: the picture of an observatory seen from across Brunel’s suspension bridge in Bristol is unrelated to the young astronomer!)
Today, I made a quick TGV trip to Besançon, in the French Jura, to give a seminar to astronomers and physicists, in connection with the Gaia project I had mentioned earlier. I gave my talk straight out of the train and then we started discussing MCMC and ABC for the astronomy problems my hosts face. To my surprise, I discovered that they do run some local form of ABC, using their own statistics and distances to validate simulations from the (uniform) prior on their parameter space. The discussion went far enough to take a peek under the hood, namely to look at some Fortran programs they are running (and make suggestions for acceleration and adaptation). It is quite interesting to see that ABC is actually a natural approach when people face complex likelihoods and that, while they construct appropriate tools, they feel somewhat uncertain about the validation of those methods and are unaware of very similar tools in other fields. In addition to this great day of exchange, I had several hours of freedom in the train (and a plug) to work on the bayess package for Bayesian Essentials (not dead yet!). Here are my slides, a pot-pourri of earlier talks (including the one on cosmology model choice in Vancouver).
Today, I attended a meeting at the Paris observatory about the incoming launch of the Gaia satellite and the associated data (mega-)challenges. To borrow from the webpage, “To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.” The amount of data that will be produced by this satellite is staggering: Gaia will take pictures of roughly one gigapixel that will be processed both on-board and on Earth, transmitting over five years a petabyte of data that needs to be processed fairly efficiently to be at all useful! The European consortium operating this satellite has planned for specific tasks dedicated to data handling and processing, which is a fabulous opportunity for would-be astrostatisticians! (Unsurprisingly, at least half of the tasks are statistics related, either at the noise reduction stage or at the estimation stage.) Another amazing feature of the project is that it will result in open data, the outcome of the observations being open to everyone for analysis… I am clearly looking forward to the next meeting to better understand the structure of the data and the challenges simulation methods could help to solve!
Martin Kilbinger, an astronomer (cosmologist) with whom we had worked on population Monte Carlo for cosmological inference [during the ANR-05-BLAN-0283-04 ANR ECOSSTAT grant], has made the PMC C codes available on the CosmoPMC webpage. He has also written a CosmoPMC manual that is now available from arXiv. And he very kindly associated me with this publication, even though I never directly contributed to the codes… On a wider perspective, this collaboration between cosmologists and Bayesian and computational statisticians was both fruitful and enjoyable, and I hope we can pursue it in the future. A very nice thing about astronomers (among many!) is that they naturally adopt a Bayesian way of thinking about their parameters. This, plus their high math and programming skills, makes the cost of entering a collaboration very low!