Last week, Michael Betancourt, from Warwick, arXived a neat wee note on the fundamental difficulties in running HMC on a subsample of the original data. The core message is that using only one fraction of the data to run an HMC with the hope that it will preserve the stationary distribution does not work. The only way to recover from the bias is to use a Metropolis-Hastings step using the whole data, a step that both kills most of the computing gain and has very low acceptance probabilities. Even the strategy that subsamples for each step in a single trajectory fails: there cannot be a significant gain in time without a significant bias in the outcome. Too bad..! Now, there are ways of accelerating HMC, for instance by parallelising the computation of gradients but, just as in any other approach (?), the information provided by the whole data is only available when looking at the whole data.
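To make the bias concrete, here is a minimal toy sketch of my own (not taken from the note): HMC without any accept/reject step for the mean of a N(θ,1) sample under a flat prior, where each trajectory uses the rescaled gradient from a fresh subsample. The function name `uncorrected_hmc`, the quarter-period trajectory length, and the simulated data are all assumptions made for illustration. With the full data the draws have about the right posterior variance 1/N, while with a 10% subsample the spread is several times too large.

```python
import numpy as np


def uncorrected_hmc(data, n_sub, n_iter=2000, n_leap=10, seed=0):
    """HMC with no Metropolis-Hastings correction for the mean of N(theta, 1) data.

    Flat prior, so the potential is U(theta) = N/2 * (theta - mean(data))^2.
    Each trajectory replaces mean(data) by the mean of a fresh subsample of
    size n_sub (hypothetical toy setting, for illustration only).
    """
    rng = np.random.default_rng(seed)
    N = data.size
    # total integration time = quarter period of the harmonic oscillator
    eps = (np.pi / 2) / (np.sqrt(N) * n_leap)
    theta = data.mean()
    draws = np.empty(n_iter)
    for i in range(n_iter):
        batch_mean = rng.choice(data, size=n_sub, replace=False).mean()
        p = rng.standard_normal()  # fresh momentum
        for _ in range(n_leap):
            # leapfrog step driven by the (rescaled) subsample gradient
            p -= 0.5 * eps * N * (theta - batch_mean)
            theta += eps * p
            p -= 0.5 * eps * N * (theta - batch_mean)
        draws[i] = theta  # no accept/reject step: the proposal is always kept
    return draws


rng = np.random.default_rng(1)
data = rng.standard_normal(1000) + 2.0  # simulated data, true mean 2
post_var = 1.0 / data.size              # exact posterior variance

ratio_full = uncorrected_hmc(data, n_sub=data.size).var() / post_var
ratio_sub = uncorrected_hmc(data, n_sub=100).var() / post_var
print(f"variance ratio, full data : {ratio_full:.2f}")
print(f"variance ratio, 10% sample: {ratio_sub:.2f}")
```

Restoring the correct target in the subsampled run would require an accept/reject step based on the full-data posterior, which is exactly the cost the subsampling was meant to avoid.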
The University of Warwick is one of the five UK Universities (Cambridge, Edinburgh, Oxford, Warwick and UCL) to be part of the new Alan Turing Institute. To quote from the University press release, “The Institute will build on the UK’s existing academic strengths and help position the country as a world leader in the analysis and application of big data and algorithm research. Its headquarters will be based at the British Library at the centre of London’s Knowledge Quarter.” The Institute will gather researchers from mathematics, statistics, computer sciences, and connected fields towards collegial and focussed research, which means in particular that it will hire a fairly large number of researchers in stats and machine-learning in the coming months. The Department of Statistics at Warwick was strongly involved in answering the call for the Institute and my friend and colleague Mark Girolami will be the University’s leading figure at the Institute, alas meaning that we will meet even less frequently! Note that the call for the Chair of the Alan Turing Institute is now open, with a deadline on March 15. [As a personal aside, I find that the recognition by the Business Secretary that
“Alan Turing’s genius played a pivotal role in cracking the codes that helped us win the Second World War. It is therefore only right that our country’s top universities are chosen to lead this new institute named in his honour.”
does not absolve the legal system that drove Turing to suicide….]
With my friends Peter Green (Bristol), Krzysztof Łatuszyński (Warwick) and Marcelo Pereyra (Bristol), we just arXived the first version of “Bayesian computation: a perspective on the current state, and sampling backwards and forwards”, whose first title was the title of this post. This is a survey of our own perspective on Bayesian computation, from what occurred in the last 25 years [a lot!] to what could occur in the near future [a lot as well!]. It has been submitted to Statistics and Computing for the special 25th anniversary issue, as announced in an earlier post. Pulling strength and breadth from each other’s opinions, we have certainly attained more than the sum of our initial respective contributions, but we welcome comments about bits and pieces of importance that we missed and even more about promising new directions that are not covered in this survey. (A warning that should go with most of my surveys is that my input in this paper will not differ by a large margin from ideas expressed here or in previous surveys.)
Reading Significance is always an enjoyable moment, when I can find time to skim through the articles (before my wife gets hold of it!). This time, I lost my copy between my office and home, and borrowed one from Tom Nichols at Warwick, with four mornings to read it during breakfast. This December issue is definitely interesting, as it contains several introductory articles on astro- and cosmo-statistics! One thing I had not noticed before is how large a fraction of the papers is written by authors of books, giving a quick entry or interview about their book. For instance, I found out that Roberto Trotta had written a general public book called The Edge of the Sky (All You Need to Know About the All-There-Is), which exposes the fundamentals of cosmology through the 1000 most common words in the English language. So Universe is replaced with All-There-Is! I can understand and to some extent applaud the intention, but it nonetheless makes for a painful read, judging from the excerpt, when researcher and telescope are not part of the accepted vocabulary. Reading the corresponding article in Significance left me a bit bemused at the reason provided for the existence of a multiverse, i.e., of multiple replicas of our universe, all with different conditions: multiplying the universes makes ours more likely, while it sounds almost impossible on its own! This sounds like a very frequentist argument… and I am not even certain it would convince a frequentist. The other articles in this special astrostatistics section were of a more statistical nature, from estimating the number of galaxies to the chances of a big asteroid impact. I found the graphical representation of the meteorite impacts in the past century hard to read because of the impact drawing in the background. However, when I checked the link to Carlo Zapponi’s website, I found the picture was a still of a neat animation of meteorites falling since the first report.
As I could not book my “usual” maths house on the campus of the University of Warwick, I searched for other accommodation and discovered a nice shared house in the countryside (next to my standard running route), run by the Warwick Institute of Advanced Study, and called Cryfield Grange. As seen from the pictures, the building itself is impressive, even though there is not much left inside of its Tudor foundations, except some unexpected steps in the middle of some rooms and a few remaining black beams; it is also quite enjoyable for a week-long visit, with a large kitchen where I made rice pudding and pissaladière for the whole week, and a bike path to the University. I will definitely try to get there in the summer, as it must be even more enjoyable!
Here is the fifth instalment in the Peter Grant (or Rivers of London) series by Ben Aaronovitch, entitled Foxglove Summer, a title whose meaning only became clear (to me) by the end of the book. I found it in my mailbox upon arrival in Warwick last Sunday. And rushed through the book during evenings, insomnia breaks and even a few breakfasts!
“It’s observable but not reliably observable. It can have quantifiable effects, but resists any attempt to apply mathematical principles to it – no wonder Newton kept magic under wraps. It must have driven him mental. Or maybe not.” (p.297)
Either because the author has run out of ideas to centre a fifth novel on a part or aspect of London (even though the parks, including the London Zoo, were not particularly used in the previous novels), or because he could not set this new type of supernatural in a city (no spoilers!), this sequel takes place in the Western Counties, close to the Welsh border (and not so far from Brother Cadfael’s Shrewsbury!). It is also an opportunity to introduce brand new (local) characters, who are enjoyable if a wee bit of a caricature! However, the inhabitants of the small village where the kidnapping investigation takes place are almost too sophisticated for Peter Grant, who has to handle the enquiry all by himself, as his mentor is immobilised in London by the defection of Peter’s close colleague, Lesley.
“We trooped off (…) down something that was not so much a path as a statistical variation in the density of the overgrowth.” (p.61)
As usual, the dialogues and monologues of Grant are the most enjoyable part of the story, along with a development of the long-in-the-coming love affair with the river goddess Beverley Brook. And a much appreciated ambiguity in the attitude of Peter about the runaway Lesley… The story itself reflects the limitations of a small village, where one quickly repeats over and over the same trips and the same relations. Which gives a sensation of slow motion, even in the most exciting moments. The resolution of the enigma borrows too heavily from the fae and elves folklore, even though the final pages bring a few surprises. Nonetheless, the whole book was a page-turner for me, meaning I spent more time reading it this week than I intended or than was reasonable. No wonder for a series taking place in The Folly!
Another short paper about relabelling in mixtures was arXived last week by Pauli and Torelli. They refer rather extensively to a previous paper by Puolamäki and Kaski (2009), of which I was not aware, a paper attempting to get an unswitching sampler, i.e., one that does not exhibit any label switching, a concept I find most curious as I see no rigorous way to state that a sampler is not switching! This would imply spotting low posterior probability regions that the chain would cross. But I should check the paper nonetheless.
Because the G-component mixture posterior is invariant under the G! possible permutations of the labels, I am somewhat undecided as to what the authors of the current paper mean by estimating the difference between two means, like μ1-μ2, since they object to using the output of a perfectly mixing MCMC algorithm and seem to prefer the one associated with a non-switching chain. Or by estimating the probability that a given observation is from a given component, since this is exactly 1/G by the permutation invariance property. In order to identify a partition of the data, they introduce a loss function on the joint allocations of pairs of observations, a loss function that sounds quite similar to the one we used in our 2000 JASA paper on the label switching deficiencies of MCMC algorithms. (And makes me wonder why this work of ours is not deemed relevant for the approach advocated in the paper!) Still, having read this paper, which I find rather poorly written, I have no clear understanding of how the authors give a precise meaning to a specific component of the mixture distribution. Or how the relabelling has to be conducted to avoid switching. That is, how the authors define their parameter space. Or their loss function. Unless one falls back onto the ordering of the means or the weights, which has the drawback of not connecting with the level sets of a particular mode of the posterior distribution, meaning that imposing the constraints results in a region that contains bits of several modes.
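As a small illustration of the invariance argument (a sketch on made-up allocation draws, implementing neither the loss function of the paper nor the one of our JASA paper), one can mimic a perfectly mixing sampler by applying all G! label permutations to the output of a non-switching chain. The per-label allocation frequencies then collapse to exactly 1/G, while the pairwise co-allocation frequencies, which only depend on whether two observations share a component, are unchanged. The helper names `allocation_freq` and `coallocation` and the simulated draws `z` are assumptions for the example.

```python
import itertools
import numpy as np


def allocation_freq(z, G):
    """Fraction of draws allocating each observation to each label, shape (n, G)."""
    return np.stack([(z == g).mean(axis=0) for g in range(G)], axis=1)


def coallocation(z):
    """Fraction of draws in which observations i and j share a component, shape (n, n)."""
    return (z[:, :, None] == z[:, None, :]).mean(axis=0)


rng = np.random.default_rng(0)
G, n, T = 3, 6, 200

# Hypothetical allocation draws from a chain stuck in a single labelling:
# each observation sits in its "own" group 90% of the time, else a random label.
groups = np.array([0, 0, 1, 1, 2, 2])
z = np.where(rng.random((T, n)) < 0.9, groups, rng.integers(0, G, size=(T, n)))

# Mimic perfect label switching: stack the draws under all G! relabellings.
z_sym = np.concatenate(
    [np.array(perm)[z] for perm in itertools.permutations(range(G))]
)

freq_stuck = allocation_freq(z, G)     # informative, but tied to one labelling
freq_sym = allocation_freq(z_sym, G)   # every entry equals 1/G
coalloc_stuck = coallocation(z)
coalloc_sym = coallocation(z_sym)      # same as coalloc_stuck

print(np.round(freq_sym, 3))
print(np.abs(coalloc_sym - coalloc_stuck).max())
```

This is why a loss on pairs of allocations yields a well-defined partition of the data under either sampler, while component-specific quantities need an extra definition of what "a component" means.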
At some point the authors assume the data can be partitioned into K≤G groups such that there is a representative observation within each group never sharing a component (across MCMC iterations) with any of the other representatives. While this notion is label invariant, I wonder (a) whether this is possible on any MCMC outcome; (b) whether it indicates a positive or negative feature of the MCMC sampler; and (c) what prevents the representatives from switching in harmony from one component to the next while preserving their perfect mutual exclusion… This however constitutes the advance in the paper, namely that component-dependent quantities are estimated as those associated with a particular representative. Note that the paper contains no illustration, hence that the method may prove hard to impossible to implement!