I had been privileged to have a look at a preliminary version of the now-published retrospective written by Mike Titterington on the 100 first issues of Biometrika (more exactly, “from volume 28 onwards“, as the title state). Mike was the dedicated editor of Biometrika for many years and edited a nice book for the 100th anniversary of the journal. He started from the 100th most highly cited papers within the journal to build a coherent chronological coverage. From a Bayesian perspective, this retrospective starts with Maurice Kendall trying to reconcile frequentists and non-frequentists in 1949, while having a hard time with fiducial statistics. Then Dennis Lindley makes it to the top 100 in 1957 with the Lindley-Jeffreys paradox. From 1958 till 1961, Darroch is quoted several times for his (fine) formalisation of the capture-recapture experiments we were to study much later (Biometrika, 1992) with Ed George… In the 1960′s, Bayesian papers became more visible, including Don Fraser (1961) and Arthur Dempster’ Demspter-Shafer theory of evidence, as well as George Box and co-authors (1965, 1968) and Arnold Zellner (1964). Keith Hastings’ 1970 paper stands as the fifth most highly cited paper, even though it was ignored for almost two decades. The number of Bayesian papers kept increasing. including Binder’s (1978) cluster estimation, Efron and Morris’ (1972) James-Stein estimators, and Efron and Thisted’s (1978) terrific evaluation of Shakespeare’s vocabulary. From then, the number of Bayesian papers gets too large to cover in its entirety. The 1980′s saw papers by Julian Besag (1977, 1989, 1989 with Peter Clifford, which was yet another precursor MCMC) and Luke Tierney’s work (1989) on Laplace approximation. Carter and Kohn’s (1994) MCMC algorithm on state space models made it to the top 40, while Peter Green’s (1995) reversible jump algorithm came close to Hastings’ (1970) record, being the 8th most highly cited paper. Since the more recent papers do not make it to the top 100 list, Mike Titterington’s coverage gets more exhaustive as the years draw near, with an almost complete coverage for the final years. Overall, a fascinating journey through the years and the reasons why Biometrika is such a great journal and constantly so.
Archive for MCMC algorithms
Here is an email I received a few days ago, similar to many other emails I/we receive on a regular basis:
I am working on Markov Chain Monte Carlo methods as part of my Masters project. I have to estimate mean, variance from a Gaussian mixture using metropolis method. I came across your paper ‘Bayesian Modelling and Inference on Mixtures of Distributions’. I am unable to understand how to obtain the new sample for mean, variance etc… I am using uniform distribution as proposal distribution. Should it be random numbers for the proposal distribution.
I have been working and trying to understand this for a long time. I would be grateful for any help.
While I felt sorry for the Master student, I consider it is the responsibility of his/her advisor to give her/him the proper directions for understanding the paper. (Given the contents of the email, it sounds as if the student would require proper training in both Bayesian statistics [uniform priors on unbounded parameters?] and simulation [the question about random numbers does not make sense]…) This is what I replied to the student, hopefully in a positive tone.
Larry Wasserman wrote a blog entry on the normalizing constant paradox, where he repeats that he does not understand my earlier point…Let me try to recap here this point and the various comments I made on StackExchange (while keeping in mind all this is for intellectual fun!)
The entry is somehow paradoxical in that Larry acknowledges (in that post) that the analysis in his book, All of Statistics, is wrong. The fact that “g(x)/c is a valid density only for one value of c” (and hence cannot lead to a notion of likelihood on c) is the very reason why I stated that there can be no statistical inference nor prior distribution about c: a sample from f does not bring statistical information about c and there can be no statistical estimate of c based on this sample. (In case you did not notice, I insist upon statistical!)
To me this problem is completely different from a statistical problem, at least in the modern sense: if I need to approximate the constant c—as I do in fact when computing Bayes factors—, I can produce an arbitrarily long sample from a certain importance distribution and derive a converging (and sometimes unbiased) approximation of c. Once again, this is Monte Carlo integration, a numerical technique based on the Law of Large Numbers and the stabilisation of frequencies. (Call it a frequentist method if you wish. I completely agree that MCMC methods are inherently frequentist in that sense, And see no problem with this because they are not statistical methods. Of course, this may be the core of the disagreement with Larry and others, that they call statistics the Law of Large Numbers, and I do not. This lack of separation between both notions also shows up in a recent general public talk on Poincaré’s mistakes by Cédric Villani! All this may just mean I am irremediably Bayesian, seeing anything motivated by frequencies as non-statistical!) But that process does not mean that c can take a range of values that would index a family of densities compatible with a given sample. In this Monte Carlo integration approach, the distribution of the sample is completely under control (modulo the errors induced by pseudo-random generation). This approach is therefore outside the realm of Bayesian analysis “that puts distributions on fixed but unknown constants”, because those unknown constants parameterise the distribution of an observed sample. Ergo, c is not a parameter of the sample and the sample Larry argues about (“we have data sampled from a distribution”) contains no information whatsoever about c that is not already in the function g. (It is not “data” in this respect, but a stochastic sequence that can be used for approximation purposes.) Which gets me back to my first argument, namely that c is known (and at the same time difficult or impossible to compute)!
Let me also answer here the comments on “why is this any different from estimating the speed of light c?” “why can’t you do this with the 100th digit of π?” on the earlier post or on StackExchange. Estimating the speed of light means for me (who repeatedly flunked Physics exams after leaving high school!) that we have a physical experiment that measures the speed of light (as the original one by Rœmer at the Observatoire de Paris I visited earlier last week) and that the statistical analysis infers about c by using those measurements and the impact of the imprecision of the measuring instruments (as we do when analysing astronomical data). If, now, there exists a physical formula of the kind
where φ is a probability density, I can imagine stochastic approximations of c based on this formula, but I do not consider it a statistical problem any longer. The case is thus clearer for the 100th digit of π: it is also a fixed number, that I can approximate by a stochastic experiment but on which I cannot attach a statistical tag. (It is 9, by the way.) Throwing darts at random as I did during my Oz tour is not a statistical procedure, but simple Monte Carlo à la Buffon…
Overall, I still do not see this as a paradox for our field (and certainly not as a critique of Bayesian analysis), because there is no reason a statistical technique should be able to address any and every numerical problem. (Once again, Persi Diaconis would almost certainly differ, as he defended a Bayesian perspective on numerical analysis in the early days of MCMC…) There may be a “Bayesian” solution to this particular problem (and that would nice) and there may be none (and that would be OK too!), but I am not even convinced I would call this solution “Bayesian”! (Again, let us remember this is mostly for intellectual fun!)
Here are the slides for the second day of my course at Monash University, Melbourne, in the Special Lectures in Econometrics, with a strong strong similarity with the slides of my course in Roma this Spring. (Ah, sunny Roma…) The first day lecture was very well attended and I hope this remains true for the second! (I also think I should spend more time on particle filters in general, the next time I give a similar course…)