When playing with Peter Rossi’s bayesm R package during a visit of Jean-Michel Marin to Paris, last week, we came up with the above Gibbs outcome. The setting is a Gaussian mixture model with three components in dimension 5 and the prior distributions are standard conjugate. In this case, with 500 observations and 5000 Gibbs iterations, the Markov chain (for one component of one mean of the mixture) has two highly distinct regimes: one that revolves around the true value of the parameter, 2.5, and one that explores a much broader area (which is associated with a much smaller value of the component weight). What we found amazing is the Gibbs ability to entertain both regimes, simultaneously.
Archive for Gibbs sampler
A question on Cross Validated led me to realise I had never truly considered the issue of periodic Gibbs samplers! In MCMC, non-aperiodic chains are a minor nuisance in that the skeleton trick of randomly subsampling the Markov chain leads to a aperiodic Markov chain. (The picture relates to the skeleton!) Intuitively, while the systematic Gibbs sampler has a tendency to non-reversibility, it seems difficult to imagine a sequence of full conditionals that would force the chain away from the current value..!In the discrete case, given that the current state of the Markov chain has positive probability for the target distribution, the conditional probabilities are all positive as well and hence the Markov chain can stay at its current value after one Gibbs cycle, with positive probabilities, which means strong aperiodicity. In the continuous case, a similar argument applies by considering a neighbourhood of the current value. (Incidentally, the same person asked a question about the absolute continuity of the Gibbs kernel. Being confused by our chapter on the topic!!!)
Second day at the Indo-French Centre for Applied Mathematics and the workshop. Maybe not the most exciting day in terms of talks (as I missed the first two plenary sessions by (a) oversleeping and (b) running across the campus!). However I had a neat talk with another conference participant that led to [what I think are] interesting questions… (And a very good meal in a local restaurant as the guest house had not booked me for dinner!)
To wit: given a target like
the simulation of λ can be demarginalised into the simulation of
where z is a latent (and artificial) variable. This means a Gibbs sampler simulating λ given z and z given λ can produce an outcome from the target (*). Interestingly, another completion is to consider that the zi‘s are U(0,yi) and to see the quantity
as an unbiased estimator of the target. What’s quite intriguing is that the quantity remains the same but with different motivations: (a) demarginalisation versus unbiasedness and (b) zi ∼ Exp(λ) versus zi ∼ U(0,yi). The stationary is the same, as shown by the graph below, the core distributions are [formally] the same, … but the reasoning deeply differs.
Obviously, since unbiased estimators of the likelihood can be justified by auxiliary variable arguments, this is not in fine a big surprise. Still, I had not thought of the analogy between demarginalisation and unbiased likelihood estimation previously. Continue reading
- Dan Simpson replied to my comments of last Tuesday about his PC construction;
- Arnaud Doucet precised some issues about his adaptive subsampling paper;
- Amandine Schreck clarified why I had missed some points in her Bayesian variable selection paper;
- Randal Douc defended the efficiency of using Carlin and Chib (1995) method for mixture simulation.
Thanks to them for taking the time to answer my musings…
“At equilibrium, we thus should not expect gains of several orders of magnitude.”
As was signaled to me several times during the MCqMC conference in Leuven, Rémi Bardenet, Arnaud Doucet and Chris Holmes (all from Oxford) just wrote a short paper for the proceedings of ICML on a way to speed up Metropolis-Hastings by reducing the number of terms one computes in the likelihood ratio involved in the acceptance probability, i.e.
The observations appearing in this likelihood ratio are a random subsample from the original sample. Even though this leads to an unbiased estimator of the true log-likelihood sum, this approach is not justified on a pseudo-marginal basis à la Andrieu-Roberts (2009). (Writing this in the train back to Paris, I am not convinced this approach is in fact applicable to this proposal as the likelihood itself is not estimated in an unbiased manner…)
In the paper, the quality of the approximation is evaluated by Hoeffding’s like inequalities, which serves as the basis for a stopping rule on the number of terms eventually evaluated in the random subsample. In fine, the method uses a sequential procedure to determine if enough terms are used to take the decision and the probability to take the same decision as with the whole sample is bounded from below. The sequential nature of the algorithm requires to either recompute the vector of likelihood terms for the previous value of the parameter or to store all of them for deriving the partial ratios. While the authors adress the issue of self-evaluating whether or not this complication is worth the effort, I wonder (from my train seat) why they focus so much on recovering the same decision as with the complete likelihood ratio and the same uniform. It would suffice to get the same distribution for the decision (an alternative that is easier to propose than to create of course). I also (idly) wonder if a Gibbs version would be manageable, i.e. by changing only some terms in the likelihood ratio at each iteration, in which case the method could be exact… (I found the above quote quite relevant as, in an alternative technique we are constructing with Marco Banterle, the speedup is particularly visible in the warmup stage.) Hence another direction in this recent flow of papers attempting to speed up MCMC methods against the incoming tsunami of “Big Data” problems.
Last Thursday, I gave a seminar in Nottingham, the true birthplace of the Gibbs sampler!, and I had a quite enjoyable half-day of scientific discussions in the Department of Statistics, with a fine evening tasting a local ale in the oldest (?) inn in England (Ye Olde Trip to Jerusalem) and sampling Indian dishes at 4550 Miles [plus or minus epsilon, since the genuine distance is 4200 miles) from Dehli, plus a short morning run on the very green campus. In particular, I discussed with Theo Kypraios and Simon Preston parallel ABC and their recent paper in Statistics and Computing, their use of the splitting technique of Neiswanger et al. I discussed earlier but intended here towards a better ABC approximation since (a) each term in the product could correspond to a single observation and (b) hence no summary statistic was needed and a zero tolerance could be envisioned. The paper discusses how to handle samples from terms in a product of densities, either by a Gaussian approximation or by a product of kernel estimates. And mentions connections with expectation propagation (EP), albeit not at the ABC level.
A minor idea that came to me during this discussion was to check whether or not a reparameterisation towards a uniform prior was a good idea: the plus of a uniform prior was that the power discussion was irrelevant, making both versions of the parallel MCMC algorithm coincide. The minus was not the computational issue since most priors are from standard families, with easily invertible cdfs, but rather why this was supposed to make a difference. When writing this on the train to Oxford, I started wondering as an ABC implementation is impervious to this reparameterisation. Indeed, simulate θ from π and pseudo-data given θ versus simulate μ from uniform and pseudo-data given T(μ) does not make a difference in the simulated pseudo-sample, hence in the distance selected θ’s, and still in one case the power does not matter while in the other case it does..!
Another discussion I had during my visit led me to conclude a bit hastily that a thesis topic I had suggested to a new PhD student a few months ago had already been considered locally and earlier, although it ended up as a different, more computational than conceptual, perspective (so not all was lost for my student!). In a wider discussion around lunch, we also had an interesting foray on possible alternatives to Bayes factors and their shortcomings, which was a nice preparation to my seminar on giving up posterior probabilities for posterior error estimates. And an opportunity to mention the arXival of a proper scoring rules paper by Phil Dawid, Monica Musio and Laura Ventura, related with the one I had blogged about after the Padova workshop. And then again about a connected paper with Steve Fienberg. This lunch discussion even included some (mild) debate about Murray Aitkin’s integrated likelihood.
As a completely irrelevant aside, this trip gave me the opportunity of a “pilgrimage” to Birmingham New Street train station, 38 years after “landing” for the first time in Britain! And to experience a fresco the multiple delays and apologies of East Midlands trains (“we’re sorry we had to wait for this oil train in York”, “we have lost more time since B’ham”, “running a 37 minutes delay now”, “we apologize for the delay, due to trespassing”, …), the only positive side being that delayed trains made delayed connections possible!
In the plane to Atlanta, I happened to read a paper called Efficient simulation of the Ginibre point process by Laurent Decreusefond, Ian Flint, and Anaïs Vergne (from Telecom Paristech). “Happened to” as it was a conjunction of getting tipped by my new Dauphine colleague (and fellow blogger!) Djalil Chaffaï about the paper, having downloaded it prior to departure, and being stuck in a plane (after watching the only Chinese [somewhat] fantasy movie onboard, Saving General Yang).
This is mostly a mathematics paper. While indeed a large chunk of it is concerned with the rigorous definition of this point process in an abstract space, the last part is about simulating such processes. They are called determinantal (and not detrimental as I was tempted to interpret on my first read!) because the density of an n-set (x1, x2,…,xn) is given by a kind of generalised Vandermonde determinant
where T is defined in terms of an orthonormal family,
(The number n of points can be simulated via an a.s. finite Bernoulli process.) Because of this representation, the sequence of conditional densities for the xi‘s (i.e. x1, x2 given x1, etc.) can be found in closed form. In the special case of the Ginibre process, the ψi‘s are of the form
and the process cannot be simulated for it has infinite mass, hence an a.s. infinite number of points. Somehow surprisingly (as I thought this was the point of the paper), the authors then switch to a truncated version of the process that always has a fixed number N of points. And whose density has the closed form
It has an interestingly repulsive quality in that points cannot get close to one another. (It reminded me of the pinball sampler proposed by Kerrie Mengersen and myself at one of the Valencia meetings and not pursued since.) The conclusion (of this section) is anticlimactic, though, in that it is known that this density also corresponds to the distribution of the eigenvalues of an Hermitian matrix with standardized complex Gaussian entries. The authors mentions that the fact that the support is the whole complex space Cn is a difficulty, although I do not see why.
The following sections of the paper move to the Ginibre process restricted to a compact and then to the truncated Ginibre process restricted to a compact, for which the authors develop corresponding simulation algorithms. There is however a drag in that the sequence of conditionals, while available in closed-form, cannot be simulated efficiently but rely on a uniform accept-reject instead. While I am certainly missing most of the points in the paper, I wonder if a Gibbs sampler would not be an interesting alternative given that the full (last) conditional is a Gaussian density…