When playing with Peter Rossi’s bayesm R package during a visit of Jean-Michel Marin to Paris, last week, we came up with the above Gibbs outcome. The setting is a Gaussian mixture model with three components in dimension 5 and the prior distributions are standard conjugate. In this case, with 500 observations and 5000 Gibbs iterations, the Markov chain (for one component of one mean of the mixture) has two highly distinct regimes: one that revolves around the true value of the parameter, 2.5, and one that explores a much broader area (which is associated with a much smaller value of the component weight). What we found amazing is the Gibbs ability to entertain both regimes, simultaneously.
Archive for Markov chain Monte Carlo
The September issue of [JRSS] Series B I received a few days ago is of particular interest to me. (And not as an ex-co-editor since I was never involved in any of those papers!) To wit: a paper by Hani Doss and Aixin Tan on evaluating normalising constants based on MCMC output, a preliminary version I had seen at a previous JSM meeting, a paper by Nick Polson, James Scott and Jesse Windle on the Bayesian bridge, connected with Nick’s talk in Boston earlier this month, yet another paper by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Michael Jordan on the bag of little bootstraps, which presentation I heard Michael deliver a few times when he was in Paris. (Obviously, this does not imply any negative judgement on the other papers of this issue!)
For instance, Doss and Tan consider the multiple mixture estimator [my wording, the authors do not give the method a name, referring to Vardi (1985) but missing the connection with Owen and Zhou (2000)] of k ratios of normalising constants, namely
where the z’s are the normalising constants and with possible different numbers of iterations of each Markov chain. An interesting starting point (that Hans Künsch had mentioned to me a while ago but that I had since then forgotten) is that the problem was reformulated by Charlie Geyer (1994) as a quasi-likelihood estimation where the ratios of all z’s relative to one reference density are the unknowns. This is doubling interesting, actually, because it restates the constant estimation problem into a statistical light and thus somewhat relates to the infamous “paradox” raised by Larry Wasserman a while ago. The novelty in the paper is (a) to derive an optimal estimator of the ratios of normalising constants in the Markov case, essentially accounting for possibly different lengths of the Markov chains, and (b) to estimate the variance matrix of the ratio estimate by regeneration arguments. A favourite tool of mine, at least theoretically as practically useful minorising conditions are hard to come by, if at all available.
Scott Schmidler and his Ph.D. student Douglas VanDerwerken have arXived a paper on parallel MCMC the very day I left for Chamonix, prior to MCMSki IV, so it is no wonder I missed it at the time. This work is somewhat in the spirit of the parallel papers Scott et al.’s consensus Bayes, Neiswanger et al.’s embarrassingly parallel MCMC, Wang and Dunson’s Weierstrassed MCMC (and even White et al.’s parallel ABC), namely that the computation of the likelihood can be broken into batches and MCMC run over those batches independently. In their short survey of previous works on parallelization, VanDerwerken and Schmidler overlooked our neat (!) JCGS Rao-Blackwellisation with Pierre Jacob and Murray Smith, maybe because it sounds more like post-processing than genuine parallelization (in that it does not speed up the convergence of the chain but rather improves the Monte Carlo usages one can make of this chain), maybe because they did not know of it.
“This approach has two shortcomings: first, it requires a number of independent simulations, and thus processors, equal to the size of the partition; this may grow exponentially in dim(Θ). Second, the rejection often needed for the restriction doesn’t permit easy evaluation of transition kernel densities, required below. In addition, estimating the relative weights wi with which they should be combined requires care.” (p.3)
The idea of the authors is to replace an exploration of the whole space operated via a single Markov chain (or by parallel chains acting independently which all have to “converge”) with parallel and independent explorations of parts of the space by separate Markov chains. “Small is beautiful”: it takes a shorter while to explore each set of the partition, hence to converge, and, more importantly, each chain can work in parallel to the others. More specifically, given a partition of the space, into sets Ai with posterior weights wi, parallel chains are associated with targets equal to the original target restricted to those Ai‘s. This is therefore an MCMC version of partitioned sampling. With regard to the shortcomings listed in the quote above, the authors consider that there does not need to be a bijection between the partition sets and the chains, in that a chain can move across partitions and thus contribute to several integral evaluations simultaneously. I am a bit worried about this argument since it amounts to getting a random number of simulations within each partition set Ai. In my (maybe biased) perception of partitioned sampling, this sounds somewhat counter-productive, as it increases the variance of the overall estimator. (Of course, not restricting a chain to a given partition set Ai has the incentive of avoiding a possibly massive amount of rejection steps. It is however unclear (a) whether or not it impacts ergodicity (it all depends on the way the chain is constructed, i.e. against which target(s)…) as it could lead to an over-representation of some boundaries and (b) whether or not it improves the overall convergence properties of the chain(s).)
“The approach presented here represents a solution to this problem which can completely remove the waiting times for crossing between modes, leaving only the relatively short within-mode equilibration times.” (p.4)
A more delicate issue with the partitioned MCMC approach (in my opinion!) stands with the partitioning. Indeed, in a complex and high-dimension model, the construction of the appropriate partition is a challenge in itself as we often have no prior idea where the modal areas are. Waiting for a correct exploration of the modes is indeed faster than waiting for crossing between modes, provided all modes are represented and the chain for each partition set Ai has enough energy to explore this set. It actually sounds (slightly?) unlikely that a target with huge gaps between modes will see a considerable improvement from the partioned version when the partition sets Ai are selected on the go, because some of the boundaries between the partition sets may be hard to reach with a off-the-shelf proposal. (Obviously, the second part of the method on the adaptive construction of partitions is yet in the writing and I am looking forward its aXival!)
Furthermore, as noted by Pierre Jacob (of Statisfaction fame!), the adaptive construction of the partition has a lot in common with Wang-Landau schemes. Which goal is to produce a flat histogram proposal from the current exploration of the state space. Connections with Atchadé’s and Liu’s (2010, Statistical Sinica) extension of the original Wang-Landau algorithm could have been spelled out. Esp. as the Voronoï tessellation construct seems quite innovative in this respect.
Unfortunately, I will miss the incoming Bayes in Paris seminar next Thursday (27th February), as I will be flying to Montréal and then Québec at the time (despite having omitted to book a flight till now!). Indeed Amandine Shreck will give a talk at 2pm in room 18 of ENSAE, Malakoff, on A shrinkage-thresholding Metropolis adjusted Langevin algorithm for Bayesian variable selection, a work written jointly with Gersende Fort, Sylvain Le Corff, and Eric Moulines, and arXived at the end of 2013 (which may explain why I missed it!). Here is the abstract:
This paper introduces a new Markov Chain Monte Carlo method to perform Bayesian variable selection in high dimensional settings. The algorithm is a Hastings-Metropolis sampler with a proposal mechanism which combines (i) a Metropolis adjusted Langevin step to propose local moves associated with the differentiable part of the target density with (ii) a shrinkage-thresholding step based on the non-differentiable part of the target density which provides sparse solutions such that small components are shrunk toward zero. This allows to sample from distributions on spaces with different dimensions by actually setting some components to zero. The performances of this new procedure are illustrated with both simulated and real data sets. The geometric ergodicity of this new transdimensional Markov Chain Monte Carlo sampler is also established.
(I will definitely get a look at the paper over the coming days!)
Sasha Rakhlin from Wharton sent me this paper he wrote (and arXived) with Hariharan Narayanan on a specific Markov chain algorithm that handles sequential Monte Carlo problems for log-concave targets. By relying on novel (by my standards) mathematical techniques, they manage to obtain geometric ergodicity results for random-walk based algorithms and log-concave targets. One of the new tools is the notion of self-concordant barrier, a sort of convex potential function F associated with a reference convex set and with Lipschitz properties. The second tool is a Gaussian distribution based on the metric induced by F. The third is the Dikin walk Markov chain, which uses this Gaussian as proposal and moves almost like the Metropolis-Hastings algorithm, except that it rejects with at least a probability of ½. The scale (or step size) of the Gaussian proposal is determined by the regularity of the log-concave target. In that setting, the total variation distance between the target at the t-th level and the distribution of the Markov chain can be fairly precisely approximated. Which leads in turn to a scaling of the number of random walk steps that are necessary to ensure convergence. Depending on the pace of the moving target, a single step of the random walk may be sufficient, which is quite an interesting feature.
I was attending a lecture this morning at CREST by Patrice Bertail where he was using estimated renewal parameters on a Markov chain to build (asymptotically) convergent bootstrap procedures. Estimating renewal parameters is obviously of interest in MCMC algorithms as they can be used to assess the convergence of the associated Markov chain: That is, if the estimation does not induce a significant bias. Another question that came to me during the talk is that; since those convergence assessments techniques are formally holding for any small set, choosing the small set in order to maximise the renewal rate also maximises the number of renewal events and hence the number of terms in the control sequence: Thus, the maximal renewal rate þ is definitely a quantity of interest: Now, is this quantity þ an intrinsic parameter of the chain, i.e. a quantity that drives its mixing and/or converging behaviour(s)? For instance; an iid sequence has a renewal rate of 1; because the whole set is a “small” set. Informally, the time between two consecutive renewal events is akin to the time between two simulations from the target and stationary distribution, according to the Kac’s representation we used in our AAP paper with Jim Hobert. So it could be that þ is directly related with the effective sample size of the chain, hence the autocorrelation. (A quick web search did not produce anything relevant:) Too bad this question did not pop up last week when I had the opportunity to discuss it with Sean Meyn in Gainesville!
This week, my student Dona Skanji gave a presentation of the paper of Hastings “Monte Carlo sampling methods using Markov chains and their applications“, which set the rules for running MCMC algorithms, much more so than the original paper by Metropolis et al.
which presented an optimisation device. even though the latter clearly stated the Markovian principle of those algorithms and their use for integration. (This is definitely a classic, selected in the book Biometrika: One hundred years, by Mike Titterington and David Cox.) Here are her slides (the best Beamer slides so far!):
Given that I had already taught my lectures on Markov chains and on MCMC algorithms, the preliminary part of Dona’s talk was easier to compose and understanding the principles of the method was certainly more straightforward than for the other papers in the series. I think she nonetheless did a rather good job in summing up the paper, running this extra simulation for the Poisson distribution—with the interesting “mistake” of including the burnin time in the representation of the output and concluding about a poor convergence—and mentioning the Gibbs extension.I led the discussion of the seminar towards irreducibility conditions and Peskun’s ordering of Markov chains, which maybe could have been mentioned by Dona since she was aware Peskun was Hastings‘ student.