In her plenary talk this morning, Christiane Lemieux discussed connections between quasi-Monte Carlo and copulas, covering a question I have been considering for a while: namely, when provided with a (multivariate) joint cdf F, is there a generic way to invert a vector of uniforms [or quasi-uniforms] into a simulation from F? For Archimedean copulas (as we can always get back to copulas), there is a resolution via the Marshall-Olkin representation, but this puts a restriction on the distributions F that can be considered.

The session on synthetic likelihoods [as introduced by Simon Wood in 2010] put together by Scott Sisson was entirely focussed on using normal approximations for the distribution of the vector of summary statistics, rather than the standard ABC non-parametric approximation. While there is a clear (?) advantage in using a normal pseudo-likelihood, since it stabilises with far fewer simulations than a non-parametric version, I find it difficult to compare the two approaches, as they lead to different posterior distributions. In particular, I wonder at the impact of the dimension of the summary statistics on the approximation, in the sense that the joint distribution is less and less likely to be normal as this dimension increases. Whether this is damaging for the resulting inference is another issue, possibly handled by a supplementary ABC step that would take the first-step estimate as summary statistic. (As a side remark, I am intrigued that everyone is so concerned with the unbiasedness of methods that are approximations, with no assessment of the amount of approximation!)

The last session of the day was about multimodality and MCMC solutions, with talks by Hyungsuk Tak, Pierre Jacob and Babak Shahbaba, plus mine. Hyungsuk presented the RAM algorithm I discussed earlier under the title of “love-hate” algorithm, which was a kind reference to my post!
(I remain puzzled by the ability of the algorithm to jump to another mode, given that the intermediary step aims at a low or even zero probability region with an infinite mass target.) And Pierre talked about using SMC for Wang-Landau algorithms, with a twist to the classical stochastic optimisation schedule that preserves convergence. And a terrific illustration on a distribution inspired from the Golden Gate Bridge that reminded me of my recent crossing! The discussion around my folded Markov chain talk focussed on the extension of the partition to more than two sets, the difficulty being in generating automated projections, with comments about connections with computer graphics tools. (Too bad that the parallel session saw talks by Mark Huber and Rémi Bardenet that I missed! Enjoying a terrific Burmese dinner with Rémi, Pierre and other friends also meant I could not post this entry on time for the customary 00:16. Not that it matters in the least…)
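To make the Marshall-Olkin route mentioned above concrete: for a Clayton copula, the Archimedean generator is the Laplace transform of a Gamma frailty, so a whole vector of dependent uniforms follows from one Gamma draw and d exponential draws. A minimal sketch, where the function name and toy parameter values are mine:

```python
import numpy as np

def clayton_marshall_olkin(n, d, theta, rng=None):
    """Sample n points from a d-dimensional Clayton copula (theta > 0)
    via the Marshall-Olkin frailty representation: draw the frailty
    V ~ Gamma(1/theta), iid E_i ~ Exp(1), and set U_i = psi(E_i / V)
    with generator psi(t) = (1 + t)^(-1/theta)."""
    rng = np.random.default_rng(rng)
    v = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n, 1))  # shared frailty
    e = rng.exponential(size=(n, d))                          # iid Exp(1)
    return (1.0 + e / v) ** (-1.0 / theta)                    # uniforms on (0,1]

# theta = 2 gives strong positive dependence (Kendall's tau = 1/2)
u = clayton_marshall_olkin(10_000, 3, theta=2.0, rng=42)
```

Each margin of `u` is uniform while the components are positively dependent, which is exactly the generic inversion one would like for arbitrary F, and indeed it only works because the Clayton generator has this frailty form.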
Archive for Wang-Landau algorithm
A few months ago, Nick Polson and James Scott arXived a paper on one of my favourite problems, namely the approximation of normalising constants (and it went way under my radar, as I only became aware of it quite recently! It then remained in my travel bag for a few extra weeks…). The method for approximating the constant Z draws from an analogy with the energy level sampling methods found in physics, like the Wang-Landau algorithm. The authors rely on a one-dimensional slice sampling representation of the posterior distribution and [main innovation in the paper] add a weight function on the auxiliary uniform. The choice of the weight function links the approach with the dreaded harmonic mean estimator (!), but also with power-posterior and bridge sampling. The paper recommends a specific weighting function, based on a “score-function heuristic” I do not get. Further, the optimal weight depends on intractable cumulative functions, as in nested sampling. It would be fantastic if one could draw directly from the prior distribution of the likelihood function—rather than draw an x [from the prior or from something better, as suggested in our 2009 Biometrika paper] and transform it into L(x)—but as in all existing alternatives this alas is not the case. (Which is why I find the recommendations in the paper for practical implementation rather impractical since, were the prior cdf of L(X) available, direct simulation of L(X) would be feasible. Maybe not the optimal choice, though.)
“What is the distribution of the likelihood ordinates calculated via nested sampling? The answer is surprising: it is essentially the same as the distribution of likelihood ordinates by recommended weight function from Section 4.”
The approach is thus very much related to nested sampling, at least in spirit. As the authors later demonstrate, nested sampling is another instance of weighting. Both versions require simulations under truncated likelihood values, albeit with the possibility of going down [in likelihood values] in the current version. Actually, more weighting could prove [even more] efficient, as both the original nested sampling and this vertical-likelihood sampling simulate from the prior under the likelihood constraint, and getting away from the prior should help. (I am quite curious to see how the method is received and applied.)
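Since the recommended weighting essentially reproduces the distribution of nested-sampling likelihood ordinates, a toy nested sampler may help as a reference point: it accumulates Z = ∫ L(x)π(x)dx as a sum of likelihood levels times deterministic prior-mass shell widths, drawing from the prior under the hard likelihood constraint by naive rejection (only viable in very low dimension). All names and the one-dimensional Gaussian example, for which Z is known in closed form, are mine:

```python
import numpy as np

def log_sum_exp(a):
    a = np.asarray(a, dtype=float)
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def nested_evidence(loglike, prior_draw, n_live=400, n_iter=2000, rng=None):
    """Toy nested sampler: at step i the worst live point, at level L_min,
    contributes L_min * (X_{i-1} - X_i) with deterministic prior-mass
    shrinkage X_i = exp(-i/n_live), and is replaced by a fresh prior draw
    constrained to L > L_min (naive rejection sampling)."""
    rng = np.random.default_rng(rng)
    live = prior_draw(rng, n_live)
    live_ll = loglike(live)
    terms = []
    for i in range(1, n_iter + 1):
        worst = np.argmin(live_ll)
        ll_min = live_ll[worst]
        # log of shell width X_{i-1} - X_i = exp(-i/N) * (exp(1/N) - 1)
        terms.append(ll_min - i / n_live + np.log(np.expm1(1.0 / n_live)))
        while True:  # rejection under the hard likelihood constraint
            x = prior_draw(rng, 1)
            lx = loglike(x)[0]
            if lx > ll_min:
                live[worst], live_ll[worst] = x[0], lx
                break
    # leftover prior mass X_n spread over the remaining live points
    terms.append(log_sum_exp(live_ll) - np.log(n_live) - n_iter / n_live)
    return log_sum_exp(terms)

# toy check: prior N(0,1) and likelihood N(x; 0, 1), so Z = N(0; 0, 2)
loglike = lambda x: -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)
logz = nested_evidence(loglike, lambda rng, n: rng.standard_normal(n), rng=0)
```

The rejection step is exactly the "simulate from the prior under the likelihood constraint" that both methods share, and it is also what makes both slow once the constrained prior mass shrinks.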
Ten days ago, Gersende Fort, Benjamin Jourdain, Tony Lelièvre, and Gabriel Stoltz arXived a study about an adaptive umbrella sampler that can be re-interpreted as a Wang-Landau algorithm, if not the most efficient version of the latter. This reminded me very much of the workshop we had all together in Edinburgh last June. And even more of the focus of the molecular dynamics talks in this same ICMS workshop about accelerating the MCMC exploration of multimodal targets. The self-healing aspect of the sampler is to adapt to the multimodal structure thanks to a partition that defines a biased sampling scheme, spending time in each set of the partition with a frequency proportional to its weight. While the optimal weights are the weights of the sets against the target distribution (are they truly optimal?! I would have thought lifting low density regions, i.e., marshes, could improve the mixing of the chain for a given proposal), those are unknown and they need to be estimated by an adaptive scheme that makes staying in a given set less desirable the more often it has been visited, by increasing the inverse weight of the set by a factor each time it is visited. Which indeed sounds like Wang-Landau. The plus side of the self-healing umbrella sampler is that it only depends on a scale γ (and on the partition). Besides converging to the right weights, of course. The downside is that it does not reach the most efficient convergence, since the adaptivity gain decreases in 1/n rather than 1/√n.
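The bias-and-penalise mechanism fits in a few lines: a random-walk Metropolis chain targets π(x)/θ(I(x)) over a fixed partition, and each visit to a set inflates its log-weight by a gain decreasing in 1/n, the schedule discussed above. This is only a bare-bones illustration on a toy bimodal target, with function names and tuning values of my own choosing, not the authors' implementation:

```python
import numpy as np

def self_healing_wl(logpi, edges, n_steps=100_000, step=1.0, rng=None):
    """Random-walk Metropolis on the biased target pi(x)/theta(I(x)),
    where I(x) is the index of the partition set containing x.  Each
    visit to a set increases its log-weight by the current gain 1/n, so
    occupied sets become progressively less attractive; at convergence
    theta is proportional to the target masses of the sets."""
    rng = np.random.default_rng(rng)
    k = len(edges) - 1
    log_theta = np.zeros(k)
    set_of = lambda z: int(np.clip(np.searchsorted(edges, z) - 1, 0, k - 1))
    x = 0.5 * (edges[0] + edges[-1])
    ix = set_of(x)
    for n in range(1, n_steps + 1):
        y = x + step * rng.standard_normal()
        iy = set_of(y)
        # acceptance ratio of the biased target
        if np.log(rng.random()) < logpi(y) - logpi(x) + log_theta[ix] - log_theta[iy]:
            x, ix = y, iy
        log_theta[ix] += 1.0 / n          # penalise the occupied set
    w = np.exp(log_theta - log_theta.max())
    return w / w.sum()                    # estimated set probabilities

# toy bimodal target with a symmetric partition at zero: both sets have mass 1/2
logpi = lambda z: np.logaddexp(-0.5 * (z + 1.5) ** 2, -0.5 * (z - 1.5) ** 2)
w = self_healing_wl(logpi, np.array([-10.0, 0.0, 10.0]), rng=3)
```

As a side benefit, the converged weights are themselves estimates of the relative masses of the partition sets, for free.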
Note that the paper contains a massive experimental side, where the authors checked the impact of various parameters by Monte Carlo studies of estimators involving more than a billion iterations, apparently repeated a large number of times.
The next step in adaptivity should be about the adaptive determination of the partition, hoping for a robustness against the dimension of the space. Which may be unreachable if I judge by the apparent deceleration of the method when the number of terms in the partition increases.
Yesterday, Luke Bornn and Pierre Jacob gave talks at our big’MC ‘minar. While I had seen most of the slides earlier, either at MCMski IV, Banff, Leuven or yet again in Oxford, I really enjoyed those talks as they provided further intuition about the techniques of Wang-Landau and non-negative unbiased estimators, leading to a few seeds of potential ideas for even more potential research. For instance, I understood way better the option to calibrate the Wang-Landau algorithm on levels of the target density rather than in the original space. Which means (a) a one-dimensional partition target (just as in nested sampling); (b) taking advantage of the existing computations of the likelihood function; and (c) a somewhat automatic implementation of the Wang-Landau algorithm. I do wonder why this technique is not more popular as a default option. (Like, would it be compatible with Stan?) The impossibility theorem of Pierre about the existence of non-negative unbiased estimators never ceases to amaze me. I started wondering during the seminar whether a positive (!) version of the result could be found. Namely, whether perturbations of the exact (unbiased) Metropolis-Hastings acceptance ratio could be substituted in order to guarantee positivity. Possibly creating drifted versions of the target…
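The level-based calibration can be sketched in almost the same way as a standard Wang-Landau run: the partition index of x is the bin of logπ(x) itself, so the partition is one-dimensional whatever the dimension of x, and the density evaluation is shared with the Metropolis ratio. A rough illustration on a two-dimensional Gaussian, with level bins and tuning values of my own choosing:

```python
import numpy as np

def wl_on_density_levels(logpi, level_edges, n_steps=100_000, step=1.0, rng=None):
    """Wang-Landau with the partition defined on values of log pi(x)
    ("energy levels"): bin_of(logpi(x)) replaces a partition of the state
    space, so the same logpi evaluation serves both to locate the current
    level set and to compute the acceptance ratio."""
    rng = np.random.default_rng(rng)
    k = len(level_edges) - 1
    log_theta = np.zeros(k)
    bin_of = lambda l: int(np.clip(np.searchsorted(level_edges, l) - 1, 0, k - 1))
    x = np.zeros(2)
    lx = logpi(x)
    ix = bin_of(lx)
    for n in range(1, n_steps + 1):
        y = x + step * rng.standard_normal(x.shape)
        ly = logpi(y)                     # one evaluation, reused twice
        iy = bin_of(ly)
        if np.log(rng.random()) < ly - lx + log_theta[ix] - log_theta[iy]:
            x, lx, ix = y, ly, iy
        log_theta[ix] += 1.0 / n          # 1/n adaptation gain
    w = np.exp(log_theta - log_theta.max())
    return w / w.sum()                    # estimated level-set probabilities

# 2-D standard normal: log pi(x) = -|x|^2/2 up to a constant; the outermost
# level [-8,-4] has tiny target mass but still gets forced visits by the bias
logpi = lambda z: -0.5 * float(z @ z)
w = wl_on_density_levels(logpi, np.array([-8.0, -4.0, -2.0, -0.5, 0.0]), rng=7)
```

Nothing here depends on the dimension of x except the proposal, which is what makes the option feel like a plausible default.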
One request in connection with this post: please connect the Institut Henri Poincaré to the eduroam wireless network! The place is dedicated to visiting mathematicians and theoretical physicists, it should have been the first one [in Paris] to get connected to eduroam. The cost cannot be that horrendous so I wonder what the reason is. Preventing guests from connecting to the Internet towards better concentration? avoiding “parasites” taking advantage of the network? ensuring seminar attendees are following the talks? (The irony is that Institut Henri Poincaré has a local wireless available for free, except that it most often does not work with my current machine. And hence wastes much more of my time as I attempt to connect over and over again while there.) Just in connection with IHP, a video of Persi giving a talk there about Poincaré, two years ago:
I hesitated in changing the above title for “MCqMSmaug” as the plenary talk I attended this morning was given by Wenzel Jakob, who uses Markov chain Monte Carlo methods in image rendering and light simulation. The talk was low-tech’, with plenty of pictures and animations (incl. excerpts from recent blockbusters!), but it stressed how much proper rendering relies on powerful MCMC techniques. One point particularly attracted my attention, namely the notion of manifold exploration, as it seemed related to my recent zero measure post. (A related video is available on Jakob’s webpage.) You may then wonder where the connection with Smaug could be found: Wenzel Jakob is listed in the credits of both Hobbit movies for his contributions to the visual effects! (Hey, MCMC made Smaug [visual effects the way they are], a cool argument for selling your next MCMC course! I will for sure include a picture of Smaug in my next R class presentation…) The next sessions of the morning opposed Sobol’s memorial to more technical light rendering and I chose Sobol, esp. because I had missed Art Owen’s tutorial on Sunday, as he gave a short presentation on using Sobol’s criteria to identify variables contributing the most to the variability or extreme values of a function, an extreme value kind of ANOVA, most interesting if far from my simulation area… The afternoon sessions saw MCMC talks by Luke Bornn and Scott Schmidler, both having a connection with the Wang-Landau algorithm. Actually, Scott’s talk was the one generating the most animated discussion among all those I attended in MCqMC! (To the point of the chairman rather rudely making faces…)
Our paper “Bayesian computation for statistical models with intractable normalizing constants” written with Yves Atchadé and Nicolas Lartillot has now appeared in the Brazilian Journal of Probability and Statistics! (This is my first paper in this journal.) Although we could have used ABC steps to approximate the posterior, we chose instead a Wang-Landau approach that has exact convergence properties and extends the 1992 Read Paper by Charlie Geyer and Elisabeth Thompson.
Last day of a great workshop! I filled more pages of my black notebook (“bloc”) than in the past month!!! This morning started with a Hamiltonian session, Paul Fearnhead presenting recent developments in this area. I liked his coverage very much, esp. because it went away from the physics analogies that always put me off. The idea of getting away from the quadratic form had always seemed natural to me and provided an interesting range for investigations. (I think I rediscovered the topic during the talks, rephrasing almost the same questions as for Girolami’s and Calderhead’s Read Paper!) One thing that still intrigues me is the temporal dimension of the Hamiltonian representation. Indeed, it is “free” in the sense that the simulation problem does not depend on the time the pair (x,p) is moved along the equipotential curve. (In practice, there is a cost in running this move because it needs to be discretised.) But there is no clear target function to set the time “right”. The only scale I can think of is when the pair comes back to its starting point. Which is less silly than it sounds because the discretisation means that all intermediate points can be used, as suggested by Paul via a multiple try scheme. Mark then presented an application of Hamiltonian ideas and schemes to biochemical dynamics, with a supplementary trick of linearisation. Christian Lorenz Müller gave an ambitious grand tour of gradient free optimisation techniques that sounded appealing from a simulation perspective (but would require a few more hours to apprehend!), Geoff Nicholls presented on-going research on approximating Metropolis-Hastings acceptance probabilities in a more general perspective than à la Andrieu-Robert, i.e. accepting some amount of bias, an idea he had explained to me when I visited Oxford. And Pierre Jacob concluded the meeting in the right tone with a pot-pourri of his papers on Wang-Landau.
(Once again a talk I had already heard but that helped me make more sense of a complex notion…)
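On the "free" time of the Hamiltonian flow: the continuous dynamics leave H(x,p) invariant for any duration, so the travel time εL of the discretised trajectory is a pure tuning parameter, and for a Gaussian target the flow is periodic, with the pair (x,p) indeed coming back to its starting point. A minimal leapfrog sketch of this (all names are mine):

```python
import numpy as np

def leapfrog(grad_U, x, p, eps, n_steps):
    """Leapfrog integration of the Hamiltonian dynamics for
    H(x, p) = U(x) + |p|^2/2: half momentum step, alternating full
    position/momentum steps, final half momentum step.  The travel
    time eps * n_steps is a free tuning parameter."""
    x, p = np.copy(x), np.copy(p)
    p -= 0.5 * eps * grad_U(x)            # initial half step in momentum
    for _ in range(n_steps - 1):
        x += eps * p                      # full step in position
        p -= eps * grad_U(x)              # full step in momentum
    x += eps * p
    p -= 0.5 * eps * grad_U(x)            # final half step in momentum
    return x, p

# standard normal target: U(x) = |x|^2/2, so the exact flow is a rotation
# of (x, p) with period 2*pi, whatever the starting point
grad_U = lambda x: x
H = lambda x, p: 0.5 * x @ x + 0.5 * p @ p
x0, p0 = np.array([1.0]), np.array([0.0])
# integrate for total time ~ 2*pi: the pair returns near its start,
# with H nearly conserved despite the discretisation
x1, p1 = leapfrog(grad_U, x0, p0, eps=0.01, n_steps=628)
```

The discretisation only wobbles H by O(ε²), which is why the "return to start" criterion is observable in practice and why all the intermediate points along the trajectory remain usable.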
Overall and talk-by-talk, a truly exceptional meeting. Which also set the bar quite high for us to compete at the ICMS meeting on advances in MCMC next Monday! Esp. when a portion of the audience in Bristol will appear in Edinburgh as well! In the meanwhile, I have to rewrite my talk for the seminar in Glasgow tomorrow in order to remove the overlap with my talk there last year… (I note that I have just managed to fly to Scotland with no lost bag, a true achievement!)