## self-healing umbrella sampling

Posted in Kids, pictures, Statistics, University life with tags , , , , , , , on November 5, 2014 by xi'an

Ten days ago, Gersende Fort, Benjamin Jourdain, Tony Lelièvre, and Gabriel Stoltz arXived a study about an adaptive umbrella sampler that can be re-interpreted as a Wang-Landau algorithm, if not the most efficient version of the latter. This reminded me very much of the workshop we had all together in Edinburgh last June. And even more of the focus of the molecular dynamics talks in this same ICMS workshop about accelerating the MCMC exploration of multimodal targets. The self-healing aspect of the sampler is to adapt to the multimodal structure thanks to a partition that defines a biased sampling scheme spending time in each set of the partition in a frequency proportional to weights. While the optimal weights are the weights of the sets against the target distribution (are they truly optimal?! I would have thought lifting low density regions, i.e., marshes, could improve the mixing of the chain for a given proposal), those are unknown and they need to be estimated by an adaptive scheme that makes staying in a given set the less desirable the more one has visited it. By increasing the inverse weight of a given set by a factor each time it is visited. Which sounds indeed like Wang-Landau. The plus side of the self-healing umbrella sampler is that it only depends on a scale γ (and on the partition). Besides converging to the right weights of course. The downside is that it does not reach the most efficient convergence, since the adaptivity weight decreases in 1/n rather than 1/√n.

Note that the paper contains a massive experimental side where the authors checked the impact of various parameters by Monte Carlo studies of estimators involving more than a billion iterations. Apparently repeated a large number of times.

The next step in adaptivity should be about the adaptive determination of the partition, hoping for a robustness against the dimension of the space. Which may be unreachable if I judge by the apparent deceleration of the method when the number of terms in the partition increases.

## MCqMC 2014 [day #3]

Posted in pictures, Running, Statistics, Travel, University life, Wines with tags , , , , , , , , , , , , , , , , on April 10, 2014 by xi'an

As the second day at MCqMC 2014, was mostly on multi-level Monte Carlo and quasi-Monte Carlo methods, I did not attend many talks but had a long run in the countryside (even saw a pheasant and a heron), worked at “home” on pressing recruiting evaluations and had a long working session with Pierre Jacob. Plus an evening out sampling (just) a few Belgian beers in the shade of the city hall…

Today was more in my ballpark as there were MCMC talks the whole day! The plenary talk was not about MCMC as Erich Novak presented a survey on the many available results bounding the complexity of approximating an integral based on a fixed number of evaluations of the integrand, some involving the dimension (and its curse), some not, some as fast as √n and some not as fast, all this depending on the regularity and the size of the classes of integrands considered. In some cases, the solution was importance sampling, in other cases, quasi-Monte Carlo, and yet other cases were still unsolved. Then Yves Atchadé gave a new perspective on computing the asymptotic variance in the central limit theorem on Markov chains when truncating the autocovariance, Matti Vihola talked about theoretical orderings of Markov chains that transmuted into the very practical consequence that using more simulations in a pseudo-marginal likelihood approximation improved acceptance rate and asymptotic variances (and this applies to aBC-MCMC as well), Radu Craiu proposed a novel processing of adaptive MCMC by treating various approximations to the true target as food for a multiple-try Metropolis algorithm, and Luca Martino had a go at resuscitating the ARMS algorithm of Gilks and Wild (used for a while in BUGS), although the talk did not dissipate all of my misgivings about the multidimensional version! I had more difficulties following the “Warwick session” which was made of four talks by current or former students from Warwick, although I appreciated the complexity of the results in infinite dimensional settings and novel approximations to diffusion based Metropolis algorithms. No further session this afternoon as the “social” activity was to visit the nearby Stella Artois brewery! This activity made us very social, for certain, even though there was hardly a soul around in this massively automated factory. (Maybe an ‘Og post to come one of those days…)

## sticky Metropolis

Posted in Statistics, University life with tags , , , , , , on September 6, 2013 by xi'an

My former student Roberto Casarin and his colleagues wrote (and arXived) a paper entitled Adaptive sticky generalized Metropolis algorithm. The basic idea is to use some of the rejected and past values of the chain to build an adaptive proposal, the criterion for choosing those values being related with the distance at the rejected point between the target and the proposal. In a sense, it gives a reward to surprising points, i.e. points where the proposal does poorly in approximating the target. On top of this, they include a multiple-try strategy where several values are generated from the current proposal and one of them is selected, to be accepted or rejected in a Metropolis step. The learning set may include several of the proposed (and rejected) values. This paper generalises Holden, Hauge and Holden (AoAP, 2009) and extends their proof of stationarity. The authors explore at length (the paper is 63 pages long!) the construction of the adaptive proposal distribution. This construction appears to be quite similar to Gilks’ and Wild’s (1993) ARMS algorithm. Hence, unless I missed a generalisation, it seems to me that the solutions are restricted to unidimensional settings. For instance, the authors propose to implement their algorithm for each complex conditional in a Gibbs sampler, meaning starting from scratch and running a large enough number of iterations to “reach” convergence. I also wonder at the correspondence between this construction and the original assumption of a minorisation condition wrt the target density in the event of an unbounded support. While this paper represents an interesting extension of the automated simulation algorithms of the ARMS type, and while the method is investigated thoroughly by several simulation experiments (in the second half of the paper), I remain somehow circumspect at the possibly of using ASMTM in complex high-dimensional problems as the learning cost soar with the dimension.

## adaptive Metropolis-Hastings sampling using reversible dependent mixture proposals

Posted in Statistics with tags , , , , , on May 23, 2013 by xi'an

In the plane to Birmingham, I was reading this recent arXived paper by Minh-Ngoc Tran, Michael K. Pitt, and Robert Kohn. The adaptive structure of their ACMH algorithm is based upon two parallel Markov chains, the former (called the trial chain) feeding the proposal densities of the later (called the main chain), bypassing the more traditional diminishing adaptation conditions. (Even though convergence actually follows from a minorisation condition.) These proposals are mixtures of t distributions fitted by variational Bayes approximations. Furthermore, the proposals are (a) reversible and (b) mixing local [dependent] and global [independent] components. One nice aspect of the reversibility is that the proposals do not have to be evaluated at each step.

The convergence results in the paper indeed assume a uniform minorisation condition on all proposal densities: although this sounded restrictive at first (but allows for straightforward proofs), I realised this could be implemented by adding a specific component to the mixture as in Corollary 3. (I checked the proof to realise that the minorisation on the proposal extends to the minorisation on the Metropolis-Hastings transition kernel.) A reversible kernel is defined as satisfying the detailed balance condition, which means that a single Gibbs step is reversible even though the Gibbs sampler as a whole is not. If a reversible Markov kernel with stationary distribution ζ is used, the acceptance probability in the Metropolis-Hastings transition is

α(x,z) = min{1,π(z)ζ(x)/π(x)ζ(z)}

(a result I thought was already known). The sweet deal is that the transition kernel involves Dirac masses, but the acceptance probability bypasses the difficulty. The way mixtures of t distributions can be reversible follows from Pitt & Walker (2006) construction, with  ζ  a specific mixture of t distributions. This target is estimated by variational Bayes. The paper further bypasses my classical objection to the use of normal, t or mixtures thereof, distributions:  this modelling assumes a sort of common Euclidean space for all components, which is (a) highly restrictive and (b) very inefficient in terms of acceptance rate. Instead, Tran & al. resort to Metropolis-within-Gibbs by constructing a partition of the components into subgroups.

## optimal direction Gibbs

Posted in Statistics, University life with tags , , , , , , on May 29, 2012 by xi'an

An interesting paper appeared on arXiv today. Entitled On optimal direction gibbs sampling, by Andrés Christen, Colin Fox, Diego Andrés Pérez-Ruiz and Mario Santana-Cibrian, it defines optimality as picking the direction that brings the maximum independence between two successive realisations in the Gibbs sampler. More precisely, it aims at choosing the direction e that minimises the mutual information criterion

$\int\int f_{Y,X}(y,x)\log\dfrac{f_{Y,X}(y,x)}{f_Y(y)f_X(x)}\,\text{d}x\,\text{d}y$

I have a bit of an issue about this choice because it clashes with measure theory. Indeed, in one Gibbs step associated with e the transition kernel is defined in terms of the Lebesgue measure over the line induced by e. Hence the joint density of the pair of successive realisations is defined in terms of the product of the Lebesgue measure on the overall space and of the Lebesgue measure over the line induced by e… While the product in the denominator is defined against the product of the Lebesgue measure on the overall space and itself. The two densities are therefore not comparable since not defined against equivalent measures… The difference between numerator and denominator is actually clearly expressed in the normal example (page 3) when the chain operates over a n dimensional space, but where the conditional distribution of the next realisation is one-dimensional, thus does not relate with the multivariate normal target on the denominator. I therefore do not agree with the derivation of the mutual information henceforth produced as (3).

The above difficulty is indirectly perceived by the authors, who note “we cannot simply choose the best direction: the resulting Gibbs sampler would not be irreducible” (page 5), an objection I had from an earlier page… They instead pick directions at random over the unit sphere and (for the normal case) suggest using a density over those directions such that

$h^*(\mathbf{e})\propto(\mathbf{e}^\prime A\mathbf{e})^{1/2}$

which cannot truly be called “optimal”.

More globally, searching for “optimal” directions (or more generally transforms) is quite a worthwhile idea, esp. when linked with adaptive strategies…

## MCMC at ICMS (3)

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , on April 26, 2012 by xi'an

The intense pace of the two first days of our workshop on MCMC at ICMS had apparently taken an heavy toll on the participants as a part of the audience was missing this morning! Although not as a consequence of the haggis of the previous night at the conference dinner, nor even as a result of the above pace. In fact, the missing participants had opted ahead of time for leaving the workshop early, which is understandable given everyone’s busy schedule, esp. for those attending both Bristol and Edinburgh workshops, however slightly impacting the atmosphere of the final day. (Except for Mark Girolami who most unfortunately suffered such a teeth infection that he had to seek urgent medical assistance yesterday afternoon. Best wishes to Mark for a prompt recovery, say I with a dental appointment tomorrow…!)

In [what I now perceive as] another recurrent theme of the workshop, namely the recourse to Gaussian structures like Gaussian processes (see, e.g., Ian Murray’s talk yesterday), Andrew Stuart gave us a light introduction to random walk Metropolis-Hastings algorithms on Hilbert spaces. In particular, he related to Ian Murray’s talk of yesterday as to the definition of a “new” random walk (due to Radford Neal)  that makes a proposal

$y=\sqrt{1-\beta^2}x_{t-1}+\beta\zeta\quad 0<\beta<1,\zeta\sim\varphi(|\zeta|)$

that still preserves the acceptance probability of the original (“old”) random walk proposal. The final talks of the morning were Krys Latuszynski’s and Nick Whiteley’s very pedagogical presentations of the convergence properties of manifold MALA and of particle filters for hidden Markov models.  In both cases, the speakers avoided the overly technical details and provided clear intuition in the presented results, a great feat after those three intense days of talks! (Having attended Nick’s talk in Paris two weeks ago helped of course.)

Unfortunately, due to very limited flight options (after one week of traveling around the UK) and also being slightly worried at the idea of missing my flight!, I had to leave the meeting along with all my French colleagues right after Jean-Michel Marin’s talk on (hidden) Potts driven mixtures, explaining the computational difficulties in deriving marginal likelihoods. I thus missed the final talk of the workshop by Gareth Tribello. And delivering my final remarks at the lunch break.

Overall, when reflecting on those two Monte Carlo workshops, I feel I preferred the pace of the Bristol workshop, because it allowed for more interactions between the participants by scheduling less talks… This being said, the organization at ICMS was superb (as usual!) and the talks were uniformly very good so it also was a very profitable meeting, of a different kind! As written earlier, among other things, it induced (in me) some reflections on a possible new research topic with friends there. Looking forward to visit Scotland again, of course!

## Banff workshop [BIRS 12w5105 meeting [#2]]

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , on March 21, 2012 by xi'an

Today the program of 12w5105 was more on the theoretical side with adaptive MCMC in the morning and ABC in the afternoon. Éric Moulines and Gersende Fort shared a talk on two papers, one on adaptive tempering and the other one on equi-energy sampling, then Nando de Freitas spoke first about Gaussian process approximation for Bayesian optimisation, then about an adaptive Hamiltonian technique called Sardonics. And Jeff Rosenthal concluded the morning with a review of the results ensuring convergence for adaptive MCMC (with a delightful counter-example called Stairways to Heaven that reminded me of an ice climb in Utah!). After my talk, where Scott Sisson made an interesting comment on the difficulty to extend our framework to a large collection of models (since then the summary statistics have to differ), François Perron discussed in highly interesting details several approximation techniques for the Bayesian estimation of copulas and Scott Sisson presented his recent arXiv paper where a rough estimate of the joint posterior is obtained regression-adjustment ABC, and then estimates of each marginal posterior distribution are separately obtained in a lower-dimensional analysis, all this being connected with Bayes linear analysis. (I do not completely get the way summary statistics are selected for each marginal there, which seems to be done by hand. While I understand why using a lower-dimensional statistic helps in improving the approximation of the marginal posteriors and fights the curse of dimensionality, the fact that the joint posterior sample is based on different summary statistics for the different components makes an interesting statistical puzzle. Maybe the copula approach by François in the previous talk could be used at the final stage.) The final talk by Zhiqiang Tan on comparative performances of resampling and subsampling strategies generated a very animated discussion. (All talks being recorded, mine is available as an mp4 video but watch at your own peril!)