## slice sampling for Dirichlet mixture process

Posted in Books, Statistics, University life with tags , , , , , , , on June 21, 2017 by xi'an

When working with my PhD student Changye in Dauphine this morning I realised that slice sampling also applies to discrete support distributions and could even be of use in such settings. That it works is (now) straightforward in that the missing variable representation behind the slice sampler also applies to densities defined with respect to a discrete measure. That this is useful transpires from the short paper of Stephen Walker (2007) where we saw this, as Stephen relies on the slice sampler to sample from the Dirichlet mixture model by eliminating the tail problem associated with this distribution. (This paper appeared in Communications in Statistics and it is through Pati & Dunson (2014) taking advantage of this trick that Changye found about its very existence. I may have known about it in an earlier life, but I had clearly forgotten everything!)

While the prior distribution (of the weights) of the Dirichlet mixture process is easy to generate via the stick breaking representation, the posterior distribution is trickier as the weights are multiplied by the values of the sampling distribution (likelihood) at the corresponding parameter values and they cannot be normalised. Introducing a uniform to replace all weights in the mixture with an indicator that the uniform is less than those weights corresponds to a (latent variable) completion [or a demarginalisation as we called this trick in Monte Carlo Statistical Methods]. As elaborated in the paper, the Gibbs steps corresponding to this completion are easy to implement, involving only a finite number of components. Meaning the allocation to a component of the mixture can be operated rather efficiently. Or not when considering that the weights in the Dirichlet mixture are not monotone, hence that a large number of them may need to be computed before picking the next index in the mixture when the uniform draw happens to be quite small.

## common derivation for Metropolis–Hastings and other MCMC algorithms

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , on July 25, 2016 by xi'an

Khoa Tran and Robert Kohn from UNSW just arXived a paper on a comprehensive derivation of a large range of MCMC algorithms, beyond Metropolis-Hastings. The idea is to decompose the MCMC move into

1. a random completion of the current value θ into V;
2. a deterministic move T from (θ,V) to (ξ,W), where only ξ matters.

If this sounds like a new version of Peter Green’s completion at the core of his 1995 RJMCMC algorithm, it is because it is indeed essentially the same notion. The resort to this completion allows for a standard form of the Metropolis-Hastings algorithm, which leads to the correct stationary distribution if T is self-inverse. This representation covers Metropolis-Hastings algorithms, Gibbs sampling, Metropolis-within-Gibbs and auxiliary variables methods, slice sampling, recursive proposals, directional sampling, Langevin and Hamiltonian Monte Carlo, NUTS sampling, pseudo-marginal Metropolis-Hastings algorithms, and pseudo-marginal Hamiltonian  Monte Carlo, as discussed by the authors. Given this representation of the Markov chain through a random transform, I wonder if Peter Glynn’s trick mentioned in the previous post on retrospective Monte Carlo applies in this generic setting (as it could considerably improve convergence…)

## AISTATS 2016 [#2]

Posted in Kids, pictures, Running, Statistics, Travel, University life, Wines with tags , , , , , , , , , , , , , , , on May 13, 2016 by xi'an

The second and third days of AISTATS 2016 passed like a blur, with not even the opportunity to write my impressions in real time! Maybe long tapa breaks are mostly to blame for this… In any case, we had two further exciting plenary talks about privacy-preserving data analysis by Kamalika Chaudhuri and crowdsourcing and machine learning by Adam Tauman Kalai. The talk by Kamalika was covering recent results by Kamalika and coauthors about optimal privacy preservation in classification and a generalisation to correlated data, with the neat notion of a Markov Quilt.  Other talks that same day also dwelt on this privacy issue, but I could not be . The talk by Adam was full of fun illustrations on humans training learning systems (with the unsolved difficulty of those humans deliberately mis-training the system, as exhibited recently by the short-lived Microsoft Tay experiment).

Both poster sessions were equally exciting, with the addition of MLSS student posters on the final day. Among many, I particularly enjoyed Iain Murray’s pseudo-marginal slice sampling, David Duvenaud’s fairly intriguing use of early stopping for non-parametric inference,  Garrett Bernstein’s work on aggregated Markov chains, Ye Wang’s scalable geometric density estimation [with a special bonus for his typo on the University of Turing, instead of Torino], Gemma Moran’s and Chengtao Li’s posters on determinantal processes, and Matej Balog’s Mondrian forests with a Laplace kernel [envisioning potential applications for ABC]. Again, just to mention a few…

The participants [incl. myself] also took one evening off to visit a sherry winery in Jerez, with a well-practiced spiel on the story of the company, with some building designed by Gutave Eiffel, and with a wine-tasting session. As I personally find this type of brandy too strong in alcohol, I am not a big fan of sherry but it was nonetheless an amusing trip! With no visible after-effects the next morning, since the audience was as large as usual for Adam’s talk [although I did not cross a machine-learning soul on my 6am run…]

In short, I enjoyed very much AISTATS 2016 and remain deeply impressed by the efficiency of the selection process and the amount of involvement of the actors of this selection, as mentioned earlier on the ‘Og. Kudos!

## slice sampling revisited

Posted in Books, pictures, Statistics with tags , , , , , , , , on April 15, 2016 by xi'an

Thanks to an X validated question, I re-read Radford Neal’s 2003 Slice sampling paper. Which is an Annals of Statistics discussion paper, and rightly so. While I was involved in the editorial processing of this massive paper (!), I had only vague memories left about it. Slice sampling has this appealing feature of being the equivalent of random walk Metropolis-Hastings for Gibbs sampling, without the drawback of setting a scale for the moves.

“These slice sampling methods can adaptively change the scale of changes made, which makes them easier to tune than Metropolis methods and also avoids problems that arise when the appropriate scale of changes varies over the distribution  (…) Slice sampling methods that improve sampling by suppressing random walks can also be constructed.” (p.706)

One major theme in the paper is fighting random walk behaviour, of which Radford is a strong proponent. Even at the present time, I am a bit surprised by this feature as component-wise slice sampling is exhibiting clear features of a random walk, exploring the subgraph of the target by random vertical and horizontal moves. Hence facing the potential drawback of backtracking to previously visited places.

“A Markov chain consisting solely of overrelaxed updates might not be ergodic.” (p.729)

Overrelaxation is presented as a mean to avoid the random walk behaviour by removing rejections. The proposal is actually deterministic projecting the current value to the “other side” of the approximate slice. If it stays within the slice it is accepted. This “reflection principle” [in that it takes the symmetric wrt the centre of the slice] is also connected with antithetic sampling in that it induces rather negative correlation between the successive simulations. The last methodological section covers reflective slice sampling, which appears as a slice version of Hamiltonian Monte Carlo (HMC). Given the difficulty in implementing exact HMC (reflected in the later literature), it is no wonder that Radford proposes an approximation scheme that is valid if somewhat involved.

“We can show invariance of this distribution by showing (…) detailed balance, which for a uniform distribution reduces to showing that the probability density for x¹ to be selected as the next state, given that the current state is x0, is the same as the probability density for x⁰ to be the next state, given that x¹ is the current state, for any states x⁰ and x¹ within [the slice] S.” (p.718)

In direct connection with the X validated question there is a whole section of the paper on implementing single-variable slice sampling that I had completely forgotten, with a collection of practical implementations when the slice

S={x; u < f(x) }

cannot be computed in an exact manner. Like the “stepping out” procedure. The resulting set (interval) where the uniform simulation in x takes place may well miss some connected component(s) of the slice. This quote may sound like a strange argument in that the move may well leave a part of the slice off and still satisfy this condition. Not really since it states that it must hold for any pair of states within S… The very positive side of this section is to allow for slice sampling in cases where the inversion of u < f(x) is intractable. Hence with a strong practical implication. The multivariate extension of the approximation procedure is more (potentially) fraught with danger in that it may fell victim to a curse of dimension, in that the box for the uniform simulation of x may be much too large when compared with the true slice (or slice of the slice). I had more of a memory of the “trail of crumbs” idea, mostly because of the name I am afraid!, which links with delayed rejection, as indicated in the paper, but seems awfully delicate to calibrate.

## at CIRM

Posted in Books, Mountains, Running, Statistics, Travel, University life, Wines with tags , , , , , , , , , on March 1, 2016 by xi'an

Thanks to a very early start from Paris, and despite horrendous traffic jams in Marseilles, I managed to reach CIRM with ten minutes to spare before my course. After my one-hour class, I was suddenly made aware of the (simplistic) idea that the slice sampling uniforms are simply auxiliary, meaning they can be used in many different ways.

I noticed Natesh Pillai just arXived an extension of his earlier Cauchy paper with XL. He proves that the result on the Cauchy distribution of any convex combination of normal ratios still holds when the pair of vectors is distributed from a product of elliptically symmetric functions. Some of Natesh’s remarks reminded me of the 1970 Sankhyã paper by Kelker on spherically symmetric variables. Especially because of Kelker’s characterisation of elliptically symmetric functions as scale mixtures of normals, which makes perfect sense since the scale cancels.

As I skimmed through my slides yesterday, fearing everyone knew about the MCMC basics, I decided to present today the Rao-Blackwellisation slides I gave in Warwick a few months ago.

## pseudo slice sampling

Posted in Books, Statistics, University life with tags , , , , , on November 26, 2015 by xi'an

The workshop in Warwick last week made me aware of (yet) another arXiv posting I had missed: Pseudo-marginal slice sampling by Iain Murray and Matthew Graham. The idea is to mix the pseudo-marginal approach of Andrieu and Roberts (2009) with a noisy slice sampling scheme à la Neal (2003). The auxiliary random variable u used in the (pseudo-marginal) unbiased estimator of the target I(θ), Î(θ,u), and with distribution q(u) is merged with the random variable of interest so that the joint is

Î(θ,u)q(u)/C

and a Metropolis-Hastings proposal on that target simulating from k(θ,θ’)q(u’) [meaning the auxiliary is simulated independently] recovers the pseudo-marginal Metropolis-Hastings ratio

Î(θ’,u‘)k(θ’,θ)/Î(θ,u)k(θ,θ’)

(which is a nice alternative proof that the method works!). The novel idea in the paper is that the proposal on the auxiliary u can be of a different form, while remaining manageable. For instance, as a two-block Gibbs sampler. Or an elliptical slice sampler for the u component. The argument being that an independent update of u may lead the joint chain to get stuck. Among the illustrations in the paper, an Ising model (with no phase transition issue?) and a Gaussian process applied to the Pima Indian data set (despite a recent prohibition!). From the final discussion, I gather that the modification should be applicable to every (?) case when a pseudo-marginal approach is available, since the auxiliary distribution q(u) is treated as a black box. Quite an interesting read and proposal!

## vertical likelihood Monte Carlo integration

Posted in Books, pictures, Running, Statistics, Travel, University life with tags , , , , , , , on April 17, 2015 by xi'an

A few months ago, Nick Polson and James Scott arXived a paper on one of my favourite problems, namely the approximation of normalising constants (and it went way under my radar, as I only became aware of it quite recently!, then it remained in my travel bag for an extra few weeks…). The method for approximating the constant Z draws from an analogy with the energy level sampling methods found in physics, like the Wang-Landau algorithm. The authors rely on a one-dimensional slice sampling representation of the posterior distribution and [main innovation in the paper] add a weight function on the auxiliary uniform. The choice of the weight function links the approach with the dreaded harmonic estimator (!), but also with power-posterior and bridge sampling. The paper recommends a specific weighting function, based on a “score-function heuristic” I do not get. Further, the optimal weight depends on intractable cumulative functions as in nested sampling. It would be fantastic if one could draw directly from the prior distribution of the likelihood function—rather than draw an x [from the prior or from something better, as suggested in our 2009 Biometrika paper] and transform it into L(x)—but as in all existing alternatives this alas is not the case. (Which is why I find the recommendations in the paper for practical implementation rather impractical, since, were the prior cdf of L(X) available, direct simulation of L(X) would be feasible. Maybe not the optimal choice though.)

“What is the distribution of the likelihood ordinates calculated via nested sampling? The answer is surprising: it is essentially the same as the distribution of likelihood ordinates by recommended weight function from Section 4.”

The approach is thus very much related to nested sampling, at least in spirit. As the authors later demonstrate, nested sampling is another case of weighting, Both versions require simulations under truncated likelihood values. Albeit with a possibility of going down [in likelihood values] with the current version. Actually, more weighting could prove [more] efficient as both the original nested and vertical sampling simulate from the prior under the likelihood constraint. Getting away from the prior should help. (I am quite curious to see how the method is received and applied.)