Michalis Titsias and Christopher Yau just arXived a paper entitled The Hamming ball sampler, aimed at large and complex discrete latent variable models. The completion method is named after Richard Hamming, who is associated with error-correcting codes (reminding me of one of the Masters courses I took on coding, 30 years ago…), because it uses the Hamming distance in a discrete version of the slice sampler. One of the reasons for this proposal is that conditioning upon the auxiliary slice variable allows for the derivation of normalisation constants otherwise unavailable. The method still needs some calibration in the choice of blocks that partition the auxiliary variable and in the size of the ball. One of the examples assessed in the paper is a variable selection problem with 1200 covariates, out of which only 2 are relevant, while another example deals with a factorial HMM involving 10 hidden chains. Since the paper compares each example with the corresponding block Gibbs sampling solution, this Gibbs sampling version cannot be intractable. It would be interesting to see a case where the alternative is not available…
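For intuition, here is a minimal sketch of such a sampler on binary vectors (a toy construction of my own, not the authors' implementation): the auxiliary variable u is drawn uniformly in the Hamming ball around the current state, and the new state is then drawn from the target restricted to the ball around u, whose normalisation only requires summing over the (small) ball.

```python
import itertools
import math
import random

def hamming_ball(x, m):
    """All binary vectors within Hamming distance m of the tuple x."""
    L = len(x)
    ball = []
    for r in range(m + 1):
        for idx in itertools.combinations(range(L), r):
            y = list(x)
            for i in idx:
                y[i] = 1 - y[i]
            ball.append(tuple(y))
    return ball

def hamming_ball_sampler(log_target, x0, m=1, iters=1000, rng=random):
    """Slice-sampler-style moves: u uniform in B(x,m), then x' drawn from the
    target restricted to B(u,m), normalised by summing over the ball."""
    x = tuple(x0)
    chain = [x]
    for _ in range(iters):
        u = rng.choice(hamming_ball(x, m))            # auxiliary slice variable
        cand = hamming_ball(u, m)                     # restricted support
        w = [math.exp(log_target(c)) for c in cand]   # tractable normalisation
        r = rng.random() * sum(w)
        acc = 0.0
        for c, wi in zip(cand, w):
            acc += wi
            if acc >= r:
                x = c
                break
        chain.append(x)
    return chain
```

With a radius-1 ball on a length-L vector, each move only evaluates L+1 candidate states, which is where the otherwise unavailable normalisation constant becomes computable.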
Archive for auxiliary variable
Second day at the Indo-French Centre for Applied Mathematics and the workshop. Maybe not the most exciting day in terms of talks (as I missed the first two plenary sessions by (a) oversleeping and (b) running across the campus!). However I had a neat talk with another conference participant that led to [what I think are] interesting questions… (And a very good meal in a local restaurant as the guest house had not booked me for dinner!)
To wit: given a target like

π(λ) ∝ ∏i (1 − exp{−λyi})   (*)

the simulation of λ can be demarginalised into the simulation of

π(λ,z) ∝ ∏i λ exp{−λzi} I(0 < zi < yi)

where z is a latent (and artificial) variable. This means a Gibbs sampler simulating λ given z and z given λ can produce an outcome from the target (*). Interestingly, another completion is to consider that the zi's are U(0,yi) and to see the quantity

∏i λ exp{−λzi}
as an unbiased estimator of the target (up to a multiplicative constant). What's quite intriguing is that the quantity remains the same, but with different motivations: (a) demarginalisation versus unbiasedness and (b) zi ∼ Exp(λ) versus zi ∼ U(0,yi). The stationary distribution is the same, as shown by the graph below, and the core distributions are [formally] the same… but the reasoning deeply differs.
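Both readings can be checked numerically. Here is a minimal sketch, assuming (as the completions suggest) a target proportional to ∏i (1 − exp{−λyi}), with synthetic yi's and an Exp(1) prior on λ, both assumptions of my own:

```python
import math
import random

rng = random.Random(42)
y = [0.5, 1.2, 2.0, 0.8, 1.5]   # synthetic observations (made up for illustration)
a = 1.0                          # rate of an Exp(1) prior on lambda (my assumption)

def rtrunc_exp(lam, ymax):
    """Draw z ~ Exp(lam) truncated to (0, ymax), by inversion."""
    u = rng.random()
    return -math.log(1.0 - u * (1.0 - math.exp(-lam * ymax))) / lam

# Gibbs sampler on the completed target:
# z_i | lam ~ truncated Exp(lam);  lam | z ~ Gamma(n + 1, a + sum z_i)
lam, chain = 1.0, []
for _ in range(5000):
    z = [rtrunc_exp(lam, yi) for yi in y]
    lam = rng.gammavariate(len(y) + 1, 1.0 / (a + sum(z)))
    chain.append(lam)

# The other reading: with z_i ~ U(0, y_i), the product of lam*exp(-lam*z_i)
# is an unbiased estimator of the target, up to the constant prod_i 1/y_i.
lam0 = 1.3
est = sum(math.prod(lam0 * math.exp(-lam0 * rng.uniform(0, yi)) for yi in y)
          for _ in range(50000)) / 50000
exact = math.prod((1.0 - math.exp(-lam0 * yi)) / yi for yi in y)
```

The same quantity ∏i λ exp{−λzi} drives both halves: as a conditional density inside the Gibbs steps, and as an (unnormalised) unbiased estimator in the final Monte Carlo check.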
Obviously, since unbiased estimators of the likelihood can be justified by auxiliary variable arguments, this is not in fine a big surprise. Still, I had not thought of the analogy between demarginalisation and unbiased likelihood estimation previously.
Vinayak Rao, Lizhen Lin and David Dunson just arXived a paper which proposes a new technique to handle intractable normalising constants. Its exact title is Data augmentation for models based on rejection sampling. (A paper that I read in the morning plane to B'ham, since this is one of my weeks in Warwick.) The central idea therein is that, if the sample density (aka likelihood) satisfies

ℓ(x|θ) = f(x|θ) / p(θ),   with f(x|θ) ≤ M q(x|θ),

where all terms but p are known in closed form, then completion by the rejected values of a hypothetical accept-reject algorithm (hypothetical in the sense that the data need not have been produced by an accept-reject scheme, only the above domination condition needs to hold) allows for a data augmentation scheme, without requiring the missing normalising constant, since the completed likelihood

q(x|θ) f(x|θ)/{M q(x|θ)} ∏j q(yj|θ) [1 − f(yj|θ)/{M q(yj|θ)}]

is a closed-form, if not necessarily congenial, function.
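To make the completion concrete, here is a minimal sketch in a toy setting (uniform proposal, Gaussian-kernel unnormalised density, and function names all of my own choosing): the rejected values are imputed by running the accept-reject mechanics forward, and the completed likelihood is then a closed-form expression with no normalising constant in sight. (The paper's actual conditional imputation of the rejections is more involved than this forward pass.)

```python
import math
import random

rng = random.Random(1)

# Toy setting (my own, not the paper's example): an unnormalised density f
# on (0,1), dominated by M*q with q the Uniform(0,1) proposal density.
def f_unnorm(x, theta):
    return math.exp(-theta * x * x)          # unnormalised density f(x|theta)

def q_pdf(x):
    return 1.0                               # Uniform(0,1) proposal density

M = 1.0                                      # f <= M*q on (0,1) for theta >= 0

def impute_rejections(theta):
    """Run the hypothetical accept-reject scheme until acceptance and return
    the rejected proposals (forward, unconditional imputation)."""
    rejected = []
    while True:
        yj = rng.random()                    # y ~ q
        if rng.random() < f_unnorm(yj, theta) / (M * q_pdf(yj)):
            return rejected
        rejected.append(yj)

def completed_loglik(x, rejected, theta):
    """Log of q(x) f(x)/(M q(x)) * prod_j q(y_j)[1 - f(y_j)/(M q(y_j))]:
    closed form, never touching the normalising constant."""
    ll = math.log(q_pdf(x) * f_unnorm(x, theta) / (M * q_pdf(x)))
    for yj in rejected:
        ll += math.log(q_pdf(yj) * (1.0 - f_unnorm(yj, theta) / (M * q_pdf(yj))))
    return ll
```

Each rejection contributes a factor in (0,1), so the completed log-likelihood is always dominated by the accepted term, and the whole expression is available in closed form whatever the size of the rejected set.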
Now this is quite a different use of the “rejected values” from the accept-reject algorithm when compared with our 1996 Biometrika paper on the Rao-Blackwellisation of accept-reject schemes (which could still have been mentioned there, or Section 4.2 of Monte Carlo Statistical Methods, rather than re-deriving the joint density of the augmented “accepted+rejected” sample).
It is a neat idea in that it completely bypasses the approximation of the normalising constant, and avoids the somewhat delicate tuning of the auxiliary solution of Møller et al. (2006). The difficulty with this algorithm is however in finding an upper bound M on the unnormalised density f that is
- in closed form;
- with a manageable and tight enough “constant” M;
- compatible with running a posterior simulation conditional on the added rejections.
The paper seems to assume further that the bound M is independent of the current parameter value θ, at least as suggested by the notation (and Theorem 2), but this is not at all necessary for the validation of the formal algorithm. Such a constraint would pull M higher, hence reducing the efficiency of the method. Actually, the matrix Langevin distribution considered in the first example involves a bound that depends on the parameter κ.
The paper includes a result (Theorem 2) on uniform ergodicity that relies on heavy assumptions on the proposal distribution, including the rather surprising one that the probability of rejection is bounded from below, i.e. calling for a less efficient proposal. Now it seems to me that a uniform ergodicity result holds as well when the probability of acceptance is bounded from below since, then, the event where no rejection occurs constitutes an atom from the augmented Markov chain viewpoint. A renewal therefore occurs each time the rejected variable set ϒ is empty, and ergodicity ensues (Robert, 1995, Statistical Science).
Note also that, despite the opposition raised by the authors, the method per se does constitute a pseudo-marginal technique à la Andrieu-Roberts (2009) since the independent completion by the (pseudo) rejected variables produces an unbiased estimator of the likelihood. It would thus be of interest to see how the recent evaluation tools of Andrieu and Vihola can assess the loss in efficiency induced by this estimation of the likelihood.
Maybe some further experimental evidence tomorrow…
And here is yet another arXived paper using a decomposition of the posterior distribution as a product of terms to run faster, better and higher MCMC algorithms! This one is by Douglas Maclaurin and Ryan Adams: “Firefly Monte Carlo: Exact MCMC with Subsets of Data”. (While a swarm of fireflies makes sense to explain the name, I may miss some cultural subliminal meaning in the title, as Firefly and Monte Carlo seem to be places in Las Vegas (?) and car brands, while Firefly is also a TV series, a clothes brand, and maybe other things…)
“The evolution of the chain evokes an image of fireflies, as the individual data blink on and out due to updates of the zn.”
The fundamental assumption of Maclaurin’s and Adams’ approach is that each term in the likelihood (expressed as a product) can be bounded from below by a cheaper function. This lower bound is used to create a Bernoulli auxiliary variable that switches the corresponding datum “off” with probability equal to the ratio of the lower bound to the likelihood term, an auxiliary variable that helps to reduce the number of evaluations of the original likelihood terms. Obviously, there is a gain only if (a) the lower bound is close or tight enough and (b) simulating the auxiliary variables is cheap enough.
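The identity behind this construction can be spelled out on a single likelihood term: write Ln(θ) = Bn(θ) + (Ln(θ) − Bn(θ)) and attach a Bernoulli variable zn to the two summands; marginalising zn recovers Ln exactly, while only the “bright” data (zn = 1) require evaluating the expensive term. A toy numerical check, with a logistic term and a crude constant lower bound of my own, not the paper's tighter bound:

```python
import math

def logistic_lik(theta, x, t):
    """Single-datum logistic likelihood term L_n(theta), label t in {-1, +1}."""
    return 1.0 / (1.0 + math.exp(-t * theta * x))

def lower_bound(theta, x, t, theta_max=3.0):
    """A crude constant lower bound B_n <= L_n, valid whenever |theta| <= theta_max.
    (My own choice for illustration; tighter parametric bounds do much better.)"""
    return 1.0 / (1.0 + math.exp(abs(x) * theta_max))

def augmented_term(theta, x, t, z):
    """Per-datum augmented term B_n^(1-z) (L_n - B_n)^z: the 'dim' branch (z=0)
    never touches the full likelihood term."""
    if z == 0:
        return lower_bound(theta, x, t)       # cheap branch
    return logistic_lik(theta, x, t) - lower_bound(theta, x, t)

# Summing the Bernoulli z over {0,1} recovers the exact likelihood term:
theta, x, t = 0.7, 1.4, +1
marginal = augmented_term(theta, x, t, 0) + augmented_term(theta, x, t, 1)
assert abs(marginal - logistic_lik(theta, x, t)) < 1e-12
```

Here P(zn = 0) = Bn(θ)/Ln(θ), so the tighter the bound, the larger the fraction of data that stays dim, and untouched, at each iteration.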
About (a), the paper gives the example of logistic regression, with a case of 98% tightness. How generic is that, and how can those bounds be derived in a cheap or automated manner? If one needs to run a variational Bayes approximation first, the gain in efficiency is unlikely to hold. About (b), I do not fully get it: if generating zn requires the evaluation of the original likelihood, we lose the entire appeal of the method. Admittedly, I can see the point in changing a very small portion α of the zn‘s between moves on the parameter θ, since the number of likelihood evaluations is then the same portion α of the total number of terms N. But decreasing the portion α also reduces the mixing efficiency of the algorithm. In the efficient ways of updating the auxiliary brightness variables proposed in the paper, I get the idea of making a proposal first before eventually computing the true probability of a Bernoulli. A proposal making use of the previous value of the probability (i.e., for the previous value of the parameter θ) could also reduce the number of evaluations of likelihood terms. However, using a “cached” version of the likelihood is only relevant within the same simulation step, since a change in θ requires recomputing the likelihood.
“In each experiment we compared FlyMC, with two choices of bound selection, to regular full-posterior MCMC. We looked at the average number of likelihoods queried at each iteration and the number of effective samples generated per iteration, accounting for autocorrelation.”
This comparison does not seem adequate to me: by construction, the algorithm in the paper reduces the number of likelihood evaluations, so this is not a proper comparative instrument. And the effective sample size is a transform of the correlation, not an indicator of convergence. For instance, if the zn‘s hardly changed between iterations, so that the overall sampler was definitely far from converging, we would still get θ’s simulated from almost the same distribution, hence looking uncorrelated. In other words, if the joint chain in (θ,zn) does not converge, it is harder to establish that the subchain in θ converges at all. Indeed, in this logistic example, where the computation of the likelihood is not a massive constraint, I am surprised there is any possibility of a huge gain in using the method, unless the lower bound is essentially the likelihood, which is actually the case for logistic regression models. Another point made by Dan Simpson is that the whole dataset needs to remain on hold, full-time, which may be a challenge to the computer memory. And stops short of providing really Big Data solutions.
The first day at JSM is always a bit sluggish, as people slowly drip in and get their bearings. As last year in Washington D.C., the meeting takes place in a huge conference centre and thus there is no feeling of overcrowding [so far]. It may also be that the peripheral and foreign location of the meeting put some regular attendees off (not to mention the expensive living costs!).
Nonetheless, the Sunday afternoon sessions started with a highly interesting How Fast Can We Compute? How Fast Will We Compute? session organised by Mike West and featuring Steve Scott, Marc Suchard and Qanli Wang. The topic was parallel processing, either via multiple processors or via GPUs, the latter relating to the exciting talk Chris Holmes gave at the Valencia meeting. Steve showed us some code in order to explain how feasible the jump to parallel programming was (a point demonstrated by Julien Cornebise and Pierre Jacob after they returned from Valencia), while stressing the fact that a lot of the processing in MCMC runs is open to parallelisation. For instance, data augmentation schemes can allocate the missing data in a parallel way in most problems, and the same goes for likelihood computations over independent data. Marc Suchard focussed on GPUs and phylogenetic trees, both of high interest to me!, and he stressed the huge gains (of the order of hundreds in the decrease of computing time) made possible by the exploitation of laptop [MacBook] GPUs. (If I got his example correctly, he seemed to be doing an exact computation of the phylogeny likelihood, not an ABC approximation… which is quite interesting, if potentially killing one of my main areas of research!) Qanli Wang linked both previous talks with the example of mixtures with a huge number of components. Plenty of food for thought.
I completed the afternoon session with the Student Paper Competition: Bayesian Nonparametric and Semiparametric Methods, which was discouragingly empty of participants, with two of the five speakers missing and fewer than twenty people in the room. (I did not get who was ranking the papers in this competition. Not the participants, apparently!)