Archive for embarassingly parallel

parallelizing MCMC with random partition trees

Posted in Books, pictures, Statistics, University life with tags big data, embarassingly parallel, huge data, MCMC, Mondrian forests, parallel MCMC, random partition trees, tall data on July 7, 2015 by xi'an

Another arXived paper in the recent series about big or tall data and how to deal with it by MCMC. Which pertains to the embarrassingly parallel category. As in the previously discussed paper, the authors (Xiangyu Wang, Fangjian Guo, Katherine Heller, and David Dunson) chose to break the prior itself into m bits… (An additional point from last week's criticism is that, were an unbiased estimator of each term in the product available in an independent manner, the product of the estimators would be an unbiased estimator of the product.) In this approach, the kernel estimator of Neiswanger et al. is replaced with a random partition tree histogram. Which uses the same block partition across all terms in the product representation of the posterior. And hence ends up with a smaller number of terms in the approximation, since it does not explode with m. (They could have used Mondrian forests as well! However, I think their quantification of the regular kernel method's cost as an O(Tm) approach does not account for Neiswanger et al.'s trick in exploiting the product of kernels…) The so-called tree estimate can be turned into a random forest by repeating the procedure several times and averaging. The simulation comparison runs in favour of the current method when compared with other consensus or non-parametric methods. Except in the final graph (Figure 5), which shows several methods achieving the same prediction accuracy against running time.
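To fix ideas, here is a minimal toy sketch of the aggregation step, not the authors' code: two subposterior samples are turned into block densities over a shared partition (a fixed common grid standing in for a single random partition tree) and the blocks are multiplied and renormalised. Averaging such block products over several independent random partitions would mimic the forest version.

# toy sketch (not the paper's code): aggregate two subposterior samples by
# multiplying block densities computed on a shared partition; a fixed common
# grid stands in for a single random partition tree
set.seed(1)
sub1=rnorm(10^4,mean=.1)                 # pretend subposterior sample 1
sub2=rnorm(10^4,mean=-.1)                # pretend subposterior sample 2
breaks=seq(min(sub1,sub2)-.1,max(sub1,sub2)+.1,length.out=51)
h1=hist(sub1,breaks=breaks,plot=FALSE)$density
h2=hist(sub2,breaks=breaks,plot=FALSE)$density
blockprod=h1*h2                          # product of block densities
blockprod=blockprod/sum(blockprod*diff(breaks))   # renormalise into a density
mids=(breaks[-1]+breaks[-length(breaks)])/2
plot(mids,blockprod,type="s",xlab="theta",ylab="aggregated density")
curve(dnorm(x,0,sqrt(.5)),add=TRUE,col="sienna",lty=3,lwd=2)   # exact product density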
variational consensus Monte Carlo

Posted in Books, Statistics, University life with tags big data, consensus Monte Carlo, embarassingly parallel, large data problems, subsampling, tall data, variational Bayes methods on July 2, 2015 by xi'an

“Unfortunately, the factorization does not make it immediately clear how to aggregate on the level of samples without first having to obtain an estimate of the densities themselves.” (p.2)
The recently arXived variational consensus Monte Carlo is a paper by Maxim Rabinovich, Elaine Angelino, and Michael Jordan that approaches the consensus Monte Carlo principle from a variational perspective. As in the embarrassingly parallel version, the target is split into a product of K terms, each being interpreted as an unnormalised density and being fed to a different parallel processor. The most natural partition is to break the data into K subsamples and to raise the prior to the power 1/K in each term. While this decomposition makes sense from a storage perspective, since each bit corresponds to a different subsample of the data, it raises the question of the statistical pertinence of splitting the prior. My feelings about it are now more lukewarm than when I commented on the embarrassingly parallel version, mainly because the splitting is not invariant under reparameterisation (one gets different targets depending on whether the reparameterisation is done before or after the partition) and hence does not treat the prior as the reference measure it should be. I therefore prefer the version where the same original prior is attached to each part of the partitioned likelihood (and even more the random subsampling approaches discussed in the recent paper of Bardenet, Doucet, and Holmes). Another difficulty with the decomposition is that a product of densities is not a density in most cases (it may even be of infinite mass) and does not offer a natural path to the analysis of samples generated from each term in the product. Nor an explanation as to why those samples should be relevant to construct a sample for the original target.
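For concreteness, the decomposition in question is simply (writing x_k for the k-th data subsample, a notation added here for illustration)

p(\theta \mid x) \,\propto\, \pi(\theta)\prod_{k=1}^K p(x_k \mid \theta) \,=\, \prod_{k=1}^K \left[\pi(\theta)^{1/K}\,p(x_k \mid \theta)\right],

so that each bracketed factor can be simulated from on its own processor as an unnormalised (pseudo-)posterior.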
“The performance of our algorithm depends critically on the choice of aggregation function family.” (p.5)
Since the variational Bayes approach is a common answer to complex product models, Rabinovich et al. explore the use of variational Bayes techniques to build the consensus distribution out of the separate samples. As in Scott et al. and Neiswanger et al., the simulation from the consensus distribution is a transform of simulations from each of the terms in the product, e.g., a weighted average. Which determines the consensus distribution as a member of an aggregation family defined loosely by a Dirac mass. When the transform is a sum of individual terms, variational Bayes solutions get much easier to find and the authors work under this restriction… In the empirical evaluation of this variational Bayes approach against the uniform and Gaussian averaging options of Scott et al., it improves upon those, except in a mixture example with a large enough common variance.
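As a reminder of what the baseline aggregation looks like, here is a minimal sketch of the weighted-average step with precision-proportional weights, in the style of Scott et al.; it is only an illustration, not the variational optimisation of Rabinovich et al.

# minimal illustration of consensus averaging (Scott et al. style weights,
# not the variational scheme of the paper): K subposterior samples are
# combined draw by draw through a precision-weighted average
set.seed(2)
K=4
N=10^4
subs=lapply(1:K,function(k) rnorm(N,mean=rnorm(1,0,.2),sd=1))
w=sapply(subs,function(s) 1/var(s))        # weights proportional to precisions
consensus=Reduce(`+`,Map(`*`,subs,w))/sum(w)
hist(consensus,breaks=50,freq=FALSE,main="consensus draws")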
In fine, despite the relevance of variational Bayes to improve the consensus approximation, I still remain unconvinced about the use of the product of (pseudo-)densities and the subsequent mix of simulations from those components, for the reason mentioned above and also because the tail behaviour of those components is not related to the tail behaviour of the target. Still, this is a working solution to a real problem and as such is a reference for future work.
early rejection MCMC
Posted in Books, Statistics, University life with tags ABC, acceleration of MCMC algorithms, Bayesian Analysis, climate modelling, delayed acceptance, early rejection, embarassingly parallel, prefetching, summary statistics on June 16, 2014 by xi'an

In a (relatively) recent Bayesian Analysis paper on efficient MCMC algorithms for climate models, Antti Solonen, Pirkka Ollinaho, Marko Laine, Heikki Haario, Johanna Tamminen and Heikki Järvinen propose an early rejection scheme to speed up Metropolis-Hastings algorithms. The idea is to consider a posterior distribution (proportional to)

\pi(\theta) \propto \prod_{k=1}^n \ell_k(\theta)

such that all terms in the product are less than one and to compare the uniform u in the acceptance step of the Metropolis-Hastings algorithm to (assuming a symmetric proposal, for simplicity)

\frac{\ell_1(\theta')}{\ell_1(\theta)\cdots\ell_n(\theta)}

then, if u is smaller than the ratio, to

\frac{\ell_1(\theta')\,\ell_2(\theta')}{\ell_1(\theta)\cdots\ell_n(\theta)}

and so on, until the new value has been rejected or all terms have been evaluated. The scheme obviously stops earlier than the regular Metropolis-Hastings algorithm, at no significant extra cost when the product above does not factor through a sufficient statistic. Solonen et al. suggest ordering the terms so that the computationally simpler ones are computed first. The upper-bound assumption requires finding the maximum of each term in the product, though, which may be costly in its own right for non-standard distributions. With my students Marco Banterle and Clara Grazian, we actually came upon this paper when preparing our delayed acceptance paper as (a) it belongs to the same category of accelerated MCMC methods (delayed acceptance and early rejection are somehow synonymous!) and (b) it mentions the early prefetching papers of Brockwell (2005) and Strid (2009).
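In code, the mechanism is straightforward; the sketch below is a toy illustration (symmetric random-walk proposal and two Gaussian-shaped terms, each rescaled by its maximum so as to be at most one), not the algorithm as implemented by Solonen et al.

# toy illustration of an early-rejection Metropolis step (not Solonen et al.'s
# code): the target factors into terms bounded by one and the uniform is
# compared to partial ratios, so a rejection may occur before all terms are computed
early_reject_step=function(theta,logterms,sd=.5){
  prop=theta+rnorm(1,sd=sd)                     # symmetric proposal
  logu=log(runif(1))
  logden=sum(sapply(logterms,function(f) f(theta)))
  lognum=0
  for (f in logterms){
    lognum=lognum+f(prop)
    if (logu>lognum-logden) return(theta)       # early rejection
  }
  prop                                          # acceptance
}
# two terms, each normalised by its maximum so that it is at most one
logterms=list(function(x) dnorm(x,0,2,log=TRUE)-dnorm(0,0,2,log=TRUE),
              function(x) dnorm(x,1,1,log=TRUE)-dnorm(1,1,1,log=TRUE))
chain=numeric(10^4)
for (t in 2:10^4) chain[t]=early_reject_step(chain[t-1],logterms)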
“The acceptance probability in ABC is commonly very low, and many proposals are rejected, and ER can potentially help to detect the rejections sooner.”
In the conclusion, Solonen et al. point out a possible link with ABC but, apart from the general idea of rejecting earlier by looking at a subsample or at a proxy simulation of a summary statistic, which is also the idea at the core of Dennis Prangle's lazy ABC, there is no obvious impact on a likelihood-free method like ABC.
[more] parallel MCMC
Posted in Books, Mountains with tags Banff, batch sampling, Chamonix-Mont-Blanc, Duke University, embarassingly parallel, Markov chain Monte Carlo, partition, partitioned sampling, Rao-Blackwellisation, split chain, Voronoi tesselation on April 3, 2014 by xi'an

Scott Schmidler and his Ph.D. student Douglas VanDerwerken have arXived a paper on parallel MCMC the very day I left for Chamonix, prior to MCMSki IV, so it is no wonder I missed it at the time. This work is somewhat in the spirit of the parallel papers of Scott et al. (consensus Bayes), Neiswanger et al. (embarrassingly parallel MCMC), Wang and Dunson (Weierstrassed MCMC), and even White et al. (parallel ABC), namely that the computation of the likelihood can be broken into batches and MCMC run over those batches independently. In their short survey of previous works on parallelization, VanDerwerken and Schmidler overlooked our neat (!) JCGS Rao-Blackwellisation with Pierre Jacob and Murray Smith, maybe because it sounds more like post-processing than genuine parallelization (in that it does not speed up the convergence of the chain but rather improves the Monte Carlo usages one can make of this chain), maybe because they did not know of it.
“This approach has two shortcomings: first, it requires a number of independent simulations, and thus processors, equal to the size of the partition; this may grow exponentially in dim(Θ). Second, the rejection often needed for the restriction doesn’t permit easy evaluation of transition kernel densities, required below. In addition, estimating the relative weights wi with which they should be combined requires care.” (p.3)
The idea of the authors is to replace an exploration of the whole space operated via a single Markov chain (or by parallel chains acting independently, which all have to “converge”) with parallel and independent explorations of parts of the space by separate Markov chains. “Small is beautiful”: it takes a shorter while to explore each set of the partition, hence to converge, and, more importantly, each chain can work in parallel to the others. More specifically, given a partition of the space into sets Ai with posterior weights wi, parallel chains are associated with targets equal to the original target restricted to those Ai‘s. This is therefore an MCMC version of partitioned sampling. With regard to the shortcomings listed in the quote above, the authors consider that there does not need to be a bijection between the partition sets and the chains, in that a chain can move across partitions and thus contribute to several integral evaluations simultaneously. I am a bit worried about this argument since it amounts to getting a random number of simulations within each partition set Ai. In my (maybe biased) perception of partitioned sampling, this sounds somewhat counter-productive, as it increases the variance of the overall estimator. (Of course, not restricting a chain to a given partition set Ai has the appeal of avoiding a possibly massive number of rejection steps. It is however unclear (a) whether or not it impacts ergodicity (it all depends on the way the chain is constructed, i.e. against which target(s)…), as it could lead to an over-representation of some boundaries, and (b) whether or not it improves the overall convergence properties of the chain(s).)
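A bare-bones illustration of the recombination step (a toy version with a one-dimensional Gaussian target, a two-set partition, and partition weights taken as known, whereas they must be estimated in the paper):

# toy illustration of partitioned MCMC (not the paper's code): two chains
# target the same density restricted to A1=(-Inf,0] and A2=(0,Inf), and their
# within-set averages are recombined with the partition weights
restricted_mh=function(n,lower,upper,logtarget,start,sd=1){
  x=numeric(n)
  x[1]=start
  for (t in 2:n){
    prop=x[t-1]+rnorm(1,sd=sd)
    inA=(prop>lower)&&(prop<=upper)              # restriction to the set
    if (inA&&(log(runif(1))<logtarget(prop)-logtarget(x[t-1]))){
      x[t]=prop
    } else {
      x[t]=x[t-1]
    }
  }
  x
}
logtarget=function(x) dnorm(x,1,2,log=TRUE)       # toy target: mean 1, sd 2
ch1=restricted_mh(10^4,-Inf,0,logtarget,start=-1)
ch2=restricted_mh(10^4,0,Inf,logtarget,start=1)
w=c(pnorm(0,1,2),1-pnorm(0,1,2))                  # weights, known in this toy case
w[1]*mean(ch1)+w[2]*mean(ch2)                     # recombined estimate of E[X]=1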
“The approach presented here represents a solution to this problem which can completely remove the waiting times for crossing between modes, leaving only the relatively short within-mode equilibration times.” (p.4)
A more delicate issue with the partitioned MCMC approach (in my opinion!) stands with the partitioning. Indeed, in a complex and high-dimensional model, the construction of the appropriate partition is a challenge in itself as we often have no prior idea where the modal areas are. Waiting for a correct exploration of the modes is indeed faster than waiting for crossing between modes, provided all modes are represented and the chain for each partition set Ai has enough energy to explore this set. It actually sounds (slightly?) unlikely that a target with huge gaps between modes will see a considerable improvement from the partitioned version when the partition sets Ai are selected on the go, because some of the boundaries between the partition sets may be hard to reach with an off-the-shelf proposal. (Obviously, the second part of the method, on the adaptive construction of partitions, is still in the writing and I am looking forward to its arXival!)
Furthermore, as noted by Pierre Jacob (of Statisfaction fame!), the adaptive construction of the partition has a lot in common with Wang-Landau schemes. Whose goal is to produce a flat histogram proposal from the current exploration of the state space. Connections with Atchadé’s and Liu’s (2010, Statistica Sinica) extension of the original Wang-Landau algorithm could have been spelled out. Esp. as the Voronoï tessellation construct seems quite innovative in this respect.
where did the normalising constants go?! [part 2]
Posted in R, Statistics, Travel with tags beta distribution, big data, consensus, embarassingly parallel, importance sampling, normalising constant, parallel processing, R on March 12, 2014 by xi'an

Coming (swiftly and smoothly) back home after this wonderful and intense week in Banff, I hugged my loved ones, quickly unpacked, ran a washing machine, and then sat down to check where and how my reasoning was wrong. To start with, I experimented with a toy example in R:
# true target is (x^.7(1-x)^.3) (x^1.3 (1-x)^1.7)
# ie a Beta(3,3) distribution
# samples from partial posteriors
N=10^5
sam1=rbeta(N,1.7,1.3)
sam2=rbeta(N,2.3,2.7)
# first version: product of density estimates
dens1=density(sam1,from=0,to=1)
dens2=density(sam2,from=0,to=1)
prod=dens1$y*dens2$y
# normalising by hand
prod=prod*length(dens1$x)/sum(prod)
plot(dens1$x,prod,type="l",col="steelblue",lwd=2)
curve(dbeta(x,3,3),add=TRUE,col="sienna",lty=3,lwd=2)
# second version: F-S & P's yin+yang sampling
# with weights proportional to the other posterior
subsam1=sam1[sample(1:N,N,prob=dbeta(sam1,2.3,2.7),rep=T)]
plot(density(subsam1,from=0,to=1),col="steelblue",lwd=2)
curve(dbeta(x,3,3),add=T,col="sienna",lty=3,lwd=2)
subsam2=sam2[sample(1:N,N,prob=dbeta(sam2,1.7,1.3),rep=T)]
plot(density(subsam2,from=0,to=1),col="steelblue",lwd=2)
curve(dbeta(x,3,3),add=T,col="sienna",lty=3,lwd=2)
and (of course!) it produced the perfect fits reproduced below. Writing the R code acted as a developing bath as it showed why we could do without the constants!
“Of course”, because the various derivations in the above R code are all clearly independent from the normalising constant: (i) when considering a product of kernel density estimators, as in the first version, this is an approximation of

\prod_{i=1}^k \frac{m_i(\theta)}{Z_i}

(with Z_i the normalising constant of the i-th partial posterior m_i) as well as of

\prod_{i=1}^k m_i(\theta)

since the constant does not matter. (ii) When considering a sample from m_i and weighting it by the product of the remaining true or estimated m_j‘s, this is a sampling importance resampling simulation from the density proportional to the product and hence, once again, the constants do not matter. At last, (iii) when mixing the two subsamples, since they both are distributed from the product density, the constants do not matter. As I slowly realised when running this morning (trail-running, not code-running!, for the very first time in ten days!), the straight-from-the-box importance sampling version on the mixed samples I considered yesterday (to the point of wondering out loud where the constants had gone) is never implemented in the cited papers. Hence, the fact that

\pi(\theta) \propto \prod_{i=1}^k m_i(\theta)

is enough to justify handling the target directly as the product of the partial marginals. End of the mystery. Anticlimactic end, sorry…
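As a last sanity check, building on the R snippet above (an addition for illustration, not part of the original experiment): since subsam1 and subsam2 are both approximately distributed from the Beta(3,3) product density, so is their 50/50 mixture, with no normalising constant ever entering the picture.

# mixing the two resampled subsamples: both target the Beta(3,3) product,
# hence so does their mixture (constants never needed)
mixsam=c(subsam1,subsam2)
plot(density(mixsam,from=0,to=1),col="steelblue",lwd=2)
curve(dbeta(x,3,3),add=TRUE,col="sienna",lty=3,lwd=2)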