approximation of Bayes Factors via mixing

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on December 21, 2020 by xi'an

A [new version of a] paper by Chenguang Dai and Jun S. Liu got my attention when it appeared on arXiv yesterday. Due to its title which reminded me of a solution to the normalising constant approximation that we proposed in the 2010 nested sampling evaluation paper we wrote with Nicolas. Recovering bridge sampling—mentioned by Dai and Liu as an alternative to their approach rather than an early version—by a type of Charlie Geyer (1990-1994) trick. (The attached slides are taken from my MCMC graduate course, with a section on the approximation of Bayesian normalising constants I first wrote for a short course at Jim Berger’s 70th anniversary conference, in San Antonio.)

A difference with the current paper is that the authors “form a mixture distribution with an adjustable mixing parameter tuned through the Wang-Landau algorithm.” While we chose it by hand to achieve sampling from both components. The weight is updated by a simple (binary) Wang-Landau version, where the partition is determined by which component is simulated, ie by the mixture indicator auxiliary variable. Towards using both components on an even basis (à la Wang-Landau) and stabilising the resulting evaluation of the normalising constant. More generally, the strategy applies to a sequence of surrogate densities, which are chosen by variational approximations in the paper.

one bridge further

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , on June 30, 2020 by xi'an

Jackie Wong, Jon Forster (Warwick) and Peter Smith have just published a paper in Statistics & Computing on bridge sampling bias and improvement by splitting.

“… known to be asymptotically unbiased, bridge sampling technique produces biased estimates in practical usage for small to moderate sample sizes (…) the estimator yields positive bias that worsens with increasing distance between the two distributions. The second type of bias arises when the approximation density is determined from the posterior samples using the method of moments, resulting in a systematic underestimation of the normalizing constant.”

Recall that bridge sampling is based on a double trick with two samples x and y from two (unnormalised) densities f and g that are interverted in a ratio

$m \sum_{i=1}^n g(x_i)\omega(x_i) \Big/ n \sum_{i=1}^m f(y_i)\omega(y_i)$

of unbiased estimators of the inverse normalising constants. Hence biased. The more the less similar these two densities are. Special cases for ω include importance sampling [unbiased] and reciprocal importance sampling. Since the optimal version of the bridge weight ω is the inverse of the mixture of f and g, it makes me wonder at the performance of using both samples top and bottom, since as an aggregated sample, they also come from the mixture, as in Owen & Zhou (2000) multiple importance sampler. However, a quick try with a positive Normal versus an Exponential with rate 2 does not show an improvement in using both samples top and bottom (even when using the perfectly normalised versions)

morc=(sum(f(y)/(nx*dnorm(y)+ny*dexp(y,2)))+
sum(f(x)/(nx*dnorm(x)+ny*dexp(x,2))))/(
sum(g(x)/(nx*dnorm(x)+ny*dexp(x,2)))+
sum(g(y)/(nx*dnorm(y)+ny*dexp(y,2))))


at least in terms of bias… Surprisingly (!) the bias almost vanishes for very different samples sizes either in favour of f or in favour of g. This may be a form of genuine defensive sampling, who knows?! At the very least, this ensures a finite variance for all weights. (The splitting approach introduced in the paper is a natural solution to create independence between the first sample and the second density. This reminded me of our two parallel chains in AMIS.)

19 dubious ways to compute the marginal likelihood

Posted in Books, Statistics with tags , , , , , , , , , , on December 11, 2018 by xi'an

A recent arXival on nineteen different [and not necessarily dubious!] ways to approximate the marginal likelihood of a given topology of a philogeny tree that reminded me of our San Antonio survey with Jean-Michel Marin. This includes a version of the Laplace approximation called Laplus (!), accounting for the fact that branch lengths on the tree are positive but may have a MAP at zero. Using a Beta, Gamma, or log-Normal distribution instead of a Normal. For importance sampling, the proposals are derived from either the Laplus (!) approximate distributions or from the variational Bayes solution (based on an Normal product). Harmonic means are still used here despite the obvious danger, along with a defensive version that mixes prior and posterior. Naïve Monte Carlo means simulating from the prior, while bridge sampling seems to use samples from prior and posterior distributions. Path and modified path sampling versions are those proposed in 2008 by Nial Friel and Tony Pettitt (QUT). Stepping stone sampling appears like another version of path sampling, also based on a telescopic product of ratios of normalising constants, the generalised version relying on a normalising reference distribution that need be calibrated. CPO and PPD in the above table are two versions based on posterior predictive density estimates.

When running the comparison between so many contenders, the ground truth is selected as the values returned by MrBayes in a massive MCMC experiment amounting to 7.5 billions generations. For five different datasets. The above picture describes mean square errors for the probabilities of split, over ten replicates [when meaningful], the worst case being naïve Monte Carlo, with nested sampling and harmonic mean solutions close by. Similar assessments proceed from a comparison of Kullback-Leibler divergences. With the (predicatble?) note that “the methods do a better job approximating the marginal likelihood of more probable trees than less probable trees”. And massive variability for the poorest methods:

The comparison above does not account for time and since some methods are deterministic (and fast) there is little to do about this. The stepping steps solutions are very costly, while on the middle range bridge sampling outdoes path sampling. The assessment of nested sampling found in the conclusion is that it “would appear to be an unwise choice for estimating the marginal likelihoods of topologies, as it produces poor approximate posteriors” (p.12). Concluding at the Gamma Laplus approximation being the winner across all categories! (There is no ABC solution studied in this paper as the model likelihood can be computed in this setup, contrary to our own setting.)

bridgesampling [R package]

Posted in pictures, R, Statistics, University life with tags , , , , , , , , , on November 9, 2017 by xi'an

Quentin F. Gronau, Henrik Singmann and Eric-Jan Wagenmakers have arXived a detailed documentation about their bridgesampling R package. (No wonder that researchers from Amsterdam favour bridge sampling!) [The package relates to a [52 pages] tutorial on bridge sampling by Gronau et al. that I will hopefully comment soon.] The bridge sampling methodology for marginal likelihood approximation requires two Monte Carlo samples for a ratio of two integrals. A nice twist in this approach is to use a dummy integral that is already available, with respect to a probability density that is an approximation to the exact posterior. This means avoiding the difficulties with bridge sampling of bridging two different parameter spaces, in possibly different dimensions, with potentially very little overlap between the posterior distributions. The substitute probability density is chosen as Normal or warped Normal, rather than a t which would provide more stability in my opinion. The bridgesampling package also provides an error evaluation for the approximation, although based on spectral estimates derived from the coda package. The remainder of the document exhibits how the package can be used in conjunction with either JAGS or Stan. And concludes with the following words of caution:

“It should also be kept in mind that there may be cases in which the bridge sampling procedure may not be the ideal choice for conducting Bayesian model comparisons. For instance, when the models are nested it might be faster and easier to use the Savage-Dickey density ratio (Dickey and Lientz 1970; Wagenmakers et al. 2010). Another example is when the comparison of interest concerns a very large model space, and a separate bridge sampling based computation of marginal likelihoods may take too much time. In this scenario, Reversible Jump MCMC (Green 1995) may be more appropriate.”

warp-U bridge sampling

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , , on October 12, 2016 by xi'an

[I wrote this set of comments right after MCqMC 2016 on a preliminary version of the paper so mileage may vary in terms of the adequation to the current version!]

In warp-U bridge sampling, newly arXived and first presented at MCqMC 16, Xiao-Li Meng continues (in collaboration with Lahzi Wang) his exploration of bridge sampling techniques towards improving the estimation of normalising constants and ratios thereof. The bridge sampling estimator of Meng and Wong (1996) is an harmonic mean importance sampler that requires iterations as it depends on the ratio of interest. Given that the normalising constant of a density does not depend on the chosen parameterisation in the sense that the Jacobian transform preserves this constant, a degree of freedom is in the choice of the parameterisation. This is the idea behind warp transformations. The initial version of Meng and Schilling (2002) used location-scale transforms, while the warp-U solution goes for a multiple location-scale transform that can be seen as based on a location-scale mixture representation of the target. With K components. This approach can also be seen as a sort of artificial reversible jump algorithm when one model is fully known. A strategy Nicolas and I also proposed in our nested sampling Biometrika paper.

Once such a mixture approximation is obtained. each and every component of the mixture can be turned into the standard version of the location-scale family by the appropriate location-scale transform. Since the component index k is unknown for a given X, they call this transform a random transform, which I find somewhat more confusing that helpful. The conditional distribution of the index given the observable x is well-known for mixtures and it is used here to weight the component-wise location-scale transforms of the original distribution p into something that looks rather similar to the standard version of the location-scale family. If no mode has been forgotten by the mixture. The simulations from the original p are then rescaled by one of those transforms, which index k is picked according to the conditional distribution. As explained later to me by XL, the random[ness] in the picture is due to the inclusion of a random ± sign. Still, in the notation introduced in (13), I do not get how the distribution Þ [sorry for using different symbols, I cannot render a tilde on a p] is defined since both ψ and W are random. Is it the marginal? In which case it would read as a weighted average of rescaled versions of p. I have the same problem with Theorem 1 in that I do not understand how one equates Þ with the joint distribution.

Equation (21) is much more illuminating (I find) than the previous explanation in that it exposes the fact that the principle is one of aiming at a new distribution for both the target and the importance function, with hopes that the fit will get better. It could have been better to avoid the notion of random transform, then, but this is mostly a matter of conveying the notion.

On more specifics points (or minutiae), the unboundedness of the likelihood is rarely if ever a problem when using EM. An alternative to the multiple start EM proposal would then be to get sequential and estimate the mixture in a sequential manner, only adding a component when it seems worth it. See eg Chopin and Pelgrin (2004) and Chopin (2007). This could also help with the bias mentioned therein since only a (tiny?) fraction of the data would be used. And the number of components K has an impact on the accuracy of the approximation, as in not missing a mode, and on the computing time. However my suggestion would be to avoid estimating K as this must be immensely costly.

Section 6 obviously relates to my folded Markov interests. If I understand correctly, the paper argues that the transformed density Þ does not need to be computed when considering the folding-move-unfolding step as a single step rather than three steps. I fear the description between equations (30) and (31) is missing the move step over the transformed space. Also on a personal basis I still do not see how to add this approach to our folding methodology, even though the different transforms act as as many replicas of the original Markov chain.