## Bayes factors revisited

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , on March 22, 2021 by xi'an

“Bayes factor analyses are highly sensitive to and crucially depend on prior assumptions about model parameters (…) Note that the dependency of Bayes factors on the prior goes beyond the dependency of the posterior on the prior. Importantly, for most interesting problems and models, Bayes factors cannot be computed analytically.”

Daniel J. Schad, Bruno Nicenboim, Paul-Christian Bürkner, Michael Betancourt, Shravan Vasishth have just arXived a massive document on the Bayes factor, worrying about the computation of this common tool, but also at the variability of decisions based on Bayes factors, e.g., stressing correctly that

“…we should not confuse inferences with decisions. Bayes factors provide inference on hypotheses. However, to obtain discrete decisions (…) from continuous inferences in a principled way requires utility functions. Common decision heuristics (e.g., using Bayes factor larger than 10 as a discovery threshold) do not provide a principled way to perform decisions, but are merely heuristic conventions.”

The text is long and at times meandering (at least in the sections I read), while trying a wee bit too hard to bring up the advantages of using Bayes factors versus frequentist or likelihood solutions. (The likelihood ratio being presented as a “frequentist” solution, which I think is an incorrect characterisation.) For instance, the starting point of preferring a model with a higher marginal likelihood is presented as an evidence (oops!) rather than argumented. Since this quantity depends on both the prior and the likelihood, it being high or low is impacted by both. One could then argue that using its numerical value as an absolute criterion amounts to selecting the prior a posteriori as much as checking the fit to the data! The paper also resorts to the Occam’s razor argument, which I wish we could omit, as it is a vague criterion, wide open to misappropriation. It is also qualitative, rather than quantitative, hence does not help in calibrating the Bayes factor.

Concerning the actual computation of the Bayes factor, an issue that has always been a concern and a research topic for me, the authors consider only two “very common methods”, the Savage–Dickey density ratio method and bridge sampling. We discussed the shortcomings of the Savage–Dickey density ratio method with Jean-Michel Marin about ten years ago. And while bridge sampling is an efficient approach when comparing models of the same dimension, I have reservations about this efficiency in other settings. Alternative approaches like importance nested sampling, noise contrasting estimation or SMC samplers are often performing quite efficiently as normalising constant approximations. (Not to mention our version of harmonic mean estimator with HPD support.)

Simulation-based inference is based on the notion that simulated data can be produced from the predictive distributions. Reminding me of ABC model choice to some extent. But I am uncertain this approach can be used to calibrate the decision procedure to select the most appropriate model. We thought about using this approach in our testing by mixture paper and it is favouring the more complex of the two models. This seems also to occur for the example behind Figure 5 in the paper.

Two other points: first, the paper does not consider the important issue with improper priors, which are not rigorously compatible with Bayes factors, as I discussed often in the past. And second, Bayes factors are not truly Bayesian decision procedures, since they remove the prior weights on the models, thus the mention of utility functions therein seems inappropriate unless a genuine utility function can be produced.

## approximation of Bayes Factors via mixing

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on December 21, 2020 by xi'an

A [new version of a] paper by Chenguang Dai and Jun S. Liu got my attention when it appeared on arXiv yesterday. Due to its title which reminded me of a solution to the normalising constant approximation that we proposed in the 2010 nested sampling evaluation paper we wrote with Nicolas. Recovering bridge sampling—mentioned by Dai and Liu as an alternative to their approach rather than an early version—by a type of Charlie Geyer (1990-1994) trick. (The attached slides are taken from my MCMC graduate course, with a section on the approximation of Bayesian normalising constants I first wrote for a short course at Jim Berger’s 70th anniversary conference, in San Antonio.)

A difference with the current paper is that the authors “form a mixture distribution with an adjustable mixing parameter tuned through the Wang-Landau algorithm.” While we chose it by hand to achieve sampling from both components. The weight is updated by a simple (binary) Wang-Landau version, where the partition is determined by which component is simulated, ie by the mixture indicator auxiliary variable. Towards using both components on an even basis (à la Wang-Landau) and stabilising the resulting evaluation of the normalising constant. More generally, the strategy applies to a sequence of surrogate densities, which are chosen by variational approximations in the paper.

## one bridge further

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , on June 30, 2020 by xi'an

Jackie Wong, Jon Forster (Warwick) and Peter Smith have just published a paper in Statistics & Computing on bridge sampling bias and improvement by splitting.

“… known to be asymptotically unbiased, bridge sampling technique produces biased estimates in practical usage for small to moderate sample sizes (…) the estimator yields positive bias that worsens with increasing distance between the two distributions. The second type of bias arises when the approximation density is determined from the posterior samples using the method of moments, resulting in a systematic underestimation of the normalizing constant.”

Recall that bridge sampling is based on a double trick with two samples x and y from two (unnormalised) densities f and g that are interverted in a ratio

$m \sum_{i=1}^n g(x_i)\omega(x_i) \Big/ n \sum_{i=1}^m f(y_i)\omega(y_i)$

of unbiased estimators of the inverse normalising constants. Hence biased. The more the less similar these two densities are. Special cases for ω include importance sampling [unbiased] and reciprocal importance sampling. Since the optimal version of the bridge weight ω is the inverse of the mixture of f and g, it makes me wonder at the performance of using both samples top and bottom, since as an aggregated sample, they also come from the mixture, as in Owen & Zhou (2000) multiple importance sampler. However, a quick try with a positive Normal versus an Exponential with rate 2 does not show an improvement in using both samples top and bottom (even when using the perfectly normalised versions)

morc=(sum(f(y)/(nx*dnorm(y)+ny*dexp(y,2)))+
sum(f(x)/(nx*dnorm(x)+ny*dexp(x,2))))/(
sum(g(x)/(nx*dnorm(x)+ny*dexp(x,2)))+
sum(g(y)/(nx*dnorm(y)+ny*dexp(y,2))))


at least in terms of bias… Surprisingly (!) the bias almost vanishes for very different samples sizes either in favour of f or in favour of g. This may be a form of genuine defensive sampling, who knows?! At the very least, this ensures a finite variance for all weights. (The splitting approach introduced in the paper is a natural solution to create independence between the first sample and the second density. This reminded me of our two parallel chains in AMIS.)

## 19 dubious ways to compute the marginal likelihood

Posted in Books, Statistics with tags , , , , , , , , , , on December 11, 2018 by xi'an

A recent arXival on nineteen different [and not necessarily dubious!] ways to approximate the marginal likelihood of a given topology of a philogeny tree that reminded me of our San Antonio survey with Jean-Michel Marin. This includes a version of the Laplace approximation called Laplus (!), accounting for the fact that branch lengths on the tree are positive but may have a MAP at zero. Using a Beta, Gamma, or log-Normal distribution instead of a Normal. For importance sampling, the proposals are derived from either the Laplus (!) approximate distributions or from the variational Bayes solution (based on an Normal product). Harmonic means are still used here despite the obvious danger, along with a defensive version that mixes prior and posterior. Naïve Monte Carlo means simulating from the prior, while bridge sampling seems to use samples from prior and posterior distributions. Path and modified path sampling versions are those proposed in 2008 by Nial Friel and Tony Pettitt (QUT). Stepping stone sampling appears like another version of path sampling, also based on a telescopic product of ratios of normalising constants, the generalised version relying on a normalising reference distribution that need be calibrated. CPO and PPD in the above table are two versions based on posterior predictive density estimates.

When running the comparison between so many contenders, the ground truth is selected as the values returned by MrBayes in a massive MCMC experiment amounting to 7.5 billions generations. For five different datasets. The above picture describes mean square errors for the probabilities of split, over ten replicates [when meaningful], the worst case being naïve Monte Carlo, with nested sampling and harmonic mean solutions close by. Similar assessments proceed from a comparison of Kullback-Leibler divergences. With the (predicatble?) note that “the methods do a better job approximating the marginal likelihood of more probable trees than less probable trees”. And massive variability for the poorest methods:

The comparison above does not account for time and since some methods are deterministic (and fast) there is little to do about this. The stepping steps solutions are very costly, while on the middle range bridge sampling outdoes path sampling. The assessment of nested sampling found in the conclusion is that it “would appear to be an unwise choice for estimating the marginal likelihoods of topologies, as it produces poor approximate posteriors” (p.12). Concluding at the Gamma Laplus approximation being the winner across all categories! (There is no ABC solution studied in this paper as the model likelihood can be computed in this setup, contrary to our own setting.)

## bridgesampling [R package]

Posted in pictures, R, Statistics, University life with tags , , , , , , , , , on November 9, 2017 by xi'an

Quentin F. Gronau, Henrik Singmann and Eric-Jan Wagenmakers have arXived a detailed documentation about their bridgesampling R package. (No wonder that researchers from Amsterdam favour bridge sampling!) [The package relates to a [52 pages] tutorial on bridge sampling by Gronau et al. that I will hopefully comment soon.] The bridge sampling methodology for marginal likelihood approximation requires two Monte Carlo samples for a ratio of two integrals. A nice twist in this approach is to use a dummy integral that is already available, with respect to a probability density that is an approximation to the exact posterior. This means avoiding the difficulties with bridge sampling of bridging two different parameter spaces, in possibly different dimensions, with potentially very little overlap between the posterior distributions. The substitute probability density is chosen as Normal or warped Normal, rather than a t which would provide more stability in my opinion. The bridgesampling package also provides an error evaluation for the approximation, although based on spectral estimates derived from the coda package. The remainder of the document exhibits how the package can be used in conjunction with either JAGS or Stan. And concludes with the following words of caution:

“It should also be kept in mind that there may be cases in which the bridge sampling procedure may not be the ideal choice for conducting Bayesian model comparisons. For instance, when the models are nested it might be faster and easier to use the Savage-Dickey density ratio (Dickey and Lientz 1970; Wagenmakers et al. 2010). Another example is when the comparison of interest concerns a very large model space, and a separate bridge sampling based computation of marginal likelihoods may take too much time. In this scenario, Reversible Jump MCMC (Green 1995) may be more appropriate.”