Archive for path sampling

19 dubious ways to compute the marginal likelihood

Posted in Books, Statistics with tags , , , , , , , , , , on December 11, 2018 by xi'an

A recent arXival on nineteen different [and not necessarily dubious!] ways to approximate the marginal likelihood of a given topology of a philogeny tree that reminded me of our San Antonio survey with Jean-Michel Marin. This includes a version of the Laplace approximation called Laplus (!), accounting for the fact that branch lengths on the tree are positive but may have a MAP at zero. Using a Beta, Gamma, or log-Normal distribution instead of a Normal. For importance sampling, the proposals are derived from either the Laplus (!) approximate distributions or from the variational Bayes solution (based on an Normal product). Harmonic means are still used here despite the obvious danger, along with a defensive version that mixes prior and posterior. Naïve Monte Carlo means simulating from the prior, while bridge sampling seems to use samples from prior and posterior distributions. Path and modified path sampling versions are those proposed in 2008 by Nial Friel and Tony Pettitt (QUT). Stepping stone sampling appears like another version of path sampling, also based on a telescopic product of ratios of normalising constants, the generalised version relying on a normalising reference distribution that need be calibrated. CPO and PPD in the above table are two versions based on posterior predictive density estimates.

When running the comparison between so many contenders, the ground truth is selected as the values returned by MrBayes in a massive MCMC experiment amounting to 7.5 billions generations. For five different datasets. The above picture describes mean square errors for the probabilities of split, over ten replicates [when meaningful], the worst case being naïve Monte Carlo, with nested sampling and harmonic mean solutions close by. Similar assessments proceed from a comparison of Kullback-Leibler divergences. With the (predicatble?) note that “the methods do a better job approximating the marginal likelihood of more probable trees than less probable trees”. And massive variability for the poorest methods:

The comparison above does not account for time and since some methods are deterministic (and fast) there is little to do about this. The stepping steps solutions are very costly, while on the middle range bridge sampling outdoes path sampling. The assessment of nested sampling found in the conclusion is that it “would appear to be an unwise choice for estimating the marginal likelihoods of topologies, as it produces poor approximate posteriors” (p.12). Concluding at the Gamma Laplus approximation being the winner across all categories! (There is no ABC solution studied in this paper as the model likelihood can be computed in this setup, contrary to our own setting.)

computational statistics and molecular simulation [18w5023]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on November 19, 2018 by xi'an

The last day of the X fertilisation workshop at the casa matematicà Oaxaca, there were only three talks and only half of the participants. I lost the subtleties of the first talk by Andrea Agazzi on large deviations for chemical reactions, due to an emergency at work (Warwick). The second talk by Igor Barahona was somewhat disconnected from the rest of the conference, working on document textual analysis by way of algebraic data analysis (analyse des données) methods à la Benzécri. (Who was my office neighbour at Jussieu in the early 1990s.) In the last and final talk, Eric Vanden-Eijden made a link between importance sampling and PDMP, as an integral can be expressed via a trajectory of a path. A generalisation of path sampling, for almost any ODE. But also a competitor to nested sampling, waiting for the path to reach an Hamiltonian level, without some of the difficulties plaguing nested sampling like resampling. And involving continuous time processes. (Is there a continuous time version of ABC as well?!) Returning unbiased estimators of mean (the original integral) and variance. Example of a mixture example in dimension d=10 with k=50 components using only 100 paths.

unbiased estimation of log-normalising constants

Posted in Statistics with tags , , , , , , , on October 16, 2018 by xi'an

Maxime Rischard, Pierre Jacob, and Natesh Pillai [warning: both of whom are co-authors and friends of mine!] have just arXived a paper on the use of path sampling (a.k.a., thermodynamic integration) for log-constant unbiased approximation and the resulting consequences on Bayesian model comparison by X validation. If the goal is the estimation of the log of a ratio of two constants, creating an artificial path between the corresponding distributions and looking at the derivative at any point of this path of the log-density produces an unbiased estimator. Meaning that random sampling along the path, corrected by the distribution of the sampling still produces an unbiased estimator. From there the authors derive an unbiased estimator for any X validation objective function, CV(V,T)=-log p(V|T), taking m observations T in and leaving n-m observations T out… The marginal conditional log density in the criterion is indeed estimated by an unbiased path sampler, using a powered conditional likelihood. And unbiased MCMC schemes à la Jacob et al. for simulating unbiased MCMC realisations of the intermediary targets on the path. Tuning it towards an approximately constant cost for all powers.

So in all objectivity and fairness (!!!), I am quite excited by this new proposal within my favourite area! Or rather two areas since it brings together the estimation of constants and an alternative to Bayes factors for Bayesian testing. (Although the paper does not broach upon the calibration of the X validation values.)

the penalty method

Posted in Statistics, University life with tags , , , , , , , , , , on July 7, 2016 by xi'an

“In this paper we will make conceptually simple generalization of Metropolis algorithm, by adjusting the acceptance ratio formula so that the transition probabilities are unaffected by the fluctuations in the estimate of [the acceptance ratio]…”

Last Friday, in Paris-Dauphine, my PhD student Changye Wu showed me a paper of Ceperley and Dewing entitled the penalty method for random walks with uncertain energies. Of which I was unaware of (and which alas pre-dated a recent advance made by Changye).  Despite its physics connections, the paper is actually about estimating a Metropolis-Hastings acceptance ratio and correcting the Metropolis-Hastings move for this estimation. While there is no generic solution to this problem, assuming that the logarithm of the acceptance ratio estimate is Gaussian around the true log acceptance ratio (and hence unbiased) leads to a log-normal correction for the acceptance probability.

“Unfortunately there is a serious complication: the variance needed in the noise penalty is also unknown.”

Even when the Gaussian assumption is acceptable, there is a further issue with this approach, namely that it also depends on an unknown variance term. And replacing it with an estimate induces further bias. So it may be that this method has not met with many followers because of those two penalising factors. Despite precluding the pseudo-marginal approach of Mark Beaumont (2003) by a few years, with the later estimating separately numerator and denominator in the Metropolis-Hastings acceptance ratio. And hence being applicable in a much wider collection of cases. Although I wonder if some generic approaches like path sampling or the exchange algorithm could be applied on a generic basis… [I just realised the title could be confusing in relation with the current football competition!]

commentaries in financial econometrics

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on April 27, 2016 by xi'an

My comment(arie)s on the moment approach to Bayesian inference by Ron Gallant have appeared, along with other comment(arie)s:

Invited Article
Reflections on the Probability Space Induced by Moment Conditions with
Implications for Bayesian Inference
A. Ronald Gallant . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
Commentaries
Dante Amengual and Enrique Sentana .. . . . . . . . . . 248
John Geweke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .253
Jae-Young Kim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Oliver Linton and Ruochen Wu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .261
Christian P. Robert . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Christopher A. Sims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Wei Wei and Asger Lunde . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . . . . . . . . . .278
Author Response
A. Ronald Gallant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .284

formula (4) in Gallant's paperWhile commenting on commentaries is formally bound to induce an infinite loop [or l∞p], I remain puzzled by the main point of the paper, which is that setting a structural distribution on a moment function Z(x,θ) plus a prior p(θ) induces a distribution on the pair (x,θ) in a possibly weaker σ-algebra. (The two distributions may actually be incompatible.) Handling this framework requires checking that a posterior exists, which sounds rather unnatural (even though we also have to check properness of the posterior). And the meaning of such a posterior remains unclear, as for instance in this assertion that (4) above is a likelihood, when it does not define a density in x but on the object inside the exponential.

“…it is typically difficult to determine whether there exists a p(x|θ) such that the implied distribution of m(x,θ) is the one stated, and if not, what damage is done thereby” J. Geweke (p.254)

Continue reading

estimating constants [impression soleil levant]

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on April 25, 2016 by xi'an

The CRiSM workshop on estimating constants which took place here in Warwick from April 20 till April 22 was quite enjoyable [says most objectively one of the organisers!], with all speakers present to deliver their talks  (!) and around sixty participants, including 17 posters. It remains a exciting aspect of the field that so many and so different perspectives are available on the “doubly intractable” problem of estimating a normalising constant. Several talks and posters concentrated on Ising models, which always sound a bit artificial to me, but also are perfect testing grounds for approximations to classical algorithms.

On top of [clearly interesting!] talks associated with papers I had already read [and commented here], I had not previously heard about Pierre Jacob’s coupling SMC sequence, which paper is not yet out [no spoiler then!]. Or about Michael Betancourt’s adiabatic Monte Carlo and its connection with the normalising constant. Nicolas Chopin talked about the unnormalised Poisson process I discussed a while ago, with this feature that the normalising constant itself becomes an additional parameter. And that integration can be replaced with (likelihood) maximisation. The approach, which is based on a reference distribution (and an artificial logistic regression à la Geyer), reminded me of bridge sampling. And indirectly of path sampling, esp. when Merrilee Hurn gave us a very cool introduction to power posteriors in the following talk. Also mentioning the controlled thermodynamic integration of Chris Oates and co-authors I discussed a while ago. (Too bad that Chris Oates could not make it to this workshop!) And also pointing out that thermodynamic integration could be a feasible alternative to nested sampling.

Another novel aspect was found in Yves Atchadé’s talk about sparse high-dimension matrices with priors made of mutually exclusive measures and quasi-likelihood approximations. A simplified version of the talk being in having a non-identified non-constrained matrix later projected onto one of those measure supports. While I was aware of his noise-contrastive estimation of normalising constants, I had not previously heard Michael Gutmann give a talk on that approach (linking to Geyer’s 1994 mythical paper!). And I do remain nonplussed at the possibility of including the normalising constant as an additional parameter [in a computational and statistical sense]..! Both Chris Sherlock and Christophe Andrieu talked about novel aspects on pseudo-marginal techniques, Chris on the lack of variance reduction brought by averaging unbiased estimators of the likelihood and Christophe on the case of large datasets, recovering better performances in latent variable models by estimating the ratio rather than taking a ratio of estimators. (With Christophe pointing out that this was an exceptional case when harmonic mean estimators could be considered!)

Bayesian model averaging in astrophysics [guest post]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , , on August 12, 2015 by xi'an

.[Following my posting of a misfiled 2013 blog, Ewan Cameron told me of the impact of this paper in starting his own blog and I asked him for a guest post, resulting in this analysis, much deeper than mine. No warning necessary this time!]

Back in February 2013 when “Bayesian Model Averaging in Astrophysics: A Review” by Parkinson & Liddle (hereafter PL13) first appeared on the arXiv I was a keen, young(ish) postdoc eager to get stuck into debates about anything and everything ‘astro-statistical’. And with its seemingly glaring flaws, PL13 was more grist to the mill. However, despite my best efforts on various forums I couldn’t get a decent fight started over the right way to do model averaging (BMA) in astronomy, so out of sheer frustration two months later I made my own soapbox to shout from at Another Astrostatistics Blog. Having seen PL13 reviewed recently here on Xian’s Og it feels like the right time to revisit the subject and reflect on where BMA in astronomy is today.

As pointed out to me back in 2013 by Tom Loredo, the act of Bayesian model averaging has been around much longer than its name; indeed an early astronomical example appears in Gregory & Loredo (1992) in which the posterior mean representation of an unknown signal is constructed for an astronomical “light-curve”, averaging over a set of constant and periodic candidate models. Nevertheless the wider popularisation of model averaging in astronomy has only recently taken place through a variety of applications in cosmology: e.g. Liddle, Mukherjee, Parkinson & Wang (2006) and Vardanyan, Trotta & Silk (2011).

In contrast to earlier studies like Gregory & Loredo (1992)—or the classic review on BMA by Hoeting et al. (1999)—in which the target of model averaging is typically either a utility function, a set of future observations, or a latent parameter of the observational process (e.g. the unknown “light-curve” shape) shared naturally by all competing models, the proposal of cosmological BMA studies is to produce a model-averaged version of the posterior for a given ‘shared’ parameter: a so-called “model-averaged PDF”. This proposal didn’t sit well with me back in 2013, and it still doesn’t sit well with me today. Philosophically: without a model a parameter has no meaning, so why should we want to seek meaning in the marginalised distribution of a parameter over an entire set of models? And, practically: to put it another way, without knowing the model ‘label’ to which a given mass of model-averaged parameter probability belongs there’s nothing much useful we can do with this ‘PDF’: nothing much we can say about the data we’ve just analysed and nothing much we can say about future experiments. Whereas the space of the observed data is shared automatically by all competing models, it seems to me to be somehow “un-Bayesian” to place the further restriction that the parameters of separate models share the same scale and topology. I say “un-Bayesian” since this mode of model averaging suggests a formulation of the parameter space + prior pairing stronger than the statement of one’s prior beliefs for the distribution of observable data given the model. But I would be happy to hear arguments from the other side in the comments box below … ! Continue reading