## [more than] everything you always wanted to know about marginal likelihood

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , , , , , , on February 10, 2022 by xi'an

Earlier this year, F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago have arXived an updated version of their massive survey on marginal likelihood computation. Which I can only warmly recommend to anyone interested in the matter! Or looking for a base camp to initiate a graduate project. They break the methods into four families

1. Deterministic approximations (e.g., Laplace approximations)
2. Methods based on density estimation (e.g., Chib’s method, aka the candidate’s formula)
3. Importance sampling, including sequential Monte Carlo, with a subsection connecting with MCMC
4. Vertical representations (mostly, nested sampling)

Besides sheer computation, the survey also broaches upon issues like improper priors and alternatives to Bayes factors. The parts I would have done in more details are reversible jump MCMC and the long-lasting impact of Geyer’s reverse logistic regression (with the noise contrasting extension), even though the link with bridge sampling is briefly mentioned there. There is even a table reporting on the coverage of earlier surveys. Of course, the following postnote of the manuscript

The Christian Robert’s blog deserves a special mention , since Professor C. Robert has devoted several entries of his blog with very interesting comments regarding the marginal likelihood estimation and related topics.

does not in the least make me less objective! Some of the final recommendations

• use of Naive Monte Carlo [simulate from the prior] should be always considered [assuming a proper prior!]
• a multiple-try method is a good choice within the MCMC schemes
• optimal umbrella sampling estimator is difficult and costly to implement , so its best performance may not be achieved in practice
• adaptive importance sampling uses the posterior samples to build a suitable normalized proposal, so it benefits from localizing samples in regions of high posterior probability while preserving the properties of standard importance sampling
• Chib’s method is a good alternative, that provide very good performances [but is not always available]
• the success [of nested sampling] in the literature is surprising.

## approximation of Bayes Factors via mixing

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on December 21, 2020 by xi'an

A [new version of a] paper by Chenguang Dai and Jun S. Liu got my attention when it appeared on arXiv yesterday. Due to its title which reminded me of a solution to the normalising constant approximation that we proposed in the 2010 nested sampling evaluation paper we wrote with Nicolas. Recovering bridge sampling—mentioned by Dai and Liu as an alternative to their approach rather than an early version—by a type of Charlie Geyer (1990-1994) trick. (The attached slides are taken from my MCMC graduate course, with a section on the approximation of Bayesian normalising constants I first wrote for a short course at Jim Berger’s 70th anniversary conference, in San Antonio.)

A difference with the current paper is that the authors “form a mixture distribution with an adjustable mixing parameter tuned through the Wang-Landau algorithm.” While we chose it by hand to achieve sampling from both components. The weight is updated by a simple (binary) Wang-Landau version, where the partition is determined by which component is simulated, ie by the mixture indicator auxiliary variable. Towards using both components on an even basis (à la Wang-Landau) and stabilising the resulting evaluation of the normalising constant. More generally, the strategy applies to a sequence of surrogate densities, which are chosen by variational approximations in the paper.

## neural importance sampling

Posted in Books, Kids, pictures, Statistics, University life with tags , , , , , , , , , , on May 13, 2020 by xi'an

Dennis Prangle signaled this paper during his talk of last week, first of our ABC ‘minars now rechristened as The One World ABC Seminar to join the “One World xxx Seminar” franchise! The paper is written by Thomas Müller and co-authors, all from Disney research [hence the illustration], and we discussed it in our internal reading seminar at Dauphine. The authors propose to parameterise the importance sampling density via neural networks, just like Dennis is using auto-encoders. Starting with the goal of approximating

$\mathfrak I=\int_{\mathfrak D} f(x)\text{d}x$

(where they should assume f to be non-negative for the following), the authors aim at simulating from an approximation of f(x)/ℑ since this “ideal” pdf would give zero variance.

“Unfortunately, the above integral is often not solvable in closed form, necessitating its estimation with another Monte Carlo estimator.”

Among the discussed solutions, the Latent-Variable Model one is based on a pdf represented as a marginal. A mostly intractable integral, which the authors surprisingly seem to deem an issue as they do not mention the standard solution of simulating from the joint and using the conditional in the importance weight. (Or even more surprisingly and obviously wrongly see the latter as a biased approximation to the weight.)

“These “autoregressive flows” offer the desired exact evaluation of q(x;θ). Unfortunately, they generally only permit either efficient sample generation or efficient evaluation of q(x;θ), which makes them prohibitively expensive for our application to Mont Carlo integration.”

When presenting normalizing flows, namely the representation of the simulation output as the result of an invertible mapping of a standard (e.g., Gaussian or Uniform) random variable, x=h(u,θ), which can itself be decomposed into a composition of suchwise functions. And I am thus surprised this cannot be done in an efficient manner if transforms are well chosen…

“The key proposition of Dinh et al. (2014) is to focus on a specific class of mappings—referred to as coupling layers—that admit Jacobian matrices where determinants reduce to the product of diagonal terms.

Using a transform with a triangular Jacobian at each stage has the appeal of keeping the change of variable simple and allowing for non-linear transforms. Namely piecewise polynomials. When reading the one-blob (!) encoding , I am however uncertain the approach is more than the choice of a particular functional basis, as for instance wavelets (which may prove more costly to handle, granted!)

“Given that NICE scales well to high-dimensional problems…”

It is always unclear to me why almost every ML paper feels the urge to redefine & motivate the KL divergence. And to recall that it avoids bothering about the normalising constant. Looking at the variance of the MC estimator & seeking minimal values is praiseworthy, but only when the variance exists. What are the guarantees on the density estimate for this to happen? And where are the arguments for NICE scaling nicely to high dimensions? Interesting intrusion of path sampling, but is it of any use outside image analysis—I had forgotten Eric Veach’s original work was on light transport—?

## 19 dubious ways to compute the marginal likelihood

Posted in Books, Statistics with tags , , , , , , , , , , on December 11, 2018 by xi'an

A recent arXival on nineteen different [and not necessarily dubious!] ways to approximate the marginal likelihood of a given topology of a philogeny tree that reminded me of our San Antonio survey with Jean-Michel Marin. This includes a version of the Laplace approximation called Laplus (!), accounting for the fact that branch lengths on the tree are positive but may have a MAP at zero. Using a Beta, Gamma, or log-Normal distribution instead of a Normal. For importance sampling, the proposals are derived from either the Laplus (!) approximate distributions or from the variational Bayes solution (based on an Normal product). Harmonic means are still used here despite the obvious danger, along with a defensive version that mixes prior and posterior. Naïve Monte Carlo means simulating from the prior, while bridge sampling seems to use samples from prior and posterior distributions. Path and modified path sampling versions are those proposed in 2008 by Nial Friel and Tony Pettitt (QUT). Stepping stone sampling appears like another version of path sampling, also based on a telescopic product of ratios of normalising constants, the generalised version relying on a normalising reference distribution that need be calibrated. CPO and PPD in the above table are two versions based on posterior predictive density estimates.

When running the comparison between so many contenders, the ground truth is selected as the values returned by MrBayes in a massive MCMC experiment amounting to 7.5 billions generations. For five different datasets. The above picture describes mean square errors for the probabilities of split, over ten replicates [when meaningful], the worst case being naïve Monte Carlo, with nested sampling and harmonic mean solutions close by. Similar assessments proceed from a comparison of Kullback-Leibler divergences. With the (predicatble?) note that “the methods do a better job approximating the marginal likelihood of more probable trees than less probable trees”. And massive variability for the poorest methods:

The comparison above does not account for time and since some methods are deterministic (and fast) there is little to do about this. The stepping steps solutions are very costly, while on the middle range bridge sampling outdoes path sampling. The assessment of nested sampling found in the conclusion is that it “would appear to be an unwise choice for estimating the marginal likelihoods of topologies, as it produces poor approximate posteriors” (p.12). Concluding at the Gamma Laplus approximation being the winner across all categories! (There is no ABC solution studied in this paper as the model likelihood can be computed in this setup, contrary to our own setting.)

## computational statistics and molecular simulation [18w5023]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on November 19, 2018 by xi'an

The last day of the X fertilisation workshop at the casa matematicà Oaxaca, there were only three talks and only half of the participants. I lost the subtleties of the first talk by Andrea Agazzi on large deviations for chemical reactions, due to an emergency at work (Warwick). The second talk by Igor Barahona was somewhat disconnected from the rest of the conference, working on document textual analysis by way of algebraic data analysis (analyse des données) methods à la Benzécri. (Who was my office neighbour at Jussieu in the early 1990s.) In the last and final talk, Eric Vanden-Eijden made a link between importance sampling and PDMP, as an integral can be expressed via a trajectory of a path. A generalisation of path sampling, for almost any ODE. But also a competitor to nested sampling, waiting for the path to reach an Hamiltonian level, without some of the difficulties plaguing nested sampling like resampling. And involving continuous time processes. (Is there a continuous time version of ABC as well?!) Returning unbiased estimators of mean (the original integral) and variance. Example of a mixture example in dimension d=10 with k=50 components using only 100 paths.