## sandwiching a marginal

Posted in Books, pictures, Statistics, University life on March 8, 2021 by xi'an

When working recently on a paper on estimating the marginal likelihood, I was pointed to this earlier 2015 paper by Roger Grosse, Zoubin Ghahramani and Ryan Adams, which had escaped me till now. The beginning of the paper discusses the shortcomings of importance sampling (when simulating from the prior) and of the harmonic mean (when simulating from the posterior) as solutions. And of annealed importance sampling (when simulating from a sequence, which sequence?!, of targets). The authors end up proposing a sequential Monte Carlo or (posterior) particle learning solution. A remark on annealed importance sampling is that there exist both a forward and a backward version for estimating the marginal likelihood, either starting from a simulation from the prior (easy) or from a simulation from the posterior (hard!). As in, e.g., Nicolas Chopin’s thesis, the intermediate steps are constructed from a subsample of the entire sample.

“In this context, unbiasedness can be misleading: because partition function estimates can vary over many orders of magnitude, it’s common for an unbiased estimator to drastically underestimate Ζ with overwhelming probability, yet occasionally return extremely large estimates. (An extreme example is likelihood weighting, which is unbiased, but is extremely unlikely to give an accurate answer for a high-dimensional model.) Unless the estimator is chosen very carefully, the variance is likely to be extremely large, or even infinite.”
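
The phenomenon described in this quote can be reproduced on a toy problem. The sketch below is not taken from the paper: it assumes a hypothetical Normal model, y ~ N_d(θ, I) with θ ~ N_d(0, τ² I), for which the marginal likelihood is available in closed form, and it repeats likelihood weighting (averaging the likelihood over prior draws) to see how often the log estimate falls below the true value. The dimension `d`, the prior scale `tau` and the number of replicates are arbitrary choices.

```r
# Toy check, assuming y ~ N_d(theta, I) and theta ~ N_d(0, tau^2 I) (hypothetical setting)
set.seed(1)
d   <- 20
tau <- 3
y   <- rnorm(d, 0, sqrt(1 + tau^2))                     # data drawn from the marginal
logZ <- sum(dnorm(y, 0, sqrt(1 + tau^2), log = TRUE))   # exact log marginal likelihood

# log of the likelihood-weighting estimate based on N prior draws
loglw <- function(N = 1e3) {
  theta <- matrix(rnorm(N * d, 0, tau), N, d)
  # log-likelihood of each prior draw (rows of theta)
  ll <- -0.5 * d * log(2 * pi) - 0.5 * rowSums(sweep(theta, 2, y)^2)
  m <- max(ll)
  m + log(mean(exp(ll - m)))                            # log-sum-exp for stability
}

est <- replicate(200, loglw())
mean(est < logZ)      # proportion of replicates falling below the true value
summary(est - logZ)   # errors on the log scale
```

Even though the estimator of Z itself is unbiased, its logarithm is biased downwards (Jensen's inequality), which is why such estimates are usually read as stochastic lower bounds on log p(y).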

One novel aspect of the paper is to advocate for the simultaneous use of different methods, producing both lower and upper bounds on the marginal p(y) and waiting for them to get close enough. It is however delicate to find upper bounds, except when using the dreaded harmonic mean estimator. (A nice trick associated with reverse annealed importance sampling is that the reverse chain can be simulated exactly from the posterior if associated with simulated data, except I am rather lost at the connection between the actual and simulated data.) In a sequential harmonic mean version, the authors also look at the dangers of using a harmonic mean but argue that the potentially infinite variance of the weights does not matter so much for log p(y), without displaying any variance calculation… The paper also contains a substantial experimental section that compares the different solutions mentioned so far, plus others like nested sampling. Which did not work poorly in the experiment (see below) but could not be trusted to provide a lower or an upper bound. The computing time to achieve some level of agreement is however rather daunting. An interesting read definitely (and I wonder what happened to the paper in the end).

## marginal likelihood with large amounts of missing data

Posted in Books, pictures, Statistics on October 20, 2020 by xi'an

In 2018, Panayiota Touloupou, research fellow at Warwick, and her co-authors published a paper in Bayesian Analysis that somehow escaped my radar, despite standing in my first circle of topics of interest! They construct an importance sampling approach to the approximation of the marginal likelihood, the importance function being approximated from a preliminary MCMC run, and consider the special case when the sampling density (i.e., the likelihood) can be represented as the marginal of a joint density. While this demarginalisation perspective is rather usual, the central point they make is that it is more efficient to estimate the sampling density based on the auxiliary or latent variables than to consider the joint posterior distribution of parameter and latent in the importance sampler. This induces a considerable reduction in dimension and hence explains (in part) why the approach should prove more efficient. Even though the approximation itself is costly, at about 5 seconds per marginal likelihood. But a nice feature of the paper is to include the above graph, which shows both computing time and variability for different methods (the blue range corresponding to the marginal importance solution, the red range to RJMCMC and the green range to Chib’s estimate). Note that bridge sampling does not appear on the picture but returns a variability that is similar to the proposed methodology.
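
To make the first ingredient concrete, here is a minimal sketch of importance sampling with a proposal fitted on a preliminary posterior sample. It is a simplification and not the authors' method: there are no latent variables, the model is a hypothetical conjugate Normal one (y_i ~ N(θ, 1), θ ~ N(0, τ²)), exact posterior draws stand in for the MCMC output, and the Normal proposal with a standard deviation inflated by 1.5 is an arbitrary choice. The conjugate setting gives the exact value for comparison.

```r
# Assumed toy model: y_i ~ N(theta, 1), theta ~ N(0, tau^2) (not the paper's setting)
set.seed(42)
n <- 50; tau <- 2
y <- rnorm(n, 1, 1)
loglik   <- function(theta)   # vectorised in theta
  -0.5 * n * log(2 * pi) - 0.5 * (sum(y^2) - 2 * theta * sum(y) + n * theta^2)
logprior <- function(theta) dnorm(theta, 0, tau, log = TRUE)

# exact posterior N(m, v), used here as a stand-in for a preliminary MCMC run
v <- 1 / (n + 1 / tau^2); m <- v * sum(y)
post <- rnorm(1e4, m, sqrt(v))

# Normal importance function fitted on the posterior sample, with inflated scale
mu_q <- mean(post); sd_q <- 1.5 * sd(post)
theta <- rnorm(1e4, mu_q, sd_q)
logw  <- loglik(theta) + logprior(theta) - dnorm(theta, mu_q, sd_q, log = TRUE)
log_is <- max(logw) + log(mean(exp(logw - max(logw))))

# exact log marginal via the identity p(y) = L(theta) pi(theta) / pi(theta | y)
log_true <- loglik(m) + logprior(m) - dnorm(m, m, sqrt(v), log = TRUE)
c(importance = log_is, exact = log_true)
```

Inflating the proposal scale keeps the importance weights bounded in this example, which is the usual precaution when the importance function is built from posterior output.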

## wrapped Normal distribution

Posted in Books, R, Statistics on April 14, 2020 by xi'an

One version of the wrapped Normal distribution on (0,1) is expressed as a sum of Normal distributions with means shifted by all relative integers $\psi(x;\mu,\sigma)=\sum_{k\in\mathbb Z}\varphi(x;\mu+k,\sigma)\mathbb I_{(0,1)}(x)$

which, while a parameterised density, has imho no particular statistical appeal over the use of other series. It was nonetheless the centre of a series of questions on X validated in the past weeks. Curiously used as the basis of a random walk type move over the unit cube along with a uniform component. Simulating from this distribution is easily done when seeing it as an infinite mixture of truncated Normal distributions, since the weights are easily computed $\sum_{k\in\mathbb Z}\overbrace{[\Phi_\sigma(1-\mu-k)-\Phi_\sigma(-\mu-k)]}^{p_k(\mu,\sigma)}\times\dfrac{\varphi_\sigma(x-\mu-k)\mathbb I_{(0,1)}(x)}{\Phi_\sigma(1-\mu-k)-\Phi_\sigma(-\mu-k)}$

Hence coding simulations as

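    # density of the wrapped Normal at a single point x, truncating the series
    # (the truncation assumes mu lies within a few sig of the unit interval)
    wrap <- function(x, mu, sig){
      ter = trunc(5*sig + 1)
      return(sum(dnorm(x + (-ter):ter, mu, sig)))}
    # N draws from the mixture: Uniform(0,1) with probability beta, wrapped Normal otherwise
    siw = function(N=1e4, beta=.5, mu, sig){
      unz = (runif(N) < beta)   # TRUE when the uniform component is picked
      ter = trunc(5*sig + 1)
      # Normal cdf at the integer cut points and mass of each unit interval
      qrbz = diff(prbz <- pnorm(-mu + (-ter):ter, sd=sig))
      # pick an interval, returned as an index between 1 and 2*ter
      ndx = sample((-ter+1):ter, N, replace=TRUE, prob=qrbz) + ter
      # inverse cdf draw within that interval, shifted back to (0,1)
      z = sig*qnorm(prbz[ndx] + runif(N)*qrbz[ndx]) - ndx + mu + ter + 1
      return(c(runif(sum(unz)), z[!unz]))}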
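
A quick sanity check on the series density, as a sketch rather than the code used for the post: it relies on direct wrapping (a Normal draw taken modulo 1), a different route from the truncated-mixture sampler coded above, and it leaves out the uniform component. The function name `dwrap`, the truncation level `K` and the parameter values are assumptions made for the illustration.

```r
set.seed(3)
mu <- 0.3; sig <- 0.4                      # hypothetical parameter values
# truncated series for the wrapped Normal density, vectorised in x
dwrap <- function(x, mu, sig, K = ceiling(5 * sig) + 1)
  sapply(x, function(u) sum(dnorm(u + (-K):K, mu, sig)))
# direct wrapping: a Normal draw taken modulo 1
x <- rnorm(1e5, mu, sig) %% 1
total <- integrate(dwrap, 0, 1, mu = mu, sig = sig)$value     # mass over (0,1)
half  <- integrate(dwrap, 0, 0.5, mu = mu, sig = sig)$value   # P(X < 1/2) from the density
c(total = total, half = half, empirical = mean(x < 0.5))
```

The total mass should be close to one and the two evaluations of P(X < ½) should agree up to Monte Carlo error.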


I also checked that the harmonic mean estimator was functioning for this density, predictably since it is lower bounded on (0,1). The prolix originator of the question was also wondering at the mean of the wrapped Normal distribution, which I derived as (predictably) $\mu+\sum_{k\in\mathbb Z} k\,p_k(\mu,\sigma)$

but could not simplify any further except for μ=0,½,1, when it is ½. A simulated evaluation of the mean as a function of μ shows a vaguely sinusoidal pattern, also predictably periodic and unsurprisingly antisymmetric, and apparently independent of the scale parameter σ…

## an arithmetic mean identity

Posted in Books, pictures, R, Statistics, Travel, University life on December 19, 2019 by xi'an

A 2017 paper by Ana Pajor published in Bayesian Analysis addresses my favourite problem [of computing the marginal likelihood] and which I discussed on the ‘Og, linking with another paper by Lenk published in 2012 in JCGS. That I already discussed here last year. Lenk’s (2009) paper is actually using a technique related to the harmonic mean correction based on HPD regions Darren Wraith and myself proposed at MaxEnt 2009. And which Jean-Michel and I presented at Frontiers of statistical decision making and Bayesian analysis in 2010. As I had only vague memories about the arithmetic mean version, we discussed the paper together with graduate students in Paris Dauphine.

The arithmetic mean solution, representing the marginal likelihood as the prior average of the likelihood, is a well-known approach, also used as the basis for nested sampling. With the improvement consisting in restricting the simulation to a set Ð with sufficiently high posterior probability. I am quite uneasy about P(Ð|y) estimated by 1 as the shape of the set containing all posterior simulations is completely arbitrary, parameterisation dependent, and very random since based on the extremes of this posterior sample. Plus, the set Ð converges to the entire parameter space with the number of posterior simulations. An alternative that we advocated in our earlier paper is to take Ð as the HPD region or a variational Bayes version. But the central issue with the HPD regions is how to construct these from an MCMC output and how to compute both P(Ð) and P(Ð|y). It does not seem like a good idea to set P(Ð|y) to the intended α level for the HPD coverage. Using a non-parametric version for estimating Ð could be in the end the only reasonable solution.
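
The restricted version rests on the identity p(y) = ∫_Ð L(θ)π(θ)dθ / P(Ð|y), a direct consequence of Bayes' formula. The sketch below illustrates it on a hypothetical conjugate Normal model (y_i ~ N(θ, 1), θ ~ N(0, τ²)) and is neither the code of the paper nor the one behind the numbers reported in this post. As an assumption made for the illustration, Ð is an interval built from the quantiles of one half of a posterior sample, and P(Ð|y) is estimated on the other half rather than being set to a nominal level.

```r
# Assumed toy model: y_i ~ N(theta, 1), theta ~ N(0, tau^2)
set.seed(7)
n <- 50; tau <- 2
y <- rnorm(n, 1, 1)
loglik <- function(theta)     # vectorised in theta
  -0.5 * n * log(2 * pi) - 0.5 * (sum(y^2) - 2 * theta * sum(y) + n * theta^2)
v <- 1 / (n + 1 / tau^2); m <- v * sum(y)
post <- rnorm(1e4, m, sqrt(v))            # exact posterior draws, standing in for MCMC

# set D from the first half of the sample, P(D|y) estimated on the second half
D  <- unname(quantile(post[1:5000], c(.025, .975)))
pD <- mean(post[5001:10000] > D[1] & post[5001:10000] < D[2])

# numerator: prior average of the likelihood restricted to D
theta <- rnorm(1e5, 0, tau)
ll    <- loglik(theta)
inD   <- theta > D[1] & theta < D[2]
c0    <- max(ll[inD])
log_num <- c0 + log(mean(ifelse(inD, exp(ll - c0), 0)))
log_am  <- log_num - log(pD)

# exact value via p(y) = L(theta) pi(theta) / pi(theta | y) at theta = m
log_true <- loglik(m) + dnorm(m, 0, tau, log = TRUE) - dnorm(m, m, sqrt(v), log = TRUE)
c(restricted_arithmetic = log_am, exact = log_true)
```

Splitting the posterior sample avoids using the same draws both to define Ð and to evaluate its posterior probability, which is one way, among others, of sidestepping the P(Ð|y)≈1 shortcut discussed above.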

As a test, I reran the example of a conjugate normal model used in the paper, based on (exact) simulations from both the prior and the posterior, and obtained approximations that were all close to the true marginal. With Chib’s being exact in that case (of course!), and an arithmetic mean surprisingly close without an importance correction:

> print(c(hame,chme,came,chib))
 -107.6821 -106.5968 -115.5950 -115.3610


Both harmonic versions are of the right order but not trustworthy, the truncation to such a set Ð as the one chosen in this paper having little impact.

## 19 dubious ways to compute the marginal likelihood

Posted in Books, Statistics on December 11, 2018 by xi'an

A recent arXival on nineteen different [and not necessarily dubious!] ways to approximate the marginal likelihood of a given topology of a phylogeny tree reminded me of our San Antonio survey with Jean-Michel Marin. This includes a version of the Laplace approximation called Laplus (!), accounting for the fact that branch lengths on the tree are positive but may have a MAP at zero. Using a Beta, Gamma, or log-Normal distribution instead of a Normal. For importance sampling, the proposals are derived from either the Laplus (!) approximate distributions or from the variational Bayes solution (based on a Normal product). Harmonic means are still used here despite the obvious danger, along with a defensive version that mixes prior and posterior. Naïve Monte Carlo means simulating from the prior, while bridge sampling seems to use samples from prior and posterior distributions. Path and modified path sampling versions are those proposed in 2008 by Nial Friel and Tony Pettitt (QUT). Stepping stone sampling appears like another version of path sampling, also based on a telescopic product of ratios of normalising constants, the generalised version relying on a normalising reference distribution that needs to be calibrated. CPO and PPD in the above table are two versions based on posterior predictive density estimates.

When running the comparison between so many contenders, the ground truth is selected as the values returned by MrBayes in a massive MCMC experiment amounting to 7.5 billion generations. For five different datasets. The above picture describes mean square errors for the probabilities of split, over ten replicates [when meaningful], the worst case being naïve Monte Carlo, with nested sampling and harmonic mean solutions close by. Similar assessments proceed from a comparison of Kullback-Leibler divergences. With the (predictable?) note that “the methods do a better job approximating the marginal likelihood of more probable trees than less probable trees”. And massive variability for the poorest methods:

The comparison above does not account for time and since some methods are deterministic (and fast) there is little to do about this. The stepping stone solutions are very costly, while in the middle range bridge sampling outdoes path sampling. The assessment of nested sampling found in the conclusion is that it “would appear to be an unwise choice for estimating the marginal likelihoods of topologies, as it produces poor approximate posteriors” (p.12). Concluding with the Gamma Laplus approximation being the winner across all categories! (There is no ABC solution studied in this paper as the model likelihood can be computed in this setup, contrary to our own setting.)
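
For readers unfamiliar with the telescopic product behind stepping stone sampling, here is a minimal sketch on a hypothetical conjugate Normal model (y_i ~ N(θ, 1), θ ~ N(0, τ²)), unrelated to the phylogenetic setting of the paper. With Z_β = ∫ L(θ)^β π(θ)dθ, the marginal likelihood is the product of the ratios Z_{β_k}/Z_{β_{k-1}} along a ladder 0=β_0<…<β_K=1, each ratio being the expectation of L(θ)^{β_k−β_{k-1}} under the power posterior at β_{k-1}. In this toy model the power posteriors are Normal and can be simulated exactly, which replaces the MCMC runs needed in practice. The cubic temperature ladder, `K` and `N` are arbitrary choices.

```r
# Assumed toy model: y_i ~ N(theta, 1), theta ~ N(0, tau^2)
set.seed(11)
n <- 50; tau <- 2
y <- rnorm(n, 1, 1)
loglik <- function(theta)     # vectorised in theta
  -0.5 * n * log(2 * pi) - 0.5 * (sum(y^2) - 2 * theta * sum(y) + n * theta^2)

K <- 32
beta <- ((0:K) / K)^3         # temperature ladder from 0 (prior) to 1 (posterior)
N <- 2000                     # draws per temperature
log_ratio <- numeric(K)
for (k in 1:K) {
  b <- beta[k]
  v <- 1 / (b * n + 1 / tau^2)                      # power posterior variance
  theta <- rnorm(N, v * b * sum(y), sqrt(v))        # exact power posterior draws
  lw <- (beta[k + 1] - b) * loglik(theta)
  log_ratio[k] <- max(lw) + log(mean(exp(lw - max(lw))))   # log of one ratio estimate
}
log_ss <- sum(log_ratio)      # telescopic product, on the log scale

# exact value via p(y) = L(theta) pi(theta) / pi(theta | y) at the posterior mean
v1 <- 1 / (n + 1 / tau^2); m1 <- v1 * sum(y)
log_true <- loglik(m1) + dnorm(m1, 0, tau, log = TRUE) - dnorm(m1, m1, sqrt(v1), log = TRUE)
c(stepping_stone = log_ss, exact = log_true)
```

The cost noted above is visible even here: every rung of the ladder needs its own sample, so K separate simulations (MCMC runs in a realistic model) go into a single estimate.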