## Bayesian model comparison with intractable constants

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , on February 8, 2016 by xi'an

Richard Everitt, Adam Johansen (Warwick), Ellen Rowing and Melina Evdemon-Hogan have updated [on arXiv] a survey paper on the computation of Bayes factors in the presence of intractable normalising constants. Apparently destined for Statistics and Computing when considering the style. A great entry, in particular for those attending the CRiSM workshop Estimating Constants in a few months!

A question that came to me from reading the introduction to the paper is why a method like Møller et al.’s (2006) auxiliary variable trick should be considered more “exact” than the pseudo-marginal approach of Andrieu and Roberts (2009) since the later can equally be seen as an auxiliary variable approach. The answer was on the next page (!) as it is indeed a special case of Andrieu and Roberts (2009). Murray et al. (2006) also belongs to this group with a product-type importance sampling estimator, based on a sequence of tempered intermediaries… As noted by the authors, there is a whole spectrum of related methods in this area, some of which qualify as exact-approximate, inexact approximate and noisy versions.

Their main argument is to support importance sampling as the method of choice, including sequential Monte Carlo (SMC) for large dimensional parameters. The auxiliary variable of Møller et al.’s (2006) is then part of the importance scheme. In the first toy example, a Poisson is opposed to a Geometric distribution, as in our ABC model choice papers, for which a multiple auxiliary variable approach dominates both ABC and Simon Wood’s synthetic likelihood for a given computing cost. I did not spot which artificial choice was made for the Z(θ)’s in both models, since the constants are entirely known in those densities. A very interesting section of the paper is when envisioning biased approximations to the intractable density. If only because the importance weights are most often biased due to the renormalisation (possibly by resampling). And because the variance derivations are then intractable as well. However, due to this intractability, the paper can only approach the impact of those approximations via empirical experiments. This leads however to the interrogation on how to evaluate the validity of the approximation in settings where truth and even its magnitude are unknown… Cross-validation and bootstrap type evaluations may prove too costly in realistic problems. Using biased solutions thus mostly remains an open problem in my opinion.

The SMC part in the paper is equally interesting if only because it focuses on the data thinning idea studied by Chopin (2002) and many other papers in the recent years. This made me wonder why an alternative relying on a sequence of approximations to the target with tractable normalising constants could not be considered. A whole sequence of auxiliary variable completions sounds highly demanding in terms of computing budget and also requires a corresponding sequence of calibrations. (Now, ABC fares no better since it requires heavy simulations and repeated calibrations, while further exhibiting a damning missing link with the target density. ) Unfortunately, embarking upon a theoretical exploration of the properties of approximate SMC is quite difficult, as shown by the strong assumptions made in the paper to bound the total variation distance to the true target.

## love-hate Metropolis algorithm

Posted in Books, pictures, R, Statistics, Travel with tags , , , , , , , , , on January 28, 2016 by xi'an

Hyungsuk Tak, Xiao-Li Meng and David van Dyk just arXived a paper on a multiple choice proposal in Metropolis-Hastings algorithms towards dealing with multimodal targets. Called “A repulsive-attractive Metropolis algorithm for multimodality” [although I wonder why XXL did not jump at the opportunity to use the “love-hate” denomination!]. The proposal distribution includes a [forced] downward Metropolis-Hastings move that uses the inverse of the target density π as its own target, namely 1/{π(x)+ε}. Followed by a [forced] Metropolis-Hastings upward move which target is {π(x)+ε}. The +ε is just there to avoid handling ratios of zeroes (although I wonder why using the convention 0/0=1 would not work). And chosen as 10⁻³²³ by default in connection with R smallest positive number. Whether or not the “downward” move is truly downwards and the “upward” move is truly upwards obviously depends on the generating distribution: I find it rather surprising that the authors consider the same random walk density in both cases as I would have imagined relying on a more dispersed distribution for the downward move in order to reach more easily other modes. For instance, the downward move could have been based on an anti-Langevin proposal, relying on the gradient to proceed further down…

This special choice of a single proposal however simplifies the acceptance ratio (and keeps the overall proposal symmetric). The final acceptance ratio still requires a ratio of intractable normalising constants that the authors bypass by Møller et al. (2006) auxiliary variable trick. While the authors mention the alternative pseudo-marginal approach of Andrieu and Roberts (2009), they do not try to implement it, although this would be straightforward here: since the normalising constants are the probabilities of accepting a downward and an upward move, respectively. Those can easily be evaluated at a cost similar to the use of the auxiliary variables. That is,

– generate a few moves from the current value and record the proportion p of accepted downward moves;
– generate a few moves from the final proposed value and record the proportion q of accepted downward moves;

and replace the ratio of intractable normalising constants with p/q. It is not even clear that one needs those extra moves since the algorithm requires an acceptance in the downward and upward moves, hence generate Geometric variates associated with those probabilities p and q, variates that can be used for estimating them. From a theoretical perspective, I also wonder if forcing the downward and upward moves truly leads to an improved convergence speed. Considering the case when the random walk is poorly calibrated for either the downward or upward move, the number of failed attempts before an acceptance may get beyond the reasonable.

As XXL and David pointed out to me, the unusual aspect of the approach is that here the proposal density is intractable, rather than the target density itself. This makes using Andrieu and Roberts (2009) seemingly less straightforward. However, as I was reminded this afternoon at the statistics and probability seminar in Bristol, the argument for the pseudo-marginal based on an unbiased estimator is that w Q(w|x) has a marginal in x equal to π(x) when the expectation of w is π(x). In thecurrent problem, the proposal in x can extended into a proposal in (x,w), w P(w|x) whose marginal is the proposal on x.

If we complement the target π(x) with the conditional P(w|x), the acceptance probability would then involve

{π(x’) P(w’|x’) / π(x) P(w|x)} / {w’ P(w’|x’) / w P(w|x)} = {π(x’) / π(x)} {w/w’}

so it seems the pseudo-marginal (or auxiliary variable) argument also extends to the proposal. Here is a short experiment that shows no discrepancy between target and histogram:

nozero=1e-300
#love-hate move
move<-function(x){
bacwa=1;prop1=prop2=rnorm(1,x,2)
while (runif(1)>{pi(x)+nozero}/{pi(prop1)+nozero}){
prop1=rnorm(1,x,2);bacwa=bacwa+1}
while (runif(1)>{pi(prop2)+nozero}/{pi(prop1)+nozero})
prop2=rnorm(1,prop1,2)
y=x
if (runif(1)<pi(prop2)*bacwa/pi(x)/fowa){
y=prop2;assign("fowa",bacwa)}
return(y)}
#arbitrary bimodal target
pi<-function(x){.25*dnorm(x)+.75*dnorm(x,mean=5)}
#running the chain
T=1e5
x=5*rnorm(1);luv8=rep(x,T)
fowa=1;prop1=rnorm(1,x,2) #initial estimate
while (runif(1)>{pi(x)+nozero}/{pi(prop1)+nozero}){
fowa=fowa+1;prop1=rnorm(1,x,2)}
for (t in 2:T)
luv8[t]=move(luv8[t-1])


## CRiSM workshop on estimating constants [#1]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , on January 19, 2016 by xi'an

The registration for the CRiSM workshop on estimating constants that Nial Friel, Helen Ogden and myself host next April 20-22 at the University of Warwick is now open. The plain registration fees are £40 and accommodation on the campus is available through the same form.

Since besides the invited talks, the workshop will host two poster session with speed (2-5mn) oral presentations, we encourage all interested researchers to submit a poster via the appropriate form. Once again, this should be an exciting two-day workshop, given the on-going activity in this area.

## rediscovering the harmonic mean estimator

Posted in Kids, Statistics, University life with tags , , , , , , , on November 10, 2015 by xi'an

When looking at unanswered questions on X validated, I came across a question where the author wanted to approximate a normalising constant

$N=\int g(x)\,\text{d}x\,,$

while simulating from the associated density, g. While seemingly unaware of the (huge) literature in the area, he re-derived [a version of] the harmonic mean estimate by considering the [inverted importance sampling] identity

$\int_\mathcal{X} \dfrac{\alpha(x)}{g(x)}p(x) \,\text{d}x=\int_\mathcal{X} \dfrac{\alpha(x)}{N} \,\text{d}x=\dfrac{1}{N}$

when α is a probability density and by using for α the uniform over the whole range of the simulations from g. This choice of α obviously leads to an estimator with infinite variance when the support of g is unbounded, but the idea can be easily salvaged by using instead another uniform distribution, for instance on an highest density region, as we studied in our papers with Darren Wraith and Jean-Michel Marin. (Unfortunately, the originator of the question does not seem any longer interested in the problem.)

## importance sampling with multiple MCMC sequences

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , on October 2, 2015 by xi'an

Vivek Roy, Aixian Tan and James Flegal arXived a new paper, Estimating standard errors for importance sampling estimators with multiple Markov chains, where they obtain a central limit theorem and hence standard error estimates when using several MCMC chains to simulate from a mixture distribution as an importance sampling function. Just before I boarded my plane from Amsterdam to Calgary, which gave me the opportunity to read it completely (along with half a dozen other papers, since it is a long flight!) I first thought it was connecting to our AMIS algorithm (on which convergence Vivek spent a few frustrating weeks when he visited me at the end of his PhD), because of the mixture structure. This is actually altogether different, in that a mixture is made of unnormalised complex enough densities, to act as an importance sampler, and that, due to this complexity, the components can only be simulated via separate MCMC algorithms. Behind this characterisation lurks the challenging problem of estimating multiple normalising constants. The paper adopts the resolution by reverse logistic regression advocated in Charlie Geyer’s famous 1994 unpublished technical report. Beside the technical difficulties in establishing a CLT in this convoluted setup, the notion of mixing importance sampling and different Markov chains is quite appealing, especially in the domain of “tall” data and of splitting the likelihood in several or even many bits, since the mixture contains most of the information provided by the true posterior and can be corrected by an importance sampling step. In this very setting, I also think more adaptive schemes could be found to determine (estimate?!) the optimal weights of the mixture components.

## on estimating constants…

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , on July 21, 2015 by xi'an

While I discussed on the ‘Og in the past the difference I saw between estimating an unknown parameter from a distribution and evaluating a normalising constant, evaluating such constants and hence handling [properly] doubly intractable models is obviously of the utmost importance! For this reason, Nial Friel, Helen Ogden and myself have put together a CRiSM workshop on the topic (with the tongue-in-cheek title of Estimating constants!), to be held at the University of Warwick next April 20-22.

The CRiSM workshop will focus on computational methods for approximating challenging normalising constants found in Monte Carlo, likelihood and Bayesian models. Such methods may be used in a wide range of problems: to compute intractable likelihoods, to find the evidence in Bayesian model selection, and to compute the partition function in Physics. The meeting will bring together different communities working on these related problems, some of which have developed original if little advertised solutions. It will also highlight the novel challenges associated with large data and highly complex models. Besides a dozen invited talks, the schedule will highlight two afternoon poster sessions with speed (2-5mn) oral presentations called ‘Elevator’ talks.

While 2016 is going to be quite busy with all kinds of meetings (MCMSkv, ISBA 2016, the CIRM Statistics month, AISTATS 2016, …), this should be an exciting two-day workshop, given the on-going activity in this area, and I thus suggest interested readers to mark the dates in their diary. I will obviously keep you posted about registration and accommodation when those entries are available.

## vertical likelihood Monte Carlo integration

Posted in Books, pictures, Running, Statistics, Travel, University life with tags , , , , , , , on April 17, 2015 by xi'an

A few months ago, Nick Polson and James Scott arXived a paper on one of my favourite problems, namely the approximation of normalising constants (and it went way under my radar, as I only became aware of it quite recently!, then it remained in my travel bag for an extra few weeks…). The method for approximating the constant Z draws from an analogy with the energy level sampling methods found in physics, like the Wang-Landau algorithm. The authors rely on a one-dimensional slice sampling representation of the posterior distribution and [main innovation in the paper] add a weight function on the auxiliary uniform. The choice of the weight function links the approach with the dreaded harmonic estimator (!), but also with power-posterior and bridge sampling. The paper recommends a specific weighting function, based on a “score-function heuristic” I do not get. Further, the optimal weight depends on intractable cumulative functions as in nested sampling. It would be fantastic if one could draw directly from the prior distribution of the likelihood function—rather than draw an x [from the prior or from something better, as suggested in our 2009 Biometrika paper] and transform it into L(x)—but as in all existing alternatives this alas is not the case. (Which is why I find the recommendations in the paper for practical implementation rather impractical, since, were the prior cdf of L(X) available, direct simulation of L(X) would be feasible. Maybe not the optimal choice though.)

“What is the distribution of the likelihood ordinates calculated via nested sampling? The answer is surprising: it is essentially the same as the distribution of likelihood ordinates by recommended weight function from Section 4.”

The approach is thus very much related to nested sampling, at least in spirit. As the authors later demonstrate, nested sampling is another case of weighting, Both versions require simulations under truncated likelihood values. Albeit with a possibility of going down [in likelihood values] with the current version. Actually, more weighting could prove [more] efficient as both the original nested and vertical sampling simulate from the prior under the likelihood constraint. Getting away from the prior should help. (I am quite curious to see how the method is received and applied.)