## Archive for importance sampling

## importance sampling with multiple MCMC sequences

Posted in Mountains, pictures, Statistics, Travel, University life with tags adaptive MCMC, Ames, AMIS, Amsterdam, Charlie Geyer, importance sampling, Iowa, MCMC, Monte Carlo Statistical Methods, normalising constant, splitting data on October 2, 2015 by xi'an

**V**ivek Roy, Aixin Tan and James Flegal arXived a new paper, Estimating standard errors for importance sampling estimators with multiple Markov chains, where they obtain a central limit theorem and hence standard error estimates when using several MCMC chains to simulate from a mixture distribution as an importance sampling function. I read it in full on my plane from Amsterdam to Calgary (along with half a dozen other papers, since it is a long flight!). I first thought it connected with our AMIS algorithm (on whose convergence Vivek spent a few frustrating weeks when he visited me at the end of his PhD), because of the mixture structure. It is actually altogether different: the mixture is made of complex enough unnormalised densities, so as to act as an importance sampler, and, because of this complexity, the components can only be simulated via separate MCMC algorithms. Behind this characterisation lurks the challenging problem of estimating several normalising constants at once. The paper adopts the resolution by reverse logistic regression advocated in Charlie Geyer’s famous 1994 unpublished technical report. Beyond the technical difficulty of establishing a CLT in this convoluted setup, the notion of mixing importance sampling and different Markov chains is quite appealing, especially in the domain of “tall” data and of splitting the likelihood into several or even many bits, since the mixture then contains most of the information provided by the true posterior and can be corrected by an importance sampling step. In this very setting, I also think more adaptive schemes could be devised to determine (estimate?!) the optimal weights of the mixture components.
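To fix ideas (and not as a rendering of the paper's estimator), here is a minimal Python sketch of the general scheme: two unnormalised components, each explored by its own random-walk Metropolis chain, combined into an equal-weight mixture importance function. All names (f1, f2, rwm, d1, d2) are my own toy choices, and the component normalising constants d1 and d2, which the paper would estimate by reverse logistic regression, are simply plugged in with their known values here.

```python
import numpy as np

rng = np.random.default_rng(0)

p_unnorm = lambda x: np.exp(-0.5 * x**2)           # target, unnormalised
f1 = lambda x: np.exp(-0.5 * (x + 1.0)**2 / 0.5)   # component 1, unnormalised
f2 = lambda x: np.exp(-0.5 * (x - 1.0)**2 / 0.5)   # component 2, unnormalised

def rwm(f, x0, n, scale=1.0):
    """Random-walk Metropolis chain targeting the unnormalised density f."""
    x, out = x0, np.empty(n)
    for i in range(n):
        prop = x + scale * rng.standard_normal()
        if rng.random() < f(prop) / f(x):
            x = prop
        out[i] = x
    return out

n = 5000
chains = [rwm(f1, -1.0, n), rwm(f2, 1.0, n)]

# Normalising constants of f1 and f2: known here, estimated by
# reverse logistic regression in the paper's setting.
d1 = d2 = np.sqrt(2 * np.pi * 0.5)

x = np.concatenate(chains)
q = 0.5 * f1(x) / d1 + 0.5 * f2(x) / d2   # equal-weight mixture importance function
w = p_unnorm(x) / q                        # unnormalised importance weights
print("self-normalised estimate of E_p[X^2]:", np.sum(w * x**2) / np.sum(w))
```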

## Bayesian model averaging in astrophysics [guest post]

Posted in Books, pictures, Statistics, University life with tags astrostatistics, cosmology, Dickey-Savage ratio, guest post, harmonic mean estimator, importance sampling, marginal likelihood, model averaging, Monte Carlo approximations, nested sampling, path sampling, thermodynamic catastrophe, thermodynamic integration, utility function on August 12, 2015 by xi'an

*[Following my posting of a misfiled 2013 blog, Ewan Cameron told me of the impact of this paper in starting his own blog and I asked him for a guest post, resulting in this analysis, much deeper than mine. No warning necessary this time!]*

**B**ack in February 2013 when “Bayesian Model Averaging in Astrophysics: A Review” by Parkinson & Liddle (hereafter PL13) first appeared on the arXiv I was a keen, young(ish) postdoc eager to get stuck into debates about anything and everything ‘astro-statistical’. And with its seemingly glaring flaws, PL13 was more grist to the mill. However, despite my best efforts on various forums I couldn’t get a decent fight started over the right way to do model averaging (BMA) in astronomy, so out of sheer frustration two months later I made my own soapbox to shout from at Another Astrostatistics Blog. Having seen PL13 reviewed recently here on Xian’s Og it feels like the right time to revisit the subject and reflect on where BMA in astronomy is today.

As pointed out to me back in 2013 by Tom Loredo, the act of Bayesian model averaging has been around much longer than its name; indeed an early astronomical example appears in Gregory & Loredo (1992) in which the posterior mean representation of an unknown signal is constructed for an astronomical “light-curve”, averaging over a set of constant and periodic candidate models. Nevertheless the wider popularisation of model averaging in astronomy has only recently taken place through a variety of applications in cosmology: e.g. Liddle, Mukherjee, Parkinson & Wang (2006) and Vardanyan, Trotta & Silk (2011).

In contrast to earlier studies like Gregory & Loredo (1992)—or the classic review on BMA by Hoeting et al. (1999)—in which the target of model averaging is typically either a utility function, a set of future observations, or a latent parameter of the observational process (e.g. the unknown “light-curve” shape) shared naturally by all competing models, the proposal of cosmological BMA studies is to produce *a model-averaged version of the posterior for a given ‘shared’ parameter*: a so-called “model-averaged PDF”. This proposal didn’t sit well with me back in 2013, and it still doesn’t sit well with me today. *Philosophically*: without a model a parameter has no meaning, so why should we want to seek meaning in the marginalised distribution of a parameter over an entire set of models? And, *practically*: without knowing the model ‘label’ to which a given mass of model-averaged parameter probability belongs, there’s nothing much useful we can do with this ‘PDF’: nothing much we can say about the data we’ve just analysed and nothing much we can say about future experiments. Whereas the space of the observed data is shared automatically by all competing models, it seems to me somehow “un-Bayesian” to place the further restriction that the parameters of separate models share the same scale and topology. I say “un-Bayesian” since this mode of model averaging suggests a formulation of the parameter space + prior pairing stronger than the statement of one’s prior beliefs for the distribution of observable data given the model. But I would be happy to hear arguments from the other side in the comments box below…!
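For concreteness, the “model-averaged PDF” at stake is the usual mixture of within-model posteriors weighted by posterior model probabilities, which only makes sense once a parameter θ is identified across the models M₁,…,M_K:

```latex
p(\theta \mid D) \;=\; \sum_{k=1}^{K} p(M_k \mid D)\, p(\theta \mid D, M_k),
\qquad
p(M_k \mid D) \;\propto\; p(D \mid M_k)\, p(M_k).
```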

## vertical likelihood Monte Carlo integration

Posted in Books, pictures, Running, Statistics, Travel, University life with tags Chicago Booth School of Business, importance sampling, Monte Carlo integration, Monte Carlo Statistical Methods, nested sampling, normalising constant, slice sampling, Wang-Landau algorithm on April 17, 2015 by xi'an

**A** few months ago, Nick Polson and James Scott arXived a paper on one of my favourite problems, namely the approximation of normalising constants. (It went way under my radar, as I only became aware of it quite recently, and it then remained in my travel bag for an extra few weeks…) The method for approximating the constant Z draws on an analogy with the energy-level sampling methods found in physics, like the Wang-Landau algorithm. The authors rely on a one-dimensional slice sampling representation of the posterior distribution and [main innovation in the paper] add a weight function on the auxiliary uniform. The choice of the weight function links the approach with the dreaded harmonic mean estimator (!), but also with power posteriors and bridge sampling. The paper recommends a specific weight function, based on a “score-function heuristic” I do not get. Further, the optimal weight depends on intractable cumulative distribution functions, as in nested sampling. It would be fantastic if one could draw directly from the prior distribution of the likelihood function—rather than draw an x [from the prior or from something better, as suggested in our 2009 Biometrika paper] and transform it into L(x)—but as in all existing alternatives this alas is not the case. (Which is why I find the recommendations in the paper for practical implementation rather impractical: were the prior cdf of L(X) available, direct simulation of L(X) would be feasible. Maybe not the optimal choice, though.)
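As I read it, the vertical (slice) representation the approach builds on is, in my own notation, the identity

```latex
Z \;=\; \int L(x)\,\pi(x)\,\mathrm{d}x
  \;=\; \int_0^\infty \Pr\nolimits_{\pi}\!\big(L(X) > u\big)\,\mathrm{d}u ,
```

with the weight function then amounting to importance sampling on the auxiliary vertical variable u rather than on x.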

“What is the distribution of the likelihood ordinates calculated via nested sampling? The answer is surprising: it is essentially the same as the distribution of likelihood ordinates by recommended weight function from Section 4.”

The approach is thus very much related to nested sampling, at least in spirit. As the authors later demonstrate, nested sampling is another case of weighting, and both versions require simulations under a truncated likelihood value, albeit with the possibility of going down [in likelihood values] in the current version. Actually, more weighting could prove [more] efficient, as both the original nested and vertical samplers simulate from the prior under the likelihood constraint. Getting away from the prior should help. (I am quite curious to see how the method is received and applied.)

## a neat (theoretical) Monte Carlo result

Posted in Books, Statistics, University life with tags confidence sets, importance sampling, Monte Carlo integration, unbiased estimation, variance reduction on December 19, 2014 by xi'an

**M**ark Huber just arXived a short paper where he develops a Monte Carlo approach that bounds the probability of large (relative) errors,

ℙ( |μ̂ − μ| > εμ ) ≤ δ,

by computing a lower bound on the sample size r. I wondered at the presence of μ in the bound, as it indicates the approach is not translation invariant. One reason is that the standard deviation of the simulated random variables is bounded by cμ. Another reason is that Mark uses as his estimator the median

μ̂ = med( S₁R₁, …, S_kR_k ),

where the S’s are partial averages of sufficient length and the R’s are independent uniforms over (1−ε,1+ε): using those uniforms may improve the coverage of given intervals, but it also means that the absolute scale of the error is multiplied by the scale of S, namely μ. I first thought that some a posteriori recentering could improve the bound, but since this does not impact the variance of the simulated random variables, I doubt it is possible.
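As a toy illustration of the construction just described (the values of k, the block length, and ε are arbitrary choices of mine, not Huber's calibrated ones), the estimate is the median of the partial averages, each multiplied by an independent Uniform(1−ε,1+ε) variate:

```python
import numpy as np

rng = np.random.default_rng(1)
eps, k, block = 0.1, 15, 200        # arbitrary toy values, not the paper's calibration

# Nonnegative data whose standard deviation is bounded by a multiple of the mean
mu_true = 3.0
x = rng.exponential(mu_true, size=k * block)

S = x.reshape(k, block).mean(axis=1)        # partial averages S_1, ..., S_k
R = rng.uniform(1 - eps, 1 + eps, size=k)   # independent smoothing uniforms
mu_hat = np.median(S * R)                   # randomised-median estimate
print(mu_hat)
```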

## density normalization for MCMC algorithms

Posted in Statistics, University life with tags harmonic mean estimator, importance sampling, Laplace approximation, MCMC, MCQMC, Monte Carlo Statistical Methods, normalising constant, Rao-Blackwellisation, untractable normalizing constant on November 6, 2014 by xi'an

**A**nother paper addressing the estimation of the normalising constant and the wealth of available solutions just came out on arXiv, with the full title of “Target density normalization for Markov chain Monte Carlo algorithms“, written by Allen Caldwell and Chang Liu. (I became aware of it by courtesy of Ewan Cameron, as it appeared in the physics section of arXiv. It is actually a wee bit annoying that papers in the subcategory “Data Analysis, Statistics and Probability” of physics do not get an automated reposting on the statistics lists…)

In this paper, the authors compare three approaches to the problem of finding the constant

∫ f(x) dx

when the density f is unnormalised, i.e., in more formal terms, when f is proportional to a probability density (and available):

- an “arithmetic mean”, which is an importance sampler based on reducing the integration volume to a neighbourhood ω of the global mode. This neighbourhood is chosen as a hypercube and the importance function turns out to be the uniform over this hypercube. The corresponding estimator is then a rescaled version of the average of f over uniform simulations in ω (a toy sketch appears further below).
- a “harmonic mean”, of all choices!, with again an integration over the neighbourhood ω of the global mode in order to avoid the almost sure infinite variance of harmonic mean estimators.
- a Laplace approximation, using the target at the mode and the Hessian at the mode as well.

The paper then goes on to compare those three solutions on a few examples, demonstrating how the diameter of the hypercube can be calibrated towards a minimum (estimated) uncertainty. The rather anticlimactic conclusion is that the arithmetic mean is the most reliable solution, as harmonic means may fail in larger dimensions and, more importantly, fail to signal their failure, while Laplace approximations only approximate quasi-Gaussian densities well…

What I find most interesting in this paper is the idea of using only one part of the integration space to compute the integral, even though it is not exactly new. Focussing on a specific region ω has pros and cons. The pros are that the reduction to a modal region lessens the need for absolute MCMC convergence, helps in selecting alternative proposals, and also protects from the worst consequences of using a dreaded harmonic mean. The cons are that the region needs to be well-identified, which puts requirements on the MCMC kernel, and that the estimate is the product of two estimates, the frequency part being driven by binomial noise. I also like very much the idea of calibrating the diameter Δ of the hypercube ex post by estimating the uncertainty.
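Here is a rough sketch of how I read the restricted “arithmetic mean” solution, in a toy two-dimensional Gaussian setting of my own (the Metropolis kernel and the half-width delta are arbitrary, not calibrated as in the paper): the constant is recovered as the uniform-over-ω estimate of the integral of f over ω, divided by the MCMC frequency of visits to ω.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 2
f = lambda x: np.exp(-0.5 * np.sum(x**2, axis=-1))   # unnormalised 2d Gaussian target

def rwm(n, scale=1.0):
    """Toy random-walk Metropolis chain targeting f (up to a constant)."""
    x, out = np.zeros(d), np.empty((n, d))
    for i in range(n):
        prop = x + scale * rng.standard_normal(d)
        if rng.random() < f(prop) / f(x):
            x = prop
        out[i] = x
    return out

chain = rwm(50_000)
mode = np.zeros(d)          # mode of f, assumed known here
delta = 1.0                 # half-width of the hypercube omega (arbitrary choice)

# (a) uniform importance sampling over omega: estimates the integral of f over omega
u = mode + rng.uniform(-delta, delta, size=(20_000, d))
vol = (2 * delta) ** d
int_omega = vol * f(u).mean()

# (b) MCMC frequency of omega: estimates (integral of f over omega) / Z, binomial noise
in_omega = np.all(np.abs(chain - mode) < delta, axis=1).mean()

Z_hat = int_omega / in_omega
print(Z_hat, 2 * np.pi)     # true Z = 2*pi for this target
```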

As an aside, the paper mentions most of the alternative solutions I just presented in my Monte Carlo graduate course two days ago (like nested or bridge or Rao-Blackwellised sampling, including our proposal with Darren Wraith), but dismisses them as not “directly applicable in an MCMC setting”, i.e., without modifying this setting. I unsurprisingly dispute this labelling, both because something like the Laplace approximation requires extra work on the MCMC output (and, once done, this work can lead to advanced Laplace methods like INLA) and because other methods could be considered as well (for instance, bridge sampling over several hypercubes). As shown in the recent paper by Mathieu Gerber and Nicolas Chopin (soon to be discussed at the RSS!), MCQMC has also become a feasible alternative that would compete well with the methods studied in this paper.

Overall, this paper comes within a long list of papers on approximations of normalising constants. I do not find the Markov chain aspect of MCMC particularly compelling or specific here, once the effective sample size is accounted for. It would be nice to find generic ways of optimising the visit to the hypercube ω *and* of estimating efficiently the weight of ω. The comparison is solely run over examples, and they all rely on a proper characterisation of the hypercube and on the ability to simulate f efficiently over that hypercube.

## my life as a mixture [BAYSM 2014, Wien]

Posted in Books, Kids, Mountains, pictures, Statistics, Travel, University life with tags Austria, Bayes factor, Bayesian tests of hypotheses, BAYSM, Gibbs sampling, importance sampling, label switching, Linz, mixtures, Vienna, WU Wirtschaftsuniversität Wien on September 12, 2014 by xi'an

**N**ext week I am giving a talk at BAYSM in Vienna. BAYSM is the Bayesian *Young* Statisticians Meeting, so one may wonder why, but with Chris Holmes and Mike West, we got invited as more… erm… senior speakers! So I decided to give a definitely *senior* talk on a thread pursued throughout my career so far, namely mixtures. Plus, it also relates to work by the other senior speakers. Here is the abstract for the talk:

Mixtures of distributions are fascinating objects for statisticians in that they both constitute a straightforward extension of standard distributions and offer a complex benchmark for evaluating statistical procedures, with a likelihood both computable in linear time and enjoying an exponential number of local modes (and sometimes an infinite number of modes). This fruitful playground appeals in particular to Bayesians as it constitutes an easily understood challenge to the use of improper priors and of objective Bayes solutions. This talk will review some ancient and some more recent works of mine on mixtures of distributions, from the 1990 Gibbs sampler to the 2000 label switching and to later studies of Bayes factor approximations, nested sampling performances, improper priors, improved importance samplers, ABC, and an inverse perspective on the Bayesian approach to testing of hypotheses.
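For the record, the computational tension alluded to in the abstract is the standard one for a k-component mixture: the likelihood is a product of sums, computable in O(nk) operations, yet it expands into kⁿ allocation terms once the latent component labels are made explicit, which is what lies behind the combinatorial multimodality:

```latex
L(\theta, p \mid x_1,\ldots,x_n)
  \;=\; \prod_{i=1}^{n} \sum_{j=1}^{k} p_j\, f(x_i \mid \theta_j)
  \;=\; \sum_{z \in \{1,\ldots,k\}^{n}} \prod_{i=1}^{n} p_{z_i}\, f(x_i \mid \theta_{z_i}).
```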

**I** am very grateful to the scientific committee for this invitation, as it will give me the opportunity to meet the new generation, learn from them and, in addition, discover Vienna, where I have never been despite several visits to Austria, including to its top, the Großglockner. I will also give a seminar in Linz the day before, at the Institut für Angewandte Statistik.

## PMC for combinatoric spaces

Posted in Statistics, University life with tags AMIS, CUNY, importance sampling, Monte Carlo Statistical Methods, PMC, population Monte Carlo, simulation, unbiasedness on July 28, 2014 by xi'an

**I** received this interesting [edited] email from Xiannian Fan at CUNY:

I am trying to use PMC to solve the Bayesian network structure learning problem (which lives in a combinatorial space, not a continuous space).

In PMC, the proposal distributions q_{i,t} can be very flexible, even specific to each iteration and each instance. My problem occurs because of the combinatorial space. For importance sampling, the requirement on the proposal distribution q is

support(p) ⊂ support(q) (*)

For PMC, what is the support of the proposal distribution at iteration t? Is it

support(p) ⊂ ∪_i support(q_{i,t}) (**)

or does (*) apply to every q_{i,t}? For a continuous problem, this is not a big issue: we can use a Normal random walk to make local moves satisfying (*). But for combinatorial search, a local move only reaches a finite set of states, which does not satisfy (*). For example, for a permutation (1,3,2,4), a random swap only has choose(4,2)=6 neighbouring states.

**F**airly interesting question about population Monte Carlo (PMC), a sequential version of importance sampling we worked on with French colleagues in the early 2000’s. (The name population Monte Carlo comes from Iba, 2000.) While MCMC samplers do not have to cover the whole support of p at each iteration, it is much harder for importance samplers, as their core justification is to provide an unbiased estimator of all integrals of interest. Thus, when using the PMC estimate,

(1/n) ∑_{i,t} {p(x_{i,t}) / q_{i,t}(x_{i,t})} h(x_{i,t}),   x_{i,t} ~ q_{i,t}(x),

this estimator is only unbiased when the supports of the q_{i,t}’s all contain the support of p. The only other cases I can think of are

- associating the q_{i,t}’s with a partition S_{i,t} of the support of p and using instead

  ∑_{i,t} {p(x_{i,t}) / q_{i,t}(x_{i,t})} h(x_{i,t}),   x_{i,t} ~ q_{i,t}(x),

- resorting to AMIS under the assumption (**) and using instead

  (1/n) ∑_{i,t} {p(x_{i,t}) / ∑_{j,t} q_{j,t}(x_{i,t})} h(x_{i,t}),   x_{i,t} ~ q_{i,t}(x),

  as sketched at the end of this post,

but I am open to further suggestions!
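As a minimal illustration of the second option (in a toy continuous setting of my own, not Xiannian's combinatorial problem), the AMIS-type deterministic-mixture weighting divides the target by the average of all proposal densities, which is why condition (**) on the union of the supports is the relevant one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
p = lambda x: stats.norm.pdf(x, 0.0, 1.0)        # target density (normalised here)

# Two proposals; only their mixture is meant to cover the target well
q_list = [stats.norm(-1.5, 0.7), stats.norm(1.5, 0.7)]
n_per = 2000

xs = np.concatenate([q.rvs(n_per, random_state=rng) for q in q_list])
# Deterministic-mixture denominator: average of all proposal densities at each point
q_mix = np.mean([q.pdf(xs) for q in q_list], axis=0)

w = p(xs) / q_mix                                # mixture importance weights
print("mixture-IS estimate of E_p[X^2]:", np.mean(w * xs**2))   # should be close to 1
```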