## Bayesian model averaging in astrophysics [guest post]

*[Following my posting of a misfiled 2013 blog, Ewan Cameron told me of the impact of this paper in starting his own blog and I asked him for a guest post, resulting in this analysis, much deeper than mine. No warning necessary this time!]*

**B**ack in February 2013 when “Bayesian Model Averaging in Astrophysics: A Review” by Parkinson & Liddle (hereafter PL13) first appeared on the arXiv I was a keen, young(ish) postdoc eager to get stuck into debates about anything and everything ‘astro-statistical’. And with its seemingly glaring flaws, PL13 was more grist to the mill. However, despite my best efforts on various forums I couldn’t get a decent fight started over the right way to do Bayesian model averaging (BMA) in astronomy, so out of sheer frustration two months later I made my own soapbox to shout from at Another Astrostatistics Blog. Having seen PL13 reviewed recently here on Xian’s Og, it feels like the right time to revisit the subject and reflect on where BMA in astronomy is today.

As pointed out to me back in 2013 by Tom Loredo, the act of Bayesian model averaging has been around much longer than its name; indeed an early astronomical example appears in Gregory & Loredo (1992) in which the posterior mean representation of an unknown signal is constructed for an astronomical “light-curve”, averaging over a set of constant and periodic candidate models. Nevertheless the wider popularisation of model averaging in astronomy has only recently taken place through a variety of applications in cosmology: e.g. Liddle, Mukherjee, Parkinson & Wang (2006) and Vardanyan, Trotta & Silk (2011).

In contrast to earlier studies like Gregory & Loredo (1992)—or the classic review on BMA by Hoeting et al. (1999)—in which the target of model averaging is typically either a utility function, a set of future observations, or a latent parameter of the observational process (e.g. the unknown “light-curve” shape) shared naturally by all competing models, the proposal of cosmological BMA studies is to produce *a model-averaged version of the posterior for a given ‘shared’ parameter*: a so-called “model-averaged PDF”. This proposal didn’t sit well with me back in 2013, and it still doesn’t sit well with me today. *Philosophically*: without a model a parameter has no meaning, so why should we seek meaning in the distribution of a parameter marginalised over an entire set of models? And *practically*: without knowing the model ‘label’ to which a given mass of model-averaged parameter probability belongs, there’s nothing much useful we can do with this ‘PDF’: nothing much we can say about the data we’ve just analysed and nothing much we can say about future experiments. Whereas the space of observed data is shared automatically by all competing models, it seems to me somehow “un-Bayesian” to impose the further restriction that the parameters of separate models share the same scale and topology. I say “un-Bayesian” since this mode of model averaging demands a formulation of the parameter space + prior pairing stronger than the statement of one’s prior beliefs for the distribution of observable data given the model. But I would be happy to hear arguments from the other side in the comments box below … !
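To make the object under critique concrete: the “model-averaged PDF” is just a posterior-probability-weighted mixture of the per-model posteriors. A minimal sketch, assuming two hypothetical models with made-up Gaussian posteriors for a shared parameter and invented evidence values (nothing here comes from PL13):

```python
import numpy as np

# Hypothetical two-model example: both expose a parameter w, with made-up
# Gaussian posteriors and invented marginal likelihoods (evidences).
theta = np.linspace(-3.0, 3.0, 601)
dx = theta[1] - theta[0]

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

post_m1 = gaussian_pdf(theta, -1.0, 0.3)   # posterior of w under model 1
post_m2 = gaussian_pdf(theta, +1.0, 0.3)   # posterior of w under model 2
evidence = np.array([1.0, 3.0])            # invented marginal likelihoods
prior_model = np.array([0.5, 0.5])         # equal prior model probabilities

# Posterior model probabilities via Bayes' theorem on model space
post_model = evidence * prior_model
post_model /= post_model.sum()

# The "model-averaged PDF": a mixture weighted by posterior model probability;
# once mixed, the model labels are gone -- the point of the critique above
averaged = post_model[0] * post_m1 + post_model[1] * post_m2
print(post_model)            # [0.25 0.75]
print(averaged.sum() * dx)   # ~1.0: still a normalised density
```

Note that the bimodal `averaged` density on its own cannot tell you which model contributed which lump of mass, which is exactly the complaint.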

A couple of ancillary points I noticed on re-reading PL13 that might give non-astronomers some insight into the world of astro-statistics:

“Importance sampling is the idea that you can generate your samples completely stochastically from some distribution.”

“For example, if there is some new data that will update constraints already in place from some previous experiment, you could simply recompute the new posterior (P) at points given by a previous MCMC chain (the values of q), importance sampling using the chain as the proposal distribution.”

(1) Although it proves a natural entry point to the field of Monte Carlo approximations for probabilistic integrals, importance sampling is still not a widely known technique in astronomy, and where it does appear it will often be used somewhat unconventionally, as suggested by the quote above. In fact the reweighting of draws from an MCMC chain to approximate functionals of a (hopefully nearby) alternative target distribution—a pseudo-pseudo marginal method?—is by far the most common scheme passing under the banner of importance sampling in the astronomical literature (e.g. Hogg, Myers & Bovy 2010; Asensio Ramos & Martínez González 2014; Green et al. 2015). As noted by PL13, population Monte Carlo algorithms featuring importance sampling moves have nevertheless been well promoted for cosmological model selection purposes (e.g. Kilbinger et al. 2010); though particle filtering techniques in general have not yet become popular, despite the inherent temporal and/or spatial indexing of noisy latent variables in many common astronomical problems.
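The reweighting scheme described above is easy to sketch. In this toy example (entirely made up, not from any of the cited papers), exact draws standing in for an ‘old’ MCMC chain targeting N(0, 1) are reweighted by self-normalised importance sampling to estimate the posterior mean under a nearby ‘new’ target N(0.5, 1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for an existing MCMC chain: exact draws from an
# "old" posterior N(0, 1)
old_draws = rng.normal(0.0, 1.0, size=200_000)

def log_old(x):   # old log-density (up to an additive constant)
    return -0.5 * x ** 2

def log_new(x):   # "new" target N(0.5, 1), e.g. after extra data arrive
    return -0.5 * (x - 0.5) ** 2

# Importance weights: new density over old, evaluated at the old draws,
# then self-normalised so that unknown normalising constants cancel
log_w = log_new(old_draws) - log_old(old_draws)
w = np.exp(log_w - log_w.max())
w /= w.sum()

# Self-normalised importance-sampling estimate of the new posterior mean
new_mean = np.sum(w * old_draws)
print(new_mean)   # close to the true value of 0.5
```

The approach works only while the old chain still covers the new target well; as the two distributions drift apart the effective sample size of the reweighted draws collapses.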

“Nested sampling is a Monte Carlo method that recasts the multi-dimensional evidence integral as a one-dimensional integral in terms of the prior mass …”

(2) Nested sampling (Skilling 2006) is easily the most popular algorithm for estimating marginal likelihoods in astronomy and cosmology. While perhaps over-used in this field (i.e., used in cases where simple importance sampling or Gauss-Hermite quadrature would suffice) and ‘fetishised’ for its (not unique) ability to ‘simultaneously compute the evidence and posterior’, nested sampling *is* under-used and under-appreciated in mainstream statistics (though cf. Chopin & Robert 2009). The nested sampling identity—namely, the observation that for a non-negative random variable the expectation equals the integral of the survival function—is often encountered in introductions to measure-theoretic probability (it appears in both Billingsley and Feller) and in proofs using the Prohorov metric; likewise, the algorithmic ‘trick’ of nested sampling (the probabilistic association of prior mass with the volume truncated by the lowest likelihood amongst the current sample of ‘live points’) is a clever use of familiar order statistics. Extensions to the basic nested sampling algorithm include diffusive nested sampling (Brewer, Partay & Csanyi 2010) and ‘PolyChord’ (i.e. nested sampling with slice sampling; Handley, Hobson & Lasenby 2015). And while astronomers can’t help but insist on the prior for nested sampling problems being cast as a uniform hypercube (e.g. PL13; Allison & Dunkley 2014; Buchner 2014), there is no such restriction intrinsic to the algorithm, which holds irrespective of the topology of the prior: i.e., one could imagine applications to a variety of semi-parametric Bayesian models as well.
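For readers unfamiliar with the algorithm, here is a minimal nested sampling sketch on a one-dimensional toy with a known evidence; the constrained-prior draws are done by naive rejection sampling, which is hopeless in high dimensions but fine for illustration (this toy is mine, not from PL13 or Skilling):

```python
import numpy as np

rng = np.random.default_rng(0)

def loglike(x):
    # Sharply peaked Gaussian likelihood under a U(0, 1) prior;
    # the evidence is analytically ~ 0.05 * sqrt(2 * pi) ~ 0.1253
    return -0.5 * ((x - 0.5) / 0.05) ** 2

n_live = 400
live = rng.uniform(0.0, 1.0, size=n_live)
live_logl = loglike(live)

log_z = -np.inf   # running log-evidence
log_x = 0.0       # log of the remaining prior mass
for i in range(2000):
    worst = np.argmin(live_logl)
    log_x_new = log_x - 1.0 / n_live   # expected log-shrinkage per step
    # Weight = (prior mass of the stripped shell) * (its lowest likelihood)
    log_w = np.log(np.exp(log_x) - np.exp(log_x_new)) + live_logl[worst]
    log_z = np.logaddexp(log_z, log_w)
    log_x = log_x_new
    # Refresh the worst point with a constrained-prior draw by rejection
    while True:
        cand = rng.uniform(0.0, 1.0)
        if loglike(cand) > live_logl[worst]:
            break
    live[worst], live_logl[worst] = cand, loglike(cand)

# Termination: remaining prior mass times the average live-point likelihood
live_max = live_logl.max()
log_mean_l = live_max + np.log(np.mean(np.exp(live_logl - live_max)))
log_z = np.logaddexp(log_z, log_x + log_mean_l)
print(np.exp(log_z))   # should land near 0.1253
```

Replacing the rejection step with a short MCMC run from the constrained prior recovers the practical algorithm; nothing above assumes a hypercube beyond the convenience of the U(0, 1) toy prior.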

“Similarly, we could use the chain to estimate the evidence, … Eqn 6”

(3) Aside from the silliness of Eqn 6 (why write down this expression in the first place, even if it is acknowledged as neither unbiased nor convergent?), the above quote does remind me how much enthusiasm remains in the astronomical community for approximations to the marginal likelihood, or approximations to the Bayes factor, requiring only samples from the posterior itself. These range from the ridiculous (e.g. the so-called ‘stabilised’ harmonic mean estimator of Tuomi & Jones 2012) to the innovative (e.g. the Numerical Lebesgue Algorithm of Weinberg 2012; though I still find this approach problematic); and in all cases they are applied in a slightly idiosyncratic manner (e.g. when the Savage-Dickey Density Ratio is computed using a parametric approximation to the PDF of the larger model *à la* Trotta 2008 rather than through a more general estimator like that of Marin & Robert 2010). (Surprisingly though, the technique suggested by Perrakis, Ntzoufras & Tsionas 2013 has yet to be embraced.) On the other hand, there remains a great fear of thermodynamic integration and path sampling methods following a poor experience recounted by Beltran et al. 2005 and the (not entirely unwarranted, but to my mind hyperbolic) discussion of the ‘thermodynamic catastrophe’ in Skilling (2006).
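As a quick illustration of why the harmonic mean estimator invites the ‘ridiculous’ label, one can try it on a conjugate Gaussian toy where the marginal likelihood is available in closed form (this example is mine, not from any of the cited papers):

```python
import numpy as np

rng = np.random.default_rng(3)

# Conjugate toy: x_i ~ N(theta, 1) with prior theta ~ N(0, 1),
# so the marginal likelihood is available in closed form.
n = 20
x = rng.normal(loc=1.0, scale=1.0, size=n)
s1, s2 = x.sum(), (x ** 2).sum()

# Exact log-evidence: the data are jointly N(0, I + 11^T)
log_z_true = (-0.5 * n * np.log(2.0 * np.pi) - 0.5 * np.log(n + 1.0)
              - 0.5 * (s2 - s1 ** 2 / (n + 1.0)))

# Exact posterior draws: theta | x ~ N(s1 / (n + 1), 1 / (n + 1))
draws = rng.normal(s1 / (n + 1.0), np.sqrt(1.0 / (n + 1.0)), size=100_000)

def loglike(theta):
    return -0.5 * n * np.log(2.0 * np.pi) - 0.5 * (s2 - 2.0 * theta * s1 + n * theta ** 2)

# Harmonic mean estimator: 1/Z is estimated by the posterior average of 1/L.
# It is unbiased for 1/Z, but the weights are heavy-tailed, so in finite
# samples the log-evidence is noisy and typically biased high.
neg_ll = -loglike(draws)
log_z_hm = -(neg_ll.max() + np.log(np.mean(np.exp(neg_ll - neg_ll.max()))))
print(log_z_true, log_z_hm)
```

Re-running with different seeds makes the pathology obvious: the truth is fixed, while the harmonic mean estimate wanders, driven by whichever rare low-likelihood draws happen to appear in the chain.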


August 13, 2015 at 11:54 am

I’d like to know why nested sampling is under-used in statistics. What is the typical case for which nested sampling works better than SMC samplers to estimate the normalizing constant and sampling from the posterior?

August 13, 2015 at 10:08 pm

I would expect NS (with MCMC sampling from the likelihood-constrained prior to refresh the live particles) to outperform the equivalent off-the-shelf SMC sampler for problems where (i) there is a phase change that makes mixing on the thermodynamic path difficult and (ii) the posterior is difficult to explore with SMC (e.g. it is hard to find a good refreshment kernel).

More generally the trick of probabilistically associating particles drawn from some distribution ranked according to some total order on the set (e.g. likelihood) with the proportion of mass below the value of the ordering function at each particle must surely have uses not yet thought of.

August 14, 2015 at 1:14 am

The trouble with nested sampling is those constrained-prior draws — even for simple prior distributions, concentration of measure is going to push more and more of the samples right onto the boundary as the constraints tighten. Diffusive MCMC transitions will fail miserably, and even constrained HMC will end up bouncing too much and only slowly moving transverse to the boundary. Using the likelihood to inform a metric for RHMC will help (the trajectories will tend to move transverse and avoid bouncing), but I never got that far in the implementation. I had reasonable success with nested sampling in my graduate thesis, but it was a relatively small problem (fewer than 10 dimensions) and even then there were some weird behaviors that I never fully understood. Ultimately nested sampling is rather elegant in theory, but in practice we’re still very far away from a robust implementation.

August 15, 2015 at 11:26 am

I followed this debate up with Pierre last night at the pub and we decided the only sensible solution was a head to head battle—NS vs SMC—with each contestant submitting a (somewhat realistic / non-toy) problem they believe their algorithm can manage but the other will not. Obviously it’ll be a somewhat iterative process as we refine the ‘rules’ to make a fair comparison (e.g. for MCMC moves we’re using an equivalent sampler), but I suspect Pierre would appreciate it if you had a specific, non-toy example where concentration of measure would be problematic for NS but not SMC.

Btw. I understand what you mean by concentration of measure, but I’m not sure that I see it as being a limiting problem for NS as I would expect a clever choice of MCMC proposal based on the current set of live particles would suffice to limit the inefficiency imposed by the concentration of measure at the boundary.

August 16, 2015 at 8:40 am

>> I followed this debate up with Pierre last night at the pub and we decided the only sensible solution was a head to head battle—NS vs SMC—with each contestant submitting a (somewhat realistic / non-toy) problem they believe their algorithm can manage but the other will not. Obviously it’ll be a somewhat iterative process as we refine the ‘rules’ to make a fair comparison (e.g. for MCMC moves we’re using an equivalent sampler), but I suspect Pierre would appreciate it if you had a specific, non-toy example where concentration of measure would be problematic for NS but not SMC.

Let me be clear, I’m not advocating SMC over NS! I don’t think SMC will be all that better in these cases and, ultimately, I think Adiabatic Monte Carlo will be the right answer. Of course it needs lots more work.

>> Btw. I understand what you mean by concentration of measure, but I’m not sure that I see it as being a limiting problem for NS as I would expect a clever choice of MCMC proposal based on the current set of live particles would suffice to limit the inefficiency imposed by the concentration of measure at the boundary.

I think anything with weak non-identifiabilites in the likelihood would be sufficiently hard. Even linear correlations might suffice in enough dimensions. Or perhaps a hierarchical model.

August 12, 2015 at 10:51 pm

@Dan: So this is my first point, applying BMA to parameters certainly requires this additional condition that the model parameters be “the same”, whereas for predictions on the data space this is already in-built to the Bayesian machine. But even then the model averaged PDF isn’t to my mind telling us anything useful as (generally) we’ve thrown away our knowledge of which model each bit of mass came from, which we need to know in order to avoid mistaken conclusions. [My opinion on this is similar to my opinion on the usefulness of “mean inclusion probability” as a summary of covariate importance when searching through subsets of possible covariates to define a parsimonious regression model.]

August 12, 2015 at 11:36 pm

Mean/median inclusion probabilities are bad for a different reason: for that problem dependence amongst covariates is vital for determining inclusion (think about colinearity). I’m not sure this is the problem here.

I guess I was reaching towards a question that I hadn’t quite found: BMA is a tool for predictions, but predictions are (X is in a different hemisphere so I feel safe saying this) extremely useful for model choice. So I guess the question that’s interesting to me is “what needs to be added to BMA to use it for testing?”

My suggestion was that you could use it to partition model space between “candidate” and “other” models and see which is supported. This could be useful when you’re comparing multiple competing mechanisms, each of which has several possible models.

But that could be really dumb. I really really really know nothing about testing or model choice.

August 14, 2015 at 1:23 am

I think the problem here is that the right perspective on BMA isn’t prediction but rather expectation in general. The classic example is having one model concentrating on positive values and one on negative values, and trying to decide which way to turn by querying the expected utility. Do we want the average model (i.e. the average of the two component expectations), which says to go straight, or to randomly sample one of the models (and turn left or right)? Prediction is then just a special case of a predictive expectation (see Section 3.3.2 of http://arxiv.org/abs/1506.02273). And then this is intimately tied to any multimodal model (which you can decompose as a mixture of component models, one for each mode) — do you average over all modes or sample from one?
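The left/right example above can be sketched numerically (with assumed symmetric Gaussian components and equal posterior model probabilities):

```python
import numpy as np

rng = np.random.default_rng(7)

# Two equally probable models: one concentrates the "turn angle" near +1
# (turn right), the other near -1 (turn left)
draws_m1 = rng.normal(+1.0, 0.1, size=50_000)
draws_m2 = rng.normal(-1.0, 0.1, size=50_000)
weights = np.array([0.5, 0.5])   # posterior model probabilities

# Component expectations: each model on its own gives a decisive answer
e1, e2 = draws_m1.mean(), draws_m2.mean()

# Model-averaged expectation: "go straight" -- an action neither model supports
e_avg = weights[0] * e1 + weights[1] * e2
print(e1, e2, e_avg)   # ~ +1, ~ -1, ~ 0
```

The averaged expectation sits exactly between the two components, i.e. at a value to which neither model assigns any appreciable posterior mass.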

So do we want to think of BMA as a lossy process that loses all information about the individual model components and yields just a single marginalized model, or rather as a way of assigning relative probability to each model? Personally I think the latter is more useful scientifically.

August 12, 2015 at 10:02 am

PL13 sparked a whole weekly study group in our department, but confused the hell out of me for precisely the same reason you share here: θ (shared model parameter) means something different in each model, so averaging over everything else (models and each of their other parameters) doesn’t make sense.

Hoeting 1999 gave a far clearer definition in terms of some “quantity of interest” Δ, and I now consider BMA – marginalisation over nuisance models – to be elegant, mathematically sound, and a natural extension of Bayesian inference.

Back to PL13: if the parameter in question can be re-interpreted as a simple “quantity of interest” (e.g. the y-intercept parameter of a straight line, quadratic, etc., interpreted as “what value of y do we predict when x=0?”), then it works, I guess. But talking about BMA in the context of parameter posteriors restricts you to models in which that quantity is explicitly defined as a parameter.

August 12, 2015 at 3:06 pm

Good points, Madhura: I think model averaging only makes sense from a predictive perspective. As soon as you get interested in a parameter, its meaning changes from one model to the next and so, loosely speaking, it is not “the same”…

August 12, 2015 at 5:50 pm

That’s not always true. In some cases (such as the overall variance parameter*, or overall location parameters), it is often possible to parameterise the model in such a way that some parameters have the same meaning across models.

I would argue that if you have a parameter of interest and you’re comparing it across models, it needs to be “the same”. Otherwise I can’t see how you can draw any meaningful conclusions.

Actually (writing this sequentially), it may still be meaningful to consider BMA when not all models have the parameter of interest. There should be an interpretation in terms of “hypothesis testing” (does the span of the models with the meaningful parameters explain the data better than the span of the models without?). Although in this case it seems easier to do a hypothesis test against a non-parametric alternative instead of having to explicitly specify the space of alternative models…

*we did this in Sections 6 and 8 of the PC Prior paper (current version), http://arxiv.org/abs/1403.4630, but I also saw it for location parameters on a poster you (X) had at O’Bayes (with Kate Lee, maybe?). It’s probably a thing more generally.

August 14, 2015 at 1:27 am

But remember that you can always interpret the marginalized model as a mixture and then any expectation can be done by first taking the expectation of each mixture component and then averaging those expectations. From this perspective each component model’s parameters have meaning only in the inner expectation and one doesn’t really have to worry about the philosophical implications of “shared” parameters.*

* Of course technically you have to be really careful to ensure that the model sample spaces are compatible with each other in which case there _have_ to be bijections between the overlaps of the sample spaces. But you don’t have to _interpret_ those bijections.