Bayesian model averaging in astrophysics [guest post]
[Following my posting of a misfiled 2013 blog, Ewan Cameron told me of the impact of this paper in starting his own blog, and I asked him for a guest post, resulting in this analysis, much deeper than mine. No warning necessary this time!]
Back in February 2013, when “Bayesian Model Averaging in Astrophysics: A Review” by Parkinson & Liddle (hereafter PL13) first appeared on the arXiv, I was a keen, young(ish) postdoc eager to get stuck into debates about anything and everything ‘astro-statistical’. And with its seemingly glaring flaws, PL13 was more grist to the mill. However, despite my best efforts on various forums I couldn’t get a decent fight started over the right way to do Bayesian model averaging (BMA) in astronomy, so out of sheer frustration two months later I made my own soapbox to shout from at Another Astrostatistics Blog. Having seen PL13 reviewed recently here on Xian’s Og, it feels like the right time to revisit the subject and reflect on where BMA in astronomy is today.
As pointed out to me back in 2013 by Tom Loredo, the practice of Bayesian model averaging has been around much longer than its name; indeed an early astronomical example appears in Gregory & Loredo (1992), in which the posterior mean representation of an unknown signal is constructed for an astronomical “light-curve”, averaging over a set of constant and periodic candidate models. Nevertheless, the wider popularisation of model averaging in astronomy has taken place only recently, through a variety of applications in cosmology: e.g. Liddle, Mukherjee, Parkinson & Wang (2006) and Vardanyan, Trotta & Silk (2011).
In contrast to earlier studies like Gregory & Loredo (1992), or the classic review of BMA by Hoeting et al. (1999), in which the target of model averaging is typically either a utility function, a set of future observations, or a latent parameter of the observational process (e.g. the unknown “light-curve” shape) shared naturally by all competing models, the proposal of the cosmological BMA studies is to produce a model-averaged version of the posterior for a given ‘shared’ parameter: a so-called “model-averaged PDF”. This proposal didn’t sit well with me back in 2013, and it still doesn’t sit well with me today. Philosophically: without a model, a parameter has no meaning, so why should we want to seek meaning in the marginalised distribution of a parameter over an entire set of models? And practically: without knowing the model ‘label’ to which a given mass of model-averaged parameter probability belongs, there’s nothing much useful we can do with this ‘PDF’: nothing much we can say about the data we’ve just analysed and nothing much we can say about future experiments. Whereas the space of the observed data is shared automatically by all competing models, it seems to me somehow “un-Bayesian” to place the further restriction that the parameters of separate models share the same scale and topology. I say “un-Bayesian” since this mode of model averaging suggests a formulation of the parameter space + prior pairing stronger than the statement of one’s prior beliefs for the distribution of observable data given the model. But I would be happy to hear arguments from the other side in the comments box below … !
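For concreteness (my notation, not PL13’s), the “model-averaged PDF” for a parameter θ nominally shared across models M_1, …, M_K is just the mixture of the within-model posteriors weighted by the posterior model probabilities,

\[
p(\theta \mid d) \;=\; \sum_{k=1}^{K} p(\theta \mid d, M_k)\, p(M_k \mid d),
\qquad
p(M_k \mid d) \;=\; \frac{p(d \mid M_k)\, p(M_k)}{\sum_{j=1}^{K} p(d \mid M_j)\, p(M_j)},
\]

with p(d | M_k) the marginal likelihood (evidence) of model M_k; my complaint is not with this elementary arithmetic but with the interpretation of the left-hand side.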
A few ancillary points I noticed on re-reading PL13 that might give non-astronomers some insight into the world of astro-statistics:
“Importance sampling is the idea that you can generate your samples completely stochastically from some distribution.”
“For example, if there is some new data that will update constraints already in place from some previous experiment, you could simply recompute the new posterior (P) at points given by a previous MCMC chain (the values of q), importance sampling using the chain as the proposal distribution.”
(1) Although it proves a natural entry point to the field of Monte Carlo approximations for probabilistic integrals, importance sampling is still not a widely known technique in astronomy, and where it does appear it is often used somewhat unconventionally, as suggested by the quote above. In fact the reweighting of draws from an MCMC chain to approximate functionals of a (hopefully nearby) alternative target distribution (a pseudo-pseudo-marginal method?) is by far the most common scheme passing under the banner of importance sampling in the astronomical literature (e.g. Hogg, Myers & Bovy 2010; Asensio Ramos & Martínez González 2014; Green et al. 2015). As noted by PL13, population Monte Carlo algorithms featuring importance sampling moves have nevertheless been well promoted for cosmological model selection purposes (e.g. Kilbinger et al. 2010), though particle filtering techniques in general have not yet become popular, despite the inherent temporal and/or spatial indexing of noisy latent variables in many common astronomical problems.
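For readers unfamiliar with this usage, here is a minimal sketch of the chain-reweighting scheme described in the quote above; the functions log_post_old and log_post_new are hypothetical stand-ins for the (unnormalised) log posteriors before and after the new data arrive.

```python
import numpy as np

def importance_reweight(chain, log_post_old, log_post_new):
    """Self-normalised importance weights for draws from log_post_old,
    reused to approximate expectations under the nearby target log_post_new."""
    log_w = np.array([log_post_new(x) - log_post_old(x) for x in chain])
    log_w -= log_w.max()            # guard against overflow before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

def reweighted_mean_and_ess(chain, w):
    """Posterior mean under the new target, plus the Kish effective sample
    size: a small ESS warns that the two targets are not actually 'nearby'."""
    mean = (w[:, None] * np.asarray(chain)).sum(axis=0)
    ess = 1.0 / np.sum(w ** 2)
    return mean, ess
```

The obvious caveat is that the reliability of the scheme rests entirely on how well the old chain covers the new posterior.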
“Nested sampling is a Monte Carlo method that recasts the multi-dimensional evidence integral as a one-dimensional integral in terms of the prior mass …”
(2) Nested sampling (Skilling 2006) is easily the most popular algorithm for estimating marginal likelihoods in astronomy and cosmology. While perhaps over-used in this field (i.e., used in cases where simple importance sampling or Gauss-Hermite quadrature would suffice) and ‘fetishised’ for its (not unique) ability to ‘simultaneously compute the evidence and posterior’, nested sampling is under-used and under-appreciated in mainstream statistics (though cf. Chopin & Robert 2009). The nested sampling identity (namely, the observation that the expectation of a non-negative random variable equals the integral of its survival function) is often encountered in introductions to measure-theoretic probability (it appears in both Billingsley and Feller) and in proofs using the Prohorov metric; likewise, the algorithmic ‘trick’ of nested sampling (the assignment of the prior mass truncated at the lowest likelihood amongst the current set of ‘live points’) is a clever use of familiar order statistics. Extensions to the basic nested sampling algorithm include diffusive nested sampling (Brewer, Partay & Csanyi 2010) and ‘PolyChord’ (i.e. nested sampling with slice sampling; Handley, Hobson & Lasenby 2015). And while astronomers can’t help but insist on the prior for nested sampling problems being cast as a uniform hypercube (e.g. PL13; Allison & Dunkley 2014; Buchner 2014), there is no such restriction intrinsic to the algorithm, which holds irrespective of the topology of the prior; one could imagine applications to a variety of semi-parametric Bayesian models as well.
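To make the prior-mass book-keeping concrete, here is a toy sketch of the basic nested sampling loop under the usual (but, as argued above, not essential) unit-hypercube prior; the Gaussian likelihood is illustrative only, and the naive rejection step used to replace the worst live point is exactly what the extensions mentioned above work hard to avoid.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative likelihood: an isotropic Gaussian centred in the unit square.
def log_like(theta, sigma=0.1):
    return -0.5 * np.sum(((theta - 0.5) / sigma) ** 2)

n_live, n_iter, dim = 100, 800, 2
live = rng.uniform(size=(n_live, dim))                 # live points drawn from the prior
live_logL = np.array([log_like(t) for t in live])

log_X_prev, log_Z = 0.0, -np.inf                       # log prior mass and log evidence
for i in range(1, n_iter + 1):
    worst = np.argmin(live_logL)                       # lowest-likelihood live point
    log_X = -i / n_live                                # expected shrinkage from order statistics
    log_w = np.log(np.exp(log_X_prev) - np.exp(log_X)) # width of the prior-mass shell
    log_Z = np.logaddexp(log_Z, log_w + live_logL[worst])
    log_X_prev = log_X
    # Replace the worst point with a prior draw above the current likelihood
    # threshold (brute-force rejection: fine for a toy, hopeless in general).
    while True:
        candidate = rng.uniform(size=dim)
        if log_like(candidate) > live_logL[worst]:
            live[worst], live_logL[worst] = candidate, log_like(candidate)
            break

# Remaining contribution from the final set of live points.
log_Z = np.logaddexp(log_Z, log_X_prev + np.log(np.mean(np.exp(live_logL))))
print("nested sampling log-evidence:", log_Z)
```

As a sanity check, with this likelihood and prior the evidence is essentially 2πσ², i.e. log Z ≈ -2.77, which the toy run recovers to within its Monte Carlo error.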
“Similarly, we could use the chain to estimate the evidence, … Eqn 6”
(3) Aside from the silliness of Eqn 6 (why write down this expression in the first place, even if it is acknowledged as neither unbiased nor convergent?), the above quote does remind me how much enthusiasm remains in the astronomical community for approximations to the marginal likelihood, or to the Bayes factor, requiring only samples from the posterior itself. These range from the ridiculous (e.g. the so-called ‘stabilised’ harmonic mean estimator of Tuomi & Jones 2012) to the innovative (e.g. the Numerical Lebesgue Algorithm of Weinberg 2012, though I still find this approach problematic); and in all cases they are applied in a slightly idiosyncratic manner (e.g. when the Savage-Dickey Density Ratio is computed using a parametric approximation to the PDF of the larger model à la Trotta 2008 rather than through a more general estimator like that of Marin & Robert 2010). (Surprisingly, though, the technique suggested by Perrakis, Ntzoufras & Tsionas 2013 has yet to be embraced.) On the other hand, there remains a great fear of thermodynamic integration and path sampling methods, following the poor experience recounted by Beltran et al. 2005 and the (not entirely unwarranted but, to my mind, hyperbolic) discussion of the ‘thermodynamic catastrophe’ in Skilling (2006).
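For non-astronomers wondering what a ‘posterior samples only’ evidence estimator looks like, the archetype of the family (I make no claim that it is exactly PL13’s Eqn 6) is the classic harmonic mean estimator, sketched below; its sole input is the log-likelihood evaluated at the posterior draws, which explains its enduring appeal, and its notoriously heavy-tailed (often infinite-variance) behaviour explains the steady stream of ‘stabilised’ variants.

```python
import numpy as np
from scipy.special import logsumexp

def log_evidence_harmonic_mean(log_likes):
    """Classic harmonic mean estimator of the log-evidence,
        Z_hat = [ (1/n) * sum_i 1 / L(theta_i) ]^(-1),  theta_i ~ posterior,
    computed in log space. Dangerously simple: the largest terms in the sum
    come from the *least* likely posterior draws, so a few stray samples in
    the tails can swing the estimate by orders of magnitude."""
    log_likes = np.asarray(log_likes, dtype=float)
    return -(logsumexp(-log_likes) - np.log(log_likes.size))
```

One call on the log-likelihood column of an existing chain is all it takes, which is precisely why estimators of this kind refuse to die.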