A short announcement that the slides of almost all talks at the CRiSM workshop on estimating constants last April 20-22 are now available. Enjoy (and discuss)!
Archive for evidence
The schedule for the CRiSM workshop on estimating constants that Nial Friel, Helen Ogden and myself host next April 20-22 at the University of Warwick is now set as follows. (The plain registration fees are £40 and accommodation on the campus is available through the online form.)
April 20, 2016
11:45 — 12:30: Adam Johansen
12:30 — 14:00: Lunch
14:00 — 14:45: Anne-Marie Lyne
14:45 — 15:30: Pierre Jacob
15:30 — 16:00: Break
16:00 — 16:45: Roberto Trotta
17:00 — 18:00: ‘Elevator’ talks
18:00 — 20:00: Poster session, Cheese and wine
April 21, 2016
9:00 — 9:45: Michael Betancourt
9:45 — 10:30: Nicolas Chopin
10:30 — 11:00: Coffee break
11:00 — 11:45: Merrilee Hurn
11:45 — 12:30: Jean-Michel Marin
12:30 — 14:00: Lunch
14:00 — 14:45: Sumit Mukherjee
14:45 — 15:30: Yves Atchadé
15:30 — 16:00: Break
16:00 — 16:45: Michael Gutmann
16:45 — 17:30: Panayiota Touloupou
19:00 — 22:00: Dinner
April 22, 2016
9:00 — 9:45: Chris Sherlock
9:45 — 10:30: Christophe Andrieu
10:30 — 11:00: Coffee break
11:00 — 11:45: Antonietta Mira
“If no information is available, π(α|M) must not deliver information about α.”
In a recent arXival apparently submitted to Bayesian Analysis, Giovanni Mana and Carlo Palmisano discuss the choice of priors in metrology. Which reminded me of this meeting I attended at the Bureau des Poids et Mesures in Sèvres where similar debates took place, albeit being led by ferocious anti-Bayesians! Their reference prior appears to be the Jeffreys prior, because of its reparameterisation invariance.
“The relevance of the Jeffreys rule in metrology and in expressing uncertainties in measurements resides in the metric invariance.”
This, along with a second order approximation to the Kullback-Leibler divergence, is indeed one reason for advocating the use of a Jeffreys prior. I at first found it surprising that the (usually improper) prior is used in a marginal likelihood, as it cannot be normalised. A source of much debate [and of our alternative proposal].
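For a scalar parameter, the reparameterisation invariance invoked above takes one line to check. The following is a standard textbook derivation rather than anything taken from the paper; the symbols f, I, h and φ are notation introduced here for illustration.

```latex
% Jeffreys prior for a scalar parameter \theta with sampling density f(x \mid \theta)
\pi_J(\theta) \propto \sqrt{I(\theta)},
\qquad
I(\theta) = \mathbb{E}_\theta\!\left[\left(\frac{\partial \log f(x\mid\theta)}{\partial\theta}\right)^{2}\right].

% For a one-to-one differentiable change of parameter \varphi = h(\theta),
% the chain rule applied to the score gives
I(\varphi) = I(\theta)\left(\frac{\mathrm{d}\theta}{\mathrm{d}\varphi}\right)^{2}
\quad\Longrightarrow\quad
\pi_J(\varphi) \propto \sqrt{I(\varphi)}
= \sqrt{I(\theta)}\,\left|\frac{\mathrm{d}\theta}{\mathrm{d}\varphi}\right|
\propto \pi_J(\theta)\,\left|\frac{\mathrm{d}\theta}{\mathrm{d}\varphi}\right|.
```

The right-hand side is exactly the change-of-variable formula for a density, so the rule returns the same prior whichever parameterisation one starts from, with no regard for whether the resulting measure is proper.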
“To make a meaningful posterior distribution and uncertainty assessment, the prior density must be covariant; that is, the prior distributions of different parameterizations must be obtained by transformations of variables. Furthermore, it is necessary that the prior densities are proper.”
The above quote is quite interesting both in that the notion of covariant is used rather than invariant or equivariant. And in that properness is indicated as a requirement. (Even more surprising is the noun associated with covariant, since it clashes with the usual notion of covariance!) They conclude that the marginal associated with an improper prior is null because the normalising constant of the prior is infinite.
“…the posterior probability of a selected model must not be null; therefore, improper priors are not allowed.”
Maybe not so surprisingly given this stance on improper priors, the authors cover a collection of “paradoxes” in their final and longest section, most of which make little sense to me. First, they point out that the reference priors of Berger, Bernardo and Sun (2015) are not invariant, but this should not come as a surprise given that they focus on parameters of interest versus nuisance parameters. The second issue pointed out by the authors is that under Jeffreys’ prior, the posterior distribution of a given normal mean for n observations is a t with n degrees of freedom while it is a t with n-1 degrees of freedom from a frequentist perspective. This is not such a paradox since the two distributions work in different spaces. Further, unless I am confused, this is one of the marginalisation paradoxes, whose more straightforward explanation is that marginalisation is not meaningful for improper priors. A third paradox relates to a contingency table with a large number of cells, in that the posterior mean of a cell probability goes to zero as the number of cells goes to infinity. (In this case, Jeffreys’ prior is proper.) Again not much of a bummer, there is simply not enough information in the data when faced with an infinite number of parameters. Paradox #4 is the Stein paradox, when estimating the squared norm of a normal mean. Jeffreys’ prior then leads to a constant bias that increases with the dimension of the vector. Definitely a bad point for Jeffreys’ prior, except that there is no Bayes estimator in such a case, the Bayes risk being infinite. Using a renormalised loss function solves the issue, rather than introducing as in the paper uniform priors on intervals, which require hyperpriors without being particularly compelling. The fifth paradox is the Neyman-Scott problem, with again the Jeffreys prior the culprit since the estimator of the variance is inconsistent. By a multiplicative factor of 2. Another stone in Jeffreys’ garden [of forking paths!].
The authors consider that the prior gives zero weight to any interval not containing zero, as if it were a proper probability distribution. And “solve” the problem by avoiding zero altogether, which of course requires specifying a lower bound on the variance. And then introducing another (improper) Jeffreys prior on that bound… The last and final paradox mentioned in this paper is one of the marginalisation paradoxes, with a bizarre explanation that since the mean and variance μ and σ are not independent a posteriori, “the information delivered by x̄ should not be neglected”.
Minh-Ngoc Tran and Robert Kohn have devised an “exact” ABC algorithm. They claim therein to remove the error due to the non-zero tolerance by using an unbiased estimator of the likelihood. Most interestingly, they start from the debiasing technique of Rhee and Glynn [also at the basis of the Russian roulette]. Which sums up as using a telescoping formula on a sequence of converging biased estimates. And cutting the infinite sum with a stopping rule.
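The telescoping-plus-stopping-rule idea can be sketched on a toy sequence. This is only a minimal illustration of the Rhee and Glynn device, not the algorithm of Tran and Kohn: the increments below are deterministic (partial sums of the series for e) whereas in ABC they would be random likelihood estimates at decreasing tolerances, and the names `geometric_truncation`, `debiased_estimate` and `increment` are hypothetical.

```python
import math
import random


def geometric_truncation(p, rng):
    """Draw a truncation level N >= 0 with P(N >= k) = (1 - p) ** k."""
    n = 0
    while rng.random() > p:  # continue with probability 1 - p
        n += 1
    return n


def debiased_estimate(increment, p, rng):
    """Randomly truncated telescoping sum: sum_{k<=N} Delta_k / P(N >= k).

    Its expectation is sum_k Delta_k, i.e. the limit of the biased sequence,
    since each term is kept with probability P(N >= k).
    """
    n = geometric_truncation(p, rng)
    return sum(increment(k) / (1.0 - p) ** k for k in range(n + 1))


def increment(k):
    # Toy assumption: Y_k = sum_{j<=k} 1/j! converges to e,
    # so Delta_k = Y_k - Y_{k-1} = 1/k!.
    return 1.0 / math.factorial(k)


rng = random.Random(1)
draws = [debiased_estimate(increment, 0.5, rng) for _ in range(20000)]
estimate = sum(draws) / len(draws)
print(estimate, math.e)
```

Each draw only evaluates a finite (random) number of terms, yet the average targets the limit of the sequence. The price is the variance, which is finite only when the increments shrink faster than the tail probabilities of the stopping rule, which is where the calibration issue arises.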
“Our article proposes an ABC algorithm to estimate [the observed likelihood] that completely removes the error due to [the ABC] approximation…”
The sequence of biased but converging approximations is associated with a sequence of decreasing tolerances. The corresponding sequence of weights that determines the truncation in the series is connected to the decrease in the bias in an implicit manner for all realistic settings. Theorem 1 does, however, produce conditions on the ABC kernel and the sequence of tolerances and pseudo-sample sizes that guarantee unbiasedness and finite variance of the likelihood estimate. For a geometric stopping rule with rejection probability p, both tolerance and pseudo-sample size decrease as a power of p. As a side product the method also returns an unbiased estimate of the evidence. The overall difficulty I have with the approach is the dependence on the stopping rule and its calibration, and the resulting impact on the computing time of the likelihood estimate. When this estimate is used in a pseudo-marginal scheme à la Andrieu and Roberts (2009), I fear this requires new pseudo-samples at each iteration of the Metropolis-Hastings algorithm, which then becomes prohibitively expensive. Later today, Mark Girolami pointed out to me that Anne-Marie Lyne [one of the authors of the Russian roulette paper] also considered this exact approach in her thesis and concluded that it leads to an infinite computing time.
The registration for the CRiSM workshop on estimating constants that Nial Friel, Helen Ogden and myself host next April 20-22 at the University of Warwick is now open. The plain registration fees are £40 and accommodation on the campus is available through the same form.
Since, besides the invited talks, the workshop will host two poster sessions with speed (2-5 min) oral presentations, we encourage all interested researchers to submit a poster via the appropriate form. Once again, this should be an exciting two-day workshop, given the on-going activity in this area.
In this review paper, now published in Statistical Analysis and Data Mining 6, 3 (2013), David Parkinson and Andrew R. Liddle go over the (Bayesian) model selection and model averaging perspectives. Their argument in favour of model averaging is that model selection via Bayes factors may simply be too inconclusive to favour one model and only one model. While this is a correct perspective, this is about it for the theoretical background provided therein. The authors then move to the computational aspects and the first difficulty is their approximation (6) to the evidence, where they average the likelihood x prior terms over simulations from the posterior, which does not provide a valid (either unbiased or converging) approximation. They surprisingly fail to account for the huge statistical literature on evidence and Bayes factor approximation, including Chen, Shao and Ibrahim (2000), which covers earlier developments like bridge sampling (Gelman and Meng, 1998).
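A toy check of why averaging likelihood x prior over posterior draws cannot work, taking the description above literally. The conjugate model (one observation x ~ N(θ,1), prior θ ~ N(0,1)) and all names are assumptions introduced here for illustration, chosen so that the evidence N(x; 0, 2) is known in closed form.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


x_obs = 1.0
true_evidence = normal_pdf(x_obs, 0.0, 2.0)  # marginal of x under the toy model

rng = random.Random(0)
n = 50000

# Average of likelihood x prior over exact posterior draws, theta | x ~ N(x/2, 1/2).
post_draws = [rng.gauss(x_obs / 2, math.sqrt(0.5)) for _ in range(n)]
posterior_average = sum(
    normal_pdf(x_obs, t, 1.0) * normal_pdf(t, 0.0, 1.0) for t in post_draws
) / n

# A valid (if often inefficient) alternative: average the likelihood over prior draws.
prior_draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
prior_average = sum(normal_pdf(x_obs, t, 1.0) for t in prior_draws) / n

print(true_evidence, posterior_average, prior_average)
```

In this toy model the posterior average converges to the evidence times the integral of the squared posterior density, about 0.40 times the evidence here, so increasing the number of simulations does not help. The prior average does converge to the right value.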
As is often the case in astrophysics, at least since 2007, the authors’ description of nested sampling drifts away from perceiving it as a regular Monte Carlo technique, with the same convergence speed √n as other Monte Carlo techniques and the same dependence on dimension. It is certainly not the only simulation method where the produced “samples, as well as contributing to the evidence integral, can also be used as posterior samples.” The authors then move to “population Monte Carlo [which] is an adaptive form of importance sampling designed to give a good estimate of the evidence”, a particularly restrictive description of a generic adaptive importance sampling method (Cappé et al., 2004). The approximation of the evidence (9) based on PMC also seems invalid, as it is missing the prior in the numerator. (The switch from θ in Section 3.1 to X in Section 3.4 is confusing.) Further, the sentence “PMC gives an unbiased estimator of the evidence in a very small number of such iterations” is misleading in that PMC is unbiased at each iteration. Reversible jump is not described at all (the supposedly higher efficiency of this algorithm is far from guaranteed when facing a small number of models, which is the case here, since the moves between models are governed by a random walk and the acceptance probabilities can be quite low).
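To see what the missing prior does, here is a minimal single-step importance sampling sketch (no adaptation, hence not PMC proper) on an assumed toy model: one observation x ~ N(θ,1), prior θ ~ N(0,1), Gaussian proposal q. The model, the proposal and all names are illustrative assumptions, not the paper's setting.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


x_obs = 1.0
true_evidence = normal_pdf(x_obs, 0.0, 2.0)  # closed form for the toy model

rng = random.Random(42)
n = 50000
q_mean, q_var = 0.5, 4.0  # proposal wider than the posterior N(1/2, 1/2)
draws = [rng.gauss(q_mean, math.sqrt(q_var)) for _ in range(n)]

# Importance sampling estimate of the evidence: likelihood x prior / proposal.
with_prior = sum(
    normal_pdf(x_obs, t, 1.0) * normal_pdf(t, 0.0, 1.0) / normal_pdf(t, q_mean, q_var)
    for t in draws
) / n

# Same average with the prior dropped from the numerator.
without_prior = sum(
    normal_pdf(x_obs, t, 1.0) / normal_pdf(t, q_mean, q_var) for t in draws
) / n

print(true_evidence, with_prior, without_prior)
```

With the prior in the numerator the average is unbiased for the evidence for any sample size and any proposal covering the posterior, which is why each PMC iteration is unbiased on its own. Without it, the average estimates the integral of the likelihood in θ, which equals one in this toy model whatever the data.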
The second, quite unrelated, part of the paper covers published applications in astrophysics. Unrelated because the three different methods exposed in the first part are not compared on the same dataset. Model averaging is obviously based on a computational device that explores the posteriors of the different models under comparison (or, rather, averaging); however, no recommendation is found in the paper on how to implement the averaging efficiently, or anything of the kind. In conclusion, I thus find this review somewhat anticlimactic.
As I have a huge arXiv backlog and an even higher non-arXiv backlog, I cannot be certain I will find time to comment on those three recent and quite exciting postings connecting ABC with astro- and cosmo-statistics [thanks to Ewan for pointing out those to me!]:
- Weighted ABC: a new strategy for cluster strong lensing cosmology with simulations, by Madhura Killedar et al. [Madhura won one of the three prizes at the BAYESM meeting last year]:
“We investigate the uncertainty in the calculated likelihood, and consequential ability to compare competing cosmologies…”
- Inflation, evidence and falsifiability, by Giulia Gubitosi et al.:
“By considering toy models we illustrate how unfalsifiable models and paradigms are always favoured by the Bayes factor…”
- Bayesian model selection without evidences: application to the dark energy equation-of-state, by Sonke Hee et al.:
“A method is presented for Bayesian model selection without explicitly computing evidences … without the need for reversible jump MCMC techniques.”