Archive for evidence

how to translate evidence into French?

Posted in Books, Statistics, University life on July 6, 2014 by xi'an

A few minutes ago, I got this email from Gauvain, who is writing a PhD in philosophy of science:

The author of the text I have to translate describes Bayes factors as a “Bayesian measure of evidence” and p-value tests as a “frequentist measure of evidence”. I was wondering whether there is a recognised and established French translation for this expression, “measure of evidence”. I have sometimes come across “mesure d’évidence”, which looks very much like an anglicism, and sometimes “estimateur de preuve”, which seems to me likely to cause confusion with other uses of the term “estimateur”.

which (pardon my French!) asks how to translate the term evidence into French. It would sound natural that the French évidence is the answer, but this is not the case. Despite sharing the same Latin root (evidentia), since the English version comes from medieval French, the two words have different meanings: in English, it means a collection of facts coming to support an assumption or a theory, while in French it means something obvious, whose truth is immediately perceived. Surprisingly, English kept the adjective evident with the same [obvious] meaning as the French évident. But the noun moved towards a much less definitive meaning, both in Law and in Science. I had never thought of the huge gap between the two meanings but must have been surprised at its use the first time I heard it in English. But I no longer think about it, as when I reviewed Sober’s Evidence and Evolution.

One may wonder about the best possible translation of evidence into French, even though marginal likelihood (vraisemblance marginale) is just fine for statistical purposes. I would suggest faisceau de présomptions, degré de soutien, or yet intensité de soupçon as (lengthy) solutions. Soupçon could work as such, but has a fairly negative ring…

Split Sampling: expectations, normalisation and rare events

Posted in Books, Statistics, University life on January 27, 2014 by xi'an

Just before Christmas (a year ago), John Birge, Changgee Chang, and Nick Polson arXived a paper with the above title. Split sampling is presented as a tool conceived to handle rare event probabilities, written in this paper as

Z(m)=\mathbb{E}_\pi[\mathbb{I}\{L(X)>m\}]

where π is the prior and L the likelihood, m being a large enough bound to make the probability small. However, given John Skilling’s representation of the marginal likelihood as the integral of the Z(m)’s, this simulation technique also applies to the approximation of the evidence. The paper refers from the start to nested sampling as a motivation for this method, presumably not as a way to run nested sampling, which was created as a tool for evidence evaluation, but as a competitor. Nested sampling may indeed face difficulties in handling the coverage of the higher likelihood regions under the prior, and it is an approximate method, as we detailed in our earlier paper with Nicolas Chopin. The difference between nested and split sampling is that split sampling adds a distribution ω(m) on the likelihood levels m. If pairs (x,m) can be efficiently generated by MCMC for the target

\pi(x)\omega(m)\mathbb{I}\{L(x)>m\},

the marginal density of m can then be approximated by Rao-Blackwellisation, from which the authors derive an estimate of Z(m), since the marginal is actually proportional to ω(m)Z(m). (Because of the Rao-Blackwell argument, I wonder how much this differs from Chib’s 1995 method, i.e. if the split sampling estimator could be expressed as a special case of Chib’s estimator.) The resulting estimator of the marginal also requires a choice of ω(m) such that the associated cdf can be computed analytically. More generally, the choice of ω(m) impacts the quality of the approximation since it determines how often and how easily high likelihood regions will be hit. Note also that the conditional π(x|m) is the same as in nested sampling, hence may run into difficulties for complex likelihoods or large datasets.
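
As a rough illustration of the Rao-Blackwell step, here is a minimal Python sketch on a toy problem that is entirely an assumption of this example and not taken from the paper: a U(0,1) prior, a likelihood L(x) = exp(-lam·x), and a uniform ω on (0,1), so that the cdf of ω is available in closed form. The names lik, z_exact, marginal_m and z_hat are hypothetical. A two-step Gibbs sampler targets π(x)ω(m)I{L(x)>m}, and the estimated marginal density of m is rescaled by its value at m = 0, where Z = 1, to recover Z(m).

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0  # assumption: toy likelihood L(x) = exp(-lam * x), prior pi = U(0, 1)

def lik(x):
    return np.exp(-lam * x)

def z_exact(m):
    # Z(m) = P_pi(L(X) > m) = P(X < -log(m) / lam), clipped to [0, 1]
    return float(np.clip(-np.log(m) / lam, 0.0, 1.0))

# Gibbs sampler on pi(x) * omega(m) * I{L(x) > m}, with omega = U(0, 1)
T = 20_000
xs = np.empty(T)
x = 0.5
for t in range(T):
    m = rng.uniform(0.0, lik(x))        # m | x: omega restricted to (0, L(x))
    upper = min(1.0, -np.log(m) / lam)  # L(x) > m  <=>  x < -log(m) / lam
    x = rng.uniform(0.0, upper)         # x | m: prior restricted to {L(x) > m}
    xs[t] = x

def marginal_m(m0):
    # Rao-Blackwell estimate of the marginal density of m at m0: average of
    # the conditional densities omega(m0) I{m0 < L(x)} / Omega(L(x)), where
    # Omega is the cdf of omega (here Omega(l) = l because L <= 1)
    L = lik(xs)
    return np.mean((m0 < L) / L)

def z_hat(m0):
    # the marginal is proportional to omega(m) Z(m); dividing by its value
    # at m = 0, where Z = 1, removes the unknown constant
    return marginal_m(m0) / marginal_m(0.0)

for m0 in (0.2, 0.5, 0.8):
    print(m0, round(z_hat(m0), 3), round(z_exact(m0), 3))
```

The division by Omega(L(x)) in marginal_m is where the analytic cdf of ω is needed, which is the constraint on the choice of ω(m) mentioned above.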

When reading the beginning of the paper, the remark that “the chain will visit each level roughly uniformly” (p.13) made me wonder at a possible correspondence with the Wang-Landau estimator, until I read the reference to Jacob and Ryder (2012) on page 16. Once again, I wonder at a stronger link between both papers since the Wang-Landau approach aims at optimising the exploration of the simulation space towards a flat histogram. See for instance Figure 2.

The following part of the paper draws a comparison with both nested sampling and the product estimator of Fishman (1994). I do not fully understand the consequences of the equivalence between those estimators and the split sampling estimator for specific choices of the weight function ω(m). Indeed, it seemed to me that the main point was to draw from a joint density on (x,m) to avoid the difficulties of exploring each level set separately, and also to avoid the approximation issues of nested sampling. As a side remark, the fact that the harmonic mean estimator occurs at several points of the paper makes me worried. The qualification of “poor Monte Carlo error variances properties” is an understatement for the harmonic mean estimator, as it generally has infinite variance and hence should not be used at all, even as a starting point. The paper does not elaborate much about the cross-entropy method, despite using an example from Rubinstein and Kroese (2004).
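
The instability of the harmonic mean estimator can be seen on a small conjugate example. The sketch below is an assumption of this illustration, not an example from the paper: a single observation x ~ N(θ,1) with a N(0,τ²) prior and τ² = 10, for which the evidence is known in closed form and the second moment of 1/L under the posterior is infinite (this happens here as soon as τ² ≥ 1). The names harmonic_mean and prior_average are hypothetical. It compares repeated harmonic mean estimates with the plain average of the likelihood over prior draws.

```python
import numpy as np

rng = np.random.default_rng(2)
x_obs, tau2 = 1.0, 10.0  # assumption: x ~ N(theta, 1), theta ~ N(0, tau2)

def norm_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# exact evidence: marginally, x ~ N(0, 1 + tau2)
evidence = norm_pdf(x_obs, 0.0, 1.0 + tau2)

# conjugate posterior: theta | x ~ N(post_mean, post_var)
post_var = 1.0 / (1.0 + 1.0 / tau2)
post_mean = post_var * x_obs

def harmonic_mean(T):
    # harmonic mean of the likelihood over exact posterior draws
    theta = rng.normal(post_mean, np.sqrt(post_var), T)
    return 1.0 / np.mean(1.0 / norm_pdf(x_obs, theta, 1.0))

def prior_average(T):
    # arithmetic mean of the likelihood over prior draws (bounded integrand)
    theta = rng.normal(0.0, np.sqrt(tau2), T)
    return np.mean(norm_pdf(x_obs, theta, 1.0))

hm = np.array([harmonic_mean(10_000) for _ in range(200)])
pa = np.array([prior_average(10_000) for _ in range(200)])

print("exact evidence:", evidence)
print("harmonic mean, 5/50/95% quantiles:", np.percentile(hm, [5, 50, 95]))
print("prior average, 5/50/95% quantiles:", np.percentile(pa, [5, 50, 95]))
```

The prior average is itself a crude estimator, used here only as a finite-variance reference point for the spread of the harmonic mean replicates.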

In conclusion, an interesting paper that made me think anew about the nested sampling approach, which keeps its fascination over the years! I will most likely use it to build an MSc thesis project this summer in Warwick.

Statistical evidence for revised standards

Posted in Statistics, University life on December 30, 2013 by xi'an

In yet another permutation of the original title (!), Andrew Gelman posted the answer Val Johnson sent him after our (submitted) letter to PNAS. As Val did not send me a copy (although Andrew did!), I will not reproduce it here and would rather refer the interested readers to Andrew’s blog… In addition to Andrew’s (sensible) points, here are a few idle (post-X’mas and pre-skiing) reflections:

  • “evidence against a false null hypothesis accrues exponentially fast” makes me wonder in which metric this exponential rate (in γ?) occurs;
  • that “most decision-theoretic analyses of the optimal threshold to use for declaring a significant finding would lead to evidence thresholds that are substantially greater than 5 (and probably also greater 25)” is difficult to accept as an argument since there is no trace of a decision-theoretic argument in the whole paper;
  • Val rejects our minimaxity argument on the basis that “[UMPBTs] do not involve minimization of maximum loss” but the prior that corresponds to those tests is minimising the integrated probability of not rejecting at threshold level γ, a loss function integrated against parameter and observation, a Bayes risk in other words… Point masses or spike priors are clearly characteristics of minimax priors. Furthermore, the additional argument that “in most applications, however, a unique loss function/prior distribution combination does not exist” has been used by many to refute the Bayesian perspective and makes me wonder what arguments are left for using a (pseudo-)Bayesian approach;
  • the next paragraph is pure tautology: the fact that “no other test, based on either a subjectively or objectively specified alternative hypothesis, is as likely to produce a Bayes factor that exceeds the specified evidence threshold” is a paraphrase of the definition of UMPBTs, not an argument. I do not see why we should solely “worry about false negatives”, since minimising those should lead to a point mass on the null (or, more seriously, should not lead to the minimax-like selection of the prior under the alternative).

Importance sampling schemes for evidence approximation in mixture models

Posted in R, Statistics, University life on November 27, 2013 by xi'an

[Figure: boxplot comparison of the marginal likelihood estimators]

Jeong Eun (Kate) Lee and I completed this paper, “Importance sampling schemes for evidence approximation in mixture models”, now posted on arXiv. (With the customary one-day lag for posting, making me bemoan the days of yore when arXiv would give a definitive arXiv number at the time of submission.) Kate came twice to Paris in the past years to work with me on this evaluation of Chib’s original marginal likelihood estimate (also called the candidate formula by Julian Besag), and on the improvement proposed by Berkhof, van Mechelen, and Gelman (2003), based on averaging over all permutations, an idea that we rediscovered in an earlier paper with Jean-Michel Marin. (And that Andrew seemed to have completely forgotten. Despite being the very first one to publish [in English] a paper on a Gibbs sampler for mixtures.) Given that this averaging can get quite costly, we propose a preliminary step, called dual importance sampling, that reduces the number of relevant permutations to be considered in the averaging by removing far-away modes that do not contribute to the Rao-Blackwell estimate. We also considered modelling the posterior as a product of k-component mixtures on the components, following a vague idea I had in the back of my mind for many years, but it did not help. In the above boxplot comparison of estimators, the marginal likelihood estimators are

  1. Chib’s method using T = 5000 samples with a permutation correction by multiplying by k!.
  2. Chib’s method (1), using T = 5000 samples which are randomly permuted.
  3. Importance sampling estimate (7), using the maximum likelihood estimate (MLE) of the latents as centre.
  4. Dual importance sampling using q in (8).
  5. Dual importance sampling using an approximation in (14).
  6. Bridge sampling (3). Here, label switching is imposed in hyperparameters.
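
To make the k! correction and the averaging over all permutations concrete, here is a minimal Python sketch on a toy two-component normal mixture. Everything specific in it is an assumption of this illustration rather than the setting of the paper: known equal weights, unit variances, a N(0, tau2) prior on each mean, simulated data, and the hypothetical helper names log_lik, cond_params and log_post. With only the means unknown, the Rao-Blackwell estimate of the posterior density at a point mu_star is an average of normal conditional densities given the allocations, and a grid integration gives a reference value for the evidence.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
k, tau2 = 2, 25.0  # assumptions: weights 1/k, unit variances, N(0, tau2) priors
y = np.concatenate([rng.normal(-2.0, 1.0, 15), rng.normal(2.0, 1.0, 15)])

def log_norm(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_lik(mu):
    comp = log_norm(y[:, None], mu[None, :], 1.0) - np.log(k)
    return np.logaddexp.reduce(comp, axis=1).sum()

def cond_params(z):
    # mu_j | y, z is normal by conjugacy; return the k means and variances
    counts = np.bincount(z, minlength=k)
    sums = np.bincount(z, weights=y, minlength=k)
    var = 1.0 / (counts + 1.0 / tau2)
    return var * sums, var

# Gibbs sampler alternating between allocations z and means mu
T, burn = 3000, 500
mu, params, mus = np.array([-1.0, 1.0]), [], []
for t in range(T + burn):
    logp = log_norm(y[:, None], mu[None, :], 1.0)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    cdf = np.cumsum(p / p.sum(axis=1, keepdims=True), axis=1)
    z = np.minimum((rng.random(len(y))[:, None] > cdf).sum(axis=1), k - 1)
    mean, var = cond_params(z)
    mu = rng.normal(mean, np.sqrt(var))
    if t >= burn:
        params.append((mean, var))
        mus.append(mu)

mu_star = np.mean(mus, axis=0)  # a point of high posterior density

def log_post(perms):
    # Rao-Blackwell estimate of log pi(mu_star | y), averaged over perms
    terms = [log_norm(mu_star[list(s)], mean, var).sum()
             for mean, var in params for s in perms]
    return np.logaddexp.reduce(terms) - np.log(len(terms))

log_joint = log_lik(mu_star) + log_norm(mu_star, 0.0, tau2).sum()
log_m_chib = log_joint - log_post([tuple(range(k))])        # no correction
log_m_kfact = log_m_chib + math.log(math.factorial(k))      # k! correction
log_m_perm = log_joint - log_post(list(itertools.permutations(range(k))))

# reference value by grid integration over (mu_1, mu_2), valid for k = 2 only
grid = np.linspace(-8.0, 8.0, 321)
c = log_norm(y[:, None], grid[None, :], 1.0)
ll = np.logaddexp(c[:, :, None], c[:, None, :]).sum(axis=0) - len(y) * np.log(k)
lp = log_norm(grid, 0.0, tau2)
log_m_exact = (np.logaddexp.reduce((ll + lp[:, None] + lp[None, :]).ravel())
               + 2 * np.log(grid[1] - grid[0]))

print(log_m_chib, log_m_kfact, log_m_perm, log_m_exact)
```

With components this well separated the Gibbs sampler is not expected to switch labels, so the two corrections should roughly agree; the cases of interest in the paper are those with overlapping modes, which this toy does not cover.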

On the use of marginal posteriors in marginal likelihood estimation via importance-sampling

Posted in R, Statistics, University life on November 20, 2013 by xi'an

Perrakis, Ntzoufras, and Tsionas just arXived a paper on marginal likelihood (evidence) approximation (with the above title). The idea behind the paper is to base importance sampling for the evidence on simulations from the product of the (block) marginal posterior distributions. Those simulations can be directly derived from an MCMC output by randomly permuting the components. The only critical issue is to find good approximations to the marginal posterior densities. This is handled in the paper either by normal approximations or by Rao-Blackwell estimates, the latter being rather costly since one importance weight involves B×L computations, where B is the number of blocks and L the number of samples used in the Rao-Blackwell estimates. The time factor does not seem to be included in the comparison studies run by the authors, although it would seem necessary when comparing scenarios.
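
A minimal Python sketch of this idea, under assumptions of my own choosing rather than the authors' examples: a two-coefficient normal linear regression with known unit noise variance and a N(0, tau2·I) prior, so that the evidence is available in closed form, exact posterior draws standing in for an MCMC output, one block per coefficient, and normal approximations to the marginal posterior densities.

```python
import numpy as np

rng = np.random.default_rng(3)

# assumption: y = X beta + N(0, I) noise, prior beta ~ N(0, tau2 * I)
n, d, tau2 = 30, 2, 4.0
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)

def log_norm(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

# exact log evidence: marginally, y ~ N(0, I + tau2 X X^T)
S = np.eye(n) + tau2 * X @ X.T
_, logdet = np.linalg.slogdet(S)
log_m_exact = -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(S, y))

# stand-in for an MCMC output: exact draws from the conjugate posterior
V = np.linalg.inv(X.T @ X + np.eye(d) / tau2)
draws = rng.multivariate_normal(V @ X.T @ y, V, size=5000)

# product of marginals: permute each coordinate (block) independently
prod = np.column_stack([rng.permutation(draws[:, j]) for j in range(d)])

# normal approximation to each marginal posterior density
mu_hat, var_hat = draws.mean(axis=0), draws.var(axis=0)
log_g = log_norm(prod, mu_hat, var_hat).sum(axis=1)

# importance weights: likelihood * prior / product of marginal approximations
log_lik = log_norm(y[None, :], prod @ X.T, 1.0).sum(axis=1)
log_prior = log_norm(prod, 0.0, tau2).sum(axis=1)
log_w = log_lik + log_prior - log_g
log_m_hat = np.logaddexp.reduce(log_w) - np.log(len(log_w))

print(log_m_hat, log_m_exact)
```

In this toy the marginal posteriors are exactly normal, so the normal approximation is as favourable as it can be; replacing log_g by a Rao-Blackwell estimate would multiply the cost of each weight as described above.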

After a standard regression example (that did not include Chib’s solution in the comparison), the paper considers 2- and 3-component mixtures. The discussion centres around label switching (of course) and the deficiencies of Chib’s solution against the current method and Neal’s reference. The study does not include averaging Chib’s solution over permutations as in Berkhof et al. (2003) and Marin et al. (2005), an approach that does eliminate the bias, especially for a small number of components. Instead, the authors stick to the log(k!) correction, despite it being known to be quite unreliable (depending on the amount of overlap between modes). The final example is the longitudinal Poisson regression of Diggle et al. (1995), with random effects on epileptic patients. The appeal of this model is the unavailability of the integrated likelihood, which implies either estimating it by Rao-Blackwellisation or including the 58 latent variables in the analysis. (There is no comparison with other methods.)

As a side note, among the many references provided by this paper, I did not find any trace of Skilling’s nested sampling or of safe harmonic means (as exposed in our own survey on the topic).
