## Split Sampling: expectations, normalisation and rare events

Posted in Books, Statistics, University life with tags , , , , , , on January 27, 2014 by xi'an

Just before Christmas (a year ago), John Birge, Changgee Chang, and Nick Polson arXived a paper with the above title. Split sampling is presented a a tool conceived to handle rare event probabilities, written in this paper as

$Z(m)=\mathbb{E}_\pi[\mathbb{I}\{L(X)>m\}]$

where π is the prior and L the likelihood, m being a large enough bound to make the probability small. However, given John Skilling’s representation of the marginal likelihood as the integral of the Z(m)’s, this simulation technique also applies to the approximation of the evidence. The paper refers from the start to nested sampling as a motivation for this method, presumably not as a way to run nested sampling, which was created as a tool for evidence evaluation, but as a competitor. Nested sampling may indeed face difficulties in handling the coverage of the higher likelihood regions under the prior and it is an approximative method, as we detailed in our earlier paper with Nicolas Chopin. The difference between nested and split sampling is that split sampling adds a distribution ω(m) on the likelihood levels m. If pairs (x,m) can be efficiently generated by MCMC for the target

$\pi(x)\omega(m)\mathbb{I}\{L(X)>m\},$

the marginal density of m can then be approximated by Rao-Blackwellisation. From which the authors derive an estimate of Z(m), since the marginal is actually proportional to ω(m)Z(m). (Because of the Rao-Blackwell argument, I wonder how much this differs from Chib’s 1995 method, i.e. if the split sampling estimator could be expressed as a special case of Chib’s estimator.) The resulting estimator of the marginal also requires a choice of ω(m) such that the associated cdf can be computed analytically. More generally, the choice of ω(m) impacts the quality of the approximation since it determines how often and easily high likelihood regions will be hit. Note also that the conditional π(x|m) is the same as in nested sampling, hence may run into difficulties for complex likelihoods or large datasets.

When reading the beginning of the paper, the remark that “the chain will visit each level roughly uniformly” (p.13) made me wonder at a possible correspondence with the Wang-Landau estimator. Until I read the reference to Jacob and Ryder (2012) on page 16. Once again, I wonder at a stronger link between both papers since the Wang-Landau approach aims at optimising the exploration of the simulation space towards a flat histogram. See for instance Figure 2.

The following part of the paper draws a comparison with both nested sampling and the product estimator of Fishman (1994). I do not fully understand the consequences of the equivalence between those estimators and the split sampling estimator for specific choices of the weight function ω(m). Indeed, it seemed to me that the main point was to draw from a joint density on (x,m) to avoid the difficulties of exploring separately each level set. And also avoiding the approximation issues of nested sampling. As a side remark, the fact that the harmonic mean estimator occurs at several points of the paper makes me worried. The qualification of “poor Monte Carlo error variances properties” is an understatement for the harmonic mean estimator, as it generally has infinite variance and it hence should not be used at all, even as a starting point. The paper does not elaborate much about the cross-entropy method, despite using an example from Rubinstein and Kroese (2004).

In conclusion, an interesting paper that made me think anew about the nested sampling approach, which keeps its fascination over the years! I will most likely use it to build an MSc thesis project this summer in Warwick.

## Statistical evidence for revised standards

Posted in Statistics, University life with tags , , , , , , , , , on December 30, 2013 by xi'an

In yet another permutation of the original title (!), Andrew Gelman posted the answer Val Johnson sent him after our (submitted)  letter to PNAS. As Val did not send me a copy (although Andrew did!), I will not reproduce it here and I rather refer the interested readers to Andrews’ blog… In addition to Andrew’s (sensible) points, here are a few idle (post-X’mas and pre-skiing) reflections:

• “evidence against a false null hypothesis accrues exponentially fast” makes me wonder in which metric this exponential rate (in γ?) occurs;
• that “most decision-theoretic analyses of the optimal threshold to use for declaring a significant finding would lead to evidence thresholds that are substantially greater than 5 (and probably also greater 25)” is difficult to accept as an argument since there is no trace of a decision-theoretic argument in the whole paper;
• Val rejects our minimaxity argument on the basis that “[UMPBTs] do not involve minimization of maximum loss” but the prior that corresponds to those tests is minimising the integrated probability of not rejecting at threshold level γ, a loss function integrated against parameter and observation, a Bayes risk in other words… Point masses or spike priors are clearly characteristics of minimax priors. Furthermore, the additional argument that “in most applications, however, a unique loss function/prior distribution combination does not exist” has been used by many to refute the Bayesian perspective and makes me wonder what are the arguments left in using a (pseudo-)Bayesian approach;
• the next paragraph is pure tautology: the fact that “no other test, based on either a subjectively or objectively specified alternative hypothesis, is as likely to produce a Bayes factor that exceeds the specified evidence threshold” is a paraphrase of the definition of UMPBTs, not an argument. I do not see we should solely “worry about false negatives”, since minimising those should lead to a point mass on the null (or, more seriously, should not lead to the minimax-like selection of the prior under the alternative).

## Importance sampling schemes for evidence approximation in mixture models

Posted in R, Statistics, University life with tags , , , , , , , , , on November 27, 2013 by xi'an

Jeong Eun (Kate) Lee and I completed this paper, “Importance sampling schemes for evidence approximation in mixture models“, now posted on arXiv. (With the customary one-day lag for posting, making me bemoan the days of yore when arXiv would give a definitive arXiv number at the time of submission.) Kate came twice to Paris in the past years to work with me on this evaluation of Chib’s original marginal likelihood estimate (also called the candidate formula by Julian Besag). And on the improvement proposed by Berkhof, van Mechelen, and Gelman (2003), based on averaging over all permutations, idea that we rediscovered in an earlier paper with Jean-Michel Marin. (And that Andrew seemed to have completely forgotten. Despite being the very first one to publish [in English] a paper on a Gibbs sampler for mixtures.) Given that this averaging can get quite costly, we propose a preliminary step to reduce the number of relevant permutations to be considered in the averaging, removing far-away modes that do not contribute to the Rao-Blackwell estimate and called dual importance sampling. We also considered modelling the posterior as a product of k-component mixtures on the components, following a vague idea I had in the back of my mind for many years, but it did not help. In the above boxplot comparison of estimators, the marginal likelihood estimators are

1. Chib’s method using T = 5000 samples with a permutation correction by multiplying by k!.
2. Chib’s method (1), using T = 5000 samples which are randomly permuted.
3. Importance sampling estimate (7), using the maximum likelihood estimate (MLE) of the latents as centre.
4. Dual importance sampling using q in (8).
5. Dual importance sampling using an approximate in (14).
6. Bridge sampling (3). Here, label switching is imposed in hyperparameters.

## On the use of marginal posteriors in marginal likelihood estimation via importance-sampling

Posted in R, Statistics, University life with tags , , , , , , , , , , , , , on November 20, 2013 by xi'an

Perrakis, Ntzoufras, and Tsionas just arXived a paper on marginal likelihood (evidence) approximation (with the above title). The idea behind the paper is to base importance sampling for the evidence on simulations from the product of the (block) marginal posterior distributions. Those simulations can be directly derived from an MCMC output by randomly permuting the components. The only critical issue is to find good approximations to the marginal posterior densities. This is handled in the paper either by normal approximations or by Rao-Blackwell estimates. the latter being rather costly since one importance weight involves B.L computations, where B is the number of blocks and L the number of samples used in the Rao-Blackwell estimates. The time factor does not seem to be included in the comparison studies run by the authors, although it would seem necessary when comparing scenarii.

After a standard regression example (that did not include Chib’s solution in the comparison), the paper considers  2- and 3-component mixtures. The discussion centres around label switching (of course) and the deficiencies of Chib’s solution against the current method and Neal’s reference. The study does not include averaging Chib’s solution over permutations as in Berkoff et al. (2003) and Marin et al. (2005), an approach that does eliminate the bias. Especially for a small number of components. Instead, the authors stick to the log(k!) correction, despite it being known for being quite unreliable (depending on the amount of overlap between modes). The final example is Diggle et al. (1995) longitudinal Poisson regression with random effects on epileptic patients. The appeal of this model is the unavailability of the integrated likelihood which implies either estimating it by Rao-Blackwellisation or including the 58 latent variables in the analysis.  (There is no comparison with other methods.)

As a side note, among the many references provided by this paper, I did not find trace of Skilling’s nested sampling or of safe harmonic means (as exposed in our own survey on the topic).

## my DICussion

Posted in Books, Kids, pictures, Statistics, University life with tags , , , , , , , on September 25, 2013 by xi'an

Following the Re-Reading of Spiegelhalter et al. (2002) by David at the RSS Annual Conference a few weeks ago, and my invited discussion there, I was asked to contribute a written discussion to Series B, a request obviously impossible to refuse!

The main issue with DIC is the question of its worth for Bayesian decision analysis (since I doubt there are many proponents of DIC outside the Bayesian community). The appeal of DIC is, I presume, to deliver a single summary per model under comparison and to allow therefore for a complete ranking of those models. I however object at the worth of simplicity for simplicity’s sake: models are complex (albeit less than reality) and their usages are complex as well. To consider that model A is to be preferred upon model B just because DIC(A)=1228 < DiC(B)=1237 is a mimicry of the complex mechanisms at play behind model choice, especially given the wealth of information provided by a Bayesian framework. (Non-Bayesian paradigms are more familiar with procedures based on a single estimator value.) And to abstain from accounting for the significance of the difference between DIC(A) and DIC(B) clearly makes matters worse.

This is not even discussing the stylised setting where one model is considered as “true” and where procedures are compared by their ability to recover the “truth”. David Spiegelhalter repeatedly mentioned during his talk that he was not interested in this. This stance brings another objection, though, namely that models can only be compared against their predictive abilities, which DIC seems unable to capture. Once again, what is needed is a multi-factor and all-encompassing criterion that evaluates the predictive models in terms of their recovery of some features of the phenomenon under study. Or of the process being conducted. (Even stooping down to a one-dimensional loss function that is supposed to summarise the purpose of the model comparison does not produce anything close to the DIC function.)

Obviously, considering that asymptotic consistency is of no importance whatsoever (as repeated by David in Newcastle) avoids some embarrassing questions, except the one about the true purpose of statistical models and procedures. How can they be compared if no model is true and if accumulating data from a given model is not meaningful? How can simulation be conducted in such a barren landscape?  I find it the more difficult to accept this minimalist attitude that models are truly used as if they were or could be true, at several stages in the process. It also prevents the study of the criterion under model misspecification, which would clearly be of interest.

Another point, already exposed in our 2006 Bayesian Analysis paper with Gilles Celeux, Florence Forbes, and Mike Titterington, is that there is no unique driving principle for constructing DICs. In that paper, we examined eight different and natural versions of DIC for mixture models, resulting in highly diverging values for DIC and the effective dimension of the parameter, I believe that such a lack of focus is bound to reappear in any multimodal setting and fear that the answer about (eight) different focus on what matters in the model is too cursory and lacks direction for the hapless practitioner.

My final remark about DIC is that it shares very much the same perspective as Murray Aitkin’s integrated likelihood, Both Aitkin (1991, 2009) and Spiegelhalter et al. (2002) consider a posterior distribution on the likelihood function, taken as a function of the parameter but omitting the delicate fact that it also depends on the observable and hence does not exist a priori. We wrote a detailed review of Aitkin’s (2009) book, where most of the criticisms equally apply to DIC, and I will not repeat them here, except for pointing out that it escapes the Bayesian framework (and thus requires even more its own justifications).