## Split Sampling: expectations, normalisation and rare events

Posted in Books, Statistics, University life on January 27, 2014 by xi'an

Just before Christmas (a year ago), John Birge, Changgee Chang, and Nick Polson arXived a paper with the above title. Split sampling is presented as a tool conceived to handle rare event probabilities, written in this paper as

$Z(m)=\mathbb{E}_\pi[\mathbb{I}\{L(X)>m\}]$

where π is the prior and L the likelihood, m being a large enough bound to make the probability small. However, given John Skilling’s representation of the marginal likelihood as the integral of the Z(m)’s, this simulation technique also applies to the approximation of the evidence. The paper refers from the start to nested sampling as a motivation for this method, presumably not as a way to run nested sampling, which was created as a tool for evidence evaluation, but as a competitor. Nested sampling may indeed face difficulties in handling the coverage of the higher likelihood regions under the prior, and it is an approximate method, as we detailed in our earlier paper with Nicolas Chopin. The difference between nested and split sampling is that split sampling adds a distribution ω(m) on the likelihood levels m. If pairs (x,m) can be efficiently generated by MCMC for the target

$\pi(x)\omega(m)\mathbb{I}\{L(x)>m\},$

the marginal density of m can then be approximated by Rao-Blackwellisation. From which the authors derive an estimate of Z(m), since the marginal is actually proportional to ω(m)Z(m). (Because of the Rao-Blackwell argument, I wonder how much this differs from Chib’s 1995 method, i.e. if the split sampling estimator could be expressed as a special case of Chib’s estimator.) The resulting estimator of the marginal also requires a choice of ω(m) such that the associated cdf can be computed analytically. More generally, the choice of ω(m) impacts the quality of the approximation since it determines how often and easily high likelihood regions will be hit. Note also that the conditional π(x|m) is the same as in nested sampling, hence may run into difficulties for complex likelihoods or large datasets.
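To fix ideas, here is a minimal sketch of the mechanism on a toy problem (my own construction, not the authors' implementation): a standard normal prior, a made-up likelihood L(x)=exp(-(x-2)²), a small grid of levels, and a uniform ω. The chain alternates a Metropolis move in x, restricted to {L(x)>m}, with a Gibbs move on the level index; the empirical marginal of the levels, divided by ω, then recovers Z(m) up to normalisation.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy setup (illustrative choices): prior pi = N(0,1), likelihood L(x) = exp(-(x-2)^2),
# a finite grid of likelihood levels m_k, and a uniform weight omega on the levels
def L(x):
    return np.exp(-(x - 2.0) ** 2)

levels = np.array([0.0, 0.01, 0.05, 0.2, 0.5])
omega = np.full(len(levels), 1.0 / len(levels))

x, k = 0.0, 0
counts = np.zeros(len(levels))
for _ in range(100_000):
    # Metropolis move in x, targeting pi(x) restricted to {L(x) > m_k}
    prop = x + rng.normal()
    if L(prop) > levels[k] and rng.random() < np.exp(0.5 * x * x - 0.5 * prop * prop):
        x = prop
    # Gibbs move on the level index: p(k|x) proportional to omega_k on {m_k < L(x)}
    ok = np.flatnonzero(levels < L(x))
    k = ok[rng.integers(len(ok))]   # omega is uniform here, so a uniform draw
    counts[k] += 1

# the marginal of m is proportional to omega(m) Z(m); anchor at m=0 where Z=1
Zhat = (counts / omega) / (counts[0] / omega[0])

# brute-force check: Z(m) = P_pi(L(X) > m) by direct Monte Carlo
xs = rng.normal(size=100_000)
Zmc = np.array([(L(xs) > m).mean() for m in levels])
print(np.round(Zhat, 3), np.round(Zmc, 3))
```

The final print compares the split-sampling estimate of Z(m) with a brute-force evaluation; in a genuine rare-event setting the latter would of course be unavailable, which is the whole point of the method.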

When reading the beginning of the paper, the remark that “the chain will visit each level roughly uniformly” (p.13) made me wonder at a possible correspondence with the Wang-Landau estimator, until I read the reference to Jacob and Ryder (2012) on page 16. Once again, I wonder whether there is a stronger link between both papers, since the Wang-Landau approach aims at optimising the exploration of the simulation space towards a flat histogram. See for instance Figure 2.

The following part of the paper draws a comparison with both nested sampling and the product estimator of Fishman (1994). I do not fully understand the consequences of the equivalence between those estimators and the split sampling estimator for specific choices of the weight function ω(m). Indeed, it seemed to me that the main point was to draw from a joint density on (x,m) to avoid the difficulties of exploring each level set separately, while also avoiding the approximation issues of nested sampling. As a side remark, the fact that the harmonic mean estimator occurs at several points of the paper worries me. The qualification of “poor Monte Carlo error variances properties” is an understatement for the harmonic mean estimator, as it generally has infinite variance and hence should not be used at all, even as a starting point. The paper does not elaborate much on the cross-entropy method, despite using an example from Rubinstein and Kroese (2004).

In conclusion, an interesting paper that made me think anew about the nested sampling approach, which keeps its fascination over the years! I will most likely use it to build an MSc thesis project this summer in Warwick.

## On the use of marginal posteriors in marginal likelihood estimation via importance-sampling

Posted in R, Statistics, University life on November 20, 2013 by xi'an

Perrakis, Ntzoufras, and Tsionas just arXived a paper on marginal likelihood (evidence) approximation (with the above title). The idea behind the paper is to base importance sampling for the evidence on simulations from the product of the (block) marginal posterior distributions. Those simulations can be directly derived from an MCMC output by randomly permuting the components. The only critical issue is to find good approximations to the marginal posterior densities. This is handled in the paper either by normal approximations or by Rao-Blackwell estimates, the latter being rather costly since one importance weight involves B×L computations, where B is the number of blocks and L the number of samples used in the Rao-Blackwell estimates. The time factor does not seem to be included in the comparison studies run by the authors, although it would seem necessary when comparing scenarii.
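The core identity is easy to check on a toy case where everything is available in closed form. In the sketch below (my own construction, not the authors' code), a correlated bivariate Gaussian with known normalising constant Z stands in for the unnormalised posterior; permuting the coordinates of the “MCMC” draws independently produces draws from the product of the marginal posteriors, which then serves as the importance density. In the paper the marginals would themselves have to be approximated (normal approximations or Rao-Blackwellisation); here they are known exactly.

```python
import numpy as np

rng = np.random.default_rng(2)

# stand-in for an unnormalised posterior: p(theta) = Z * N(theta; mu, Sigma),
# with Z known so that the importance-sampling estimate can be checked
Z_true = 4.2
mu = np.array([1.0, -0.5])
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
norm_const = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

def p_unnorm(th):                       # rows of th are points theta
    d = th - mu
    quad = np.einsum('ni,ij,nj->n', d, Sigma_inv, d)
    return Z_true * norm_const * np.exp(-0.5 * quad)

# "MCMC output": exact draws from the normalised posterior
draws = rng.multivariate_normal(mu, Sigma, size=100_000)

# permuting each block independently gives draws from the product of the
# marginal posteriors, used as the importance density q (marginal variances are 1)
perm = np.column_stack([rng.permutation(draws[:, 0]), rng.permutation(draws[:, 1])])
q = (np.exp(-0.5 * (perm[:, 0] - mu[0]) ** 2) / np.sqrt(2 * np.pi)
     * np.exp(-0.5 * (perm[:, 1] - mu[1]) ** 2) / np.sqrt(2 * np.pi))

Z_hat = np.mean(p_unnorm(perm) / q)     # importance-sampling evidence estimate
print(Z_hat)
```

Since the product of marginals has lighter joint dependence than the target but the same component tails, the weights remain well-behaved here; this need not hold for strongly dependent posteriors.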

After a standard regression example (that did not include Chib’s solution in the comparison), the paper considers 2- and 3-component mixtures. The discussion centres around label switching (of course) and the deficiencies of Chib’s solution against the current method and Neal’s reference. The study does not include averaging Chib’s solution over permutations as in Berkoff et al. (2003) and Marin et al. (2005), an approach that does eliminate the bias, especially for a small number of components. Instead, the authors stick to the log(k!) correction, even though it is known to be quite unreliable (depending on the amount of overlap between modes). The final example is Diggle et al.’s (1995) longitudinal Poisson regression with random effects on epileptic patients. The appeal of this model is the unavailability of the integrated likelihood, which implies either estimating it by Rao-Blackwellisation or including the 58 latent variables in the analysis. (There is no comparison with other methods.)

As a side note, among the many references provided by this paper, I did not find trace of Skilling’s nested sampling or of safe harmonic means (as exposed in our own survey on the topic).

## summary statistics for ABC model choice

Posted in Statistics on March 11, 2013 by xi'an

A few days ago, Dennis Prangle, Paul Fearnhead, and their co-authors from New Zealand posted on arXiv their (long-awaited) study of the selection of summary statistics for ABC model choice. And I read it during my trip to England, in trains and planes, if not when strolling in the beautiful English countryside.

As posted several times on this ‘Og, the crux of the analysis is that the Bayes factor is a good type of summary when comparing two models, this result extending to more models by considering instead the vector of evidences. As in the initial Read Paper by Fearnhead and Prangle, there is no true optimality in using the Bayes factor or vector of evidences, strictly speaking, besides the fact that the vector of evidences is minimal sufficient for the marginal models (integrating out the parameters). (This was a point made in my discussion.) The implementation of the principle is similar to the Read Paper setting as well: run a pilot ABC simulation, estimate the vector of evidences, and re-run the main ABC simulation using this estimate as the summary statistic. The paper contains a simulation study using some of our examples (in Marin et al., 2012), as well as an application to genetic bacterial data.
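The pilot-then-rerun principle can be sketched on a toy pair of models (a hedged illustration of my own; the models, the logistic-regression step standing in for the paper's regression, and all tolerances are my choices, not the authors'): M1: N(0,1) versus M2: N(θ,1) with θ∼N(0,1), for which the sample mean is sufficient. A pilot run estimates P(M2|data) by logistic regression, and this fitted probability then serves as the one-dimensional summary in the main ABC run.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10                                   # observations per simulated dataset

def simulate(model):
    # M1: x ~ N(0,1); M2: x ~ N(theta,1) with theta ~ N(0,1); keep the sample mean
    theta = 0.0 if model == 1 else rng.normal()
    return rng.normal(theta, 1.0, n).mean()

# pilot run: (model index, raw summary) pairs
pilot_m = rng.integers(1, 3, 2000)
pilot_s = np.array([simulate(m) for m in pilot_m])

# estimate P(M2 | s) by logistic regression on (1, s, s^2), plain gradient ascent
X = np.column_stack([np.ones_like(pilot_s), pilot_s, pilot_s ** 2])
y = (pilot_m == 2).astype(float)
w = np.zeros(3)
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w += 0.05 * X.T @ (y - p) / len(y)

def summary(s):
    # fitted posterior probability of M2, used as the one-dimensional summary
    return 1.0 / (1.0 + np.exp(-(w[0] + w[1] * s + w[2] * s * s)))

s_obs = 1.5          # "observed" sample mean, chosen to clearly favour M2

# main ABC run with the estimated summary and a fixed tolerance
main_m = rng.integers(1, 3, 20_000)
main_s = np.array([simulate(m) for m in main_m])
keep = np.abs(summary(main_s) - summary(s_obs)) < 0.02
p_M2 = (main_m[keep] == 2).mean()
print(p_M2, keep.sum())
```

The accepted frequency of M2 approximates its posterior probability; with s_obs this extreme, the analytic answer is essentially one, so the run mostly checks that the fitted summary orders datasets sensibly.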

## estimating a constant

Posted in Books, Statistics on October 3, 2012 by xi'an

Paulo (a.k.a. Zen) posted a comment in StackExchange on Larry Wasserman’s paradox about Bayesians and likelihoodists (or likelihood-wallahs, to quote Basu!) being unable to solve the problem of estimating the normalising constant c of the sample density, f, known up to a constant

$f(x) = c g(x)$

(Example 11.10, page 188, of All of Statistics)

My own comment is that, with all due respect to Larry!, I do not see much appeal in this example, esp. as a potential criticism of Bayesians and likelihood-wallahs… The constant c is known, being equal to

$1/\int_\mathcal{X} g(x)\text{d}x$

If c is the only “unknown” in the picture, given a sample x1,…,xn, then there is no statistical issue whatsoever about the “problem” and I do not agree with the postulate that there exist estimators of c, nor priors on c (other than the Dirac mass at the above value). This is not in the least a statistical problem but rather a numerical issue. That the sample x1,…,xn can be (re)used through a (frequentist) density estimate to provide a numerical approximation of c

$\hat c = \hat f(x_0) \big/ g(x_0)$

is a mere curiosity. Not a criticism of alternative statistical approaches: e.g., I could also use a Bayesian density estimate…
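Both routes are a few lines each; the toy sketch below (my own, with g the unnormalised standard normal density so that c = 1/√(2π)) contrasts the kernel-density “estimator” of c with a plain importance sampling evaluation of 1/c = ∫g.

```python
import numpy as np

rng = np.random.default_rng(4)

# toy instance: g(x) = exp(-x^2/2), so that f = c g is N(0,1) and c = 1/sqrt(2 pi)
g = lambda x: np.exp(-0.5 * x * x)
c_true = 1.0 / np.sqrt(2.0 * np.pi)

# "statistical" route: Gaussian kernel density estimate of f at x0 = 0
x = rng.normal(size=2000)                 # the sample from f
h = x.std() * len(x) ** (-1 / 5)          # rule-of-thumb bandwidth
f_hat = np.mean(np.exp(-0.5 * (x / h) ** 2)) / (h * np.sqrt(2 * np.pi))
c_kde = f_hat / g(0.0)

# "numerical" route: importance sampling for 1/c = int g, with a N(0,2) proposal
z = rng.normal(0.0, np.sqrt(2.0), size=2000)
q = np.exp(-0.25 * z * z) / np.sqrt(4 * np.pi)   # N(0,2) density
c_is = 1.0 / np.mean(g(z) / q)
print(c_kde, c_is, c_true)
```

With the same budget, the importance sampling route converges at the parametric rate while the kernel route carries the non-parametric bias and rate, which is the point made above.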

Furthermore, the estimate provided by the sample x1,…,xn is not of particular interest since its precision is imposed by the sample size n (and converges at non-parametric rates, which is not a particularly relevant issue!), while I could use importance sampling (or even numerical integration) if I was truly interested in c. I however find the discussion interesting for many reasons:

1. it somehow relates to the infamous harmonic mean estimator issue, often discussed on the ’Og!;
2. it brings more light on the paradoxical differences between statistics and Monte Carlo methods, in that statistics is usually constrained by the sample while Monte Carlo methods have more freedom in generating samples (up to some budget limits). It does not make sense to speak of estimators in Monte Carlo methods because there is no parameter in the picture, only “unknown” constants. Both fields rely on samples and probability theory, and share many features, but there is nothing like a “best unbiased estimator” in Monte Carlo integration, see the case of the “optimal importance function” leading to a zero variance;
3. in connection with the previous point, the fascinating Bernoulli factory problem is not a statistical problem because it requires an infinite sequence of Bernoullis to operate;
4. the discussion induced Chris Sims to contribute to StackExchange!

## Harmonic means, again

Posted in Statistics, University life on January 3, 2012 by xi'an

Over the non-vacation and the vacation breaks of the past weeks, I skipped a lot of arXiv postings. This morning, I took a look at “Probabilities of exoplanet signals from posterior samplings” by Mikko Tuomi and Hugh R. A. Jones. This is a paper to appear in Astronomy and Astrophysics, but the main point [to me] is its study of a novel approximation to the marginal likelihood. The authors propose what looks at first like defensive sampling: given a likelihood f(x|θ) and a corresponding Markov chain (θi), the approximation is based on the following importance sampling representation

$\hat m(x) = \sum_{i=h+1}^N \dfrac{f(x|\theta_i)}{(1-\lambda) f(x|\theta_i) + \lambda f(x|\theta_{i-h})}\Big/\sum_{i=h+1}^N \dfrac{1}{(1-\lambda) f(x|\theta_i) + \lambda f(x|\theta_{i-h})}$

This is called a truncated posterior mixture approximation and, under closer scrutiny, it is not defensive sampling. Indeed, the second part in the denominators does not depend on the parameter θi, therefore, as far as importance sampling is concerned, it is a constant (if random) term! The authors impose a bounded parameter space for this reason, however I do not see why such an approximation converges. Except when λ=0, of course, which brings us back to the original harmonic mean estimator. (Properly rejected by the authors for its very slow convergence or, more accurately, for generally enjoying no convergence stronger than the law of large numbers.) Furthermore, the generic importance sampling argument does not work here since, if

$g(\theta_i) \propto (1-\lambda) \pi(\theta_i|x) + \lambda \pi(\theta_{i-h}|x)$

is the importance function, the ratio

$\dfrac{\pi(\theta_i)f(x|\theta_i)}{(1-\lambda) \pi(\theta_i|x) + \lambda \pi(\theta_{i-h}|x)}$

does not simplify… I do not understand either why the authors compare Bayes factor approximations based on this technique, on the harmonic mean version or on Chib and Jeliazkov’s (2001) solution with both DIC and AIC, since the latter are not approximations to Bayes factors. I am therefore quite surprised at the paper being accepted for publication, given that the numerical evaluation shows the impact of the coefficient λ does not vanish with the number of simulations. (Which is logical given the bias induced by the additional term.)
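The λ=0 reduction is straightforward to verify numerically. The sketch below (a toy conjugate normal model of my own, not the authors' code) implements the displayed estimator directly and checks that setting λ=0 returns exactly the harmonic mean of the likelihood values; a λ>0 version is printed alongside for comparison.

```python
import numpy as np

rng = np.random.default_rng(5)

# conjugate toy model: x_i ~ N(theta, 1), i = 1..n, with prior theta ~ N(0, 1)
n = 20
x = rng.normal(1.0, 1.0, n)
post_var = 1.0 / (n + 1.0)
post_mean = x.sum() * post_var
theta = rng.normal(post_mean, np.sqrt(post_var), 50_000)   # exact posterior draws

def loglik(t):
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * ((x[:, None] - t) ** 2).sum(axis=0)

def tpm(theta, lam, h=1):
    # truncated posterior mixture estimate of m(x), as read off the displayed formula
    f_i = np.exp(loglik(theta[h:]))
    f_lag = np.exp(loglik(theta[:-h]))
    denom = (1 - lam) * f_i + lam * f_lag
    return np.sum(f_i / denom) / np.sum(1.0 / denom)

# lambda = 0 collapses to the harmonic mean of the likelihoods (same draws, h=1 drops one)
harmonic = 1.0 / np.mean(1.0 / np.exp(loglik(theta[1:])))
print(tpm(theta, 0.0), harmonic, tpm(theta, 0.1))
```

Running the λ=0.1 version for increasing simulation sizes is a quick way to see the non-vanishing impact of λ discussed above.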

## Bayesian inference and the parametric bootstrap

Posted in R, Statistics, University life on December 16, 2011 by xi'an

This paper by Brad Efron came to my knowledge when I was looking for references on the Bayesian bootstrap to answer a Cross Validated question. After reading it more thoroughly, “Bayesian inference and the parametric bootstrap” puzzles me, which most certainly means I have missed the main point. Indeed, the paper relies on the parametric bootstrap—a frequentist approximation technique mostly based on simulation from a plug-in distribution, and a robust inferential method for estimating distributions from empirical cdfs—to assess (frequentist) coverage properties of Bayesian posteriors. The manuscript mixes a parametric bootstrap simulation output for posterior inference—even though the bootstrap produces simulations of estimators while the posterior distribution operates on the parameter space, those estimator simulations can nonetheless be recycled as parameter simulations by a genuine importance sampling argument—with the coverage properties of Jeffreys posteriors vs. the BCa [which stands for bias-corrected and accelerated, see Efron, 1987] confidence density—which truly take place in different spaces. Efron however connects both spaces by taking advantage of the importance sampling connection and defines a corrected BCa prior to make the confidence intervals match. In my opinion this does not define a prior in the Bayesian sense, since the correction seems to depend on the data. And I see no strong incentive to match the frequentist coverage, because this would furthermore define a new prior for each component of the parameter. This study of the frequentist properties of Bayesian credible intervals reminded me of the recent discussion paper by Don Fraser on the topic, which follows the same argument that Bayesian credible regions are not necessarily good frequentist confidence intervals.
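The reweighting mechanism itself is easy to sketch in the normal-mean case, where the likelihood (as a function of θ) and the bootstrap density of the replicated MLEs share the same Gaussian shape, so the importance weights reduce to the prior (a toy illustration of my own, not Efron's code).

```python
import numpy as np

rng = np.random.default_rng(6)

# model: x_i ~ N(theta, 1); prior theta ~ N(0, 1); the MLE is the sample mean
n = 30
x = rng.normal(0.8, 1.0, n)
mle = x.mean()

# parametric bootstrap: replicate MLEs from the plug-in model, so theta* ~ N(mle, 1/n)
B = 50_000
boot = rng.normal(mle, 1.0 / np.sqrt(n), B)

# importance weights: prior times likelihood over the bootstrap density; here the
# likelihood in theta and the bootstrap density have the same Gaussian shape,
# so the two quadratic terms cancel and the weights reduce to the prior
log_w = (-0.5 * n * (boot - mle) ** 2            # log-likelihood (up to a constant)
         - 0.5 * boot ** 2                       # log-prior N(0,1)
         + 0.5 * n * (boot - mle) ** 2)          # minus log bootstrap density
w = np.exp(log_w - log_w.max())

post_mean_hat = np.average(boot, weights=w)
post_mean_true = n * mle / (n + 1.0)             # conjugate normal-normal answer
print(post_mean_hat, post_mean_true)
```

This is the benign exponential-family situation; with heavier posterior tails than the plug-in model, the same weights could exhibit the infinite-variance behaviour flagged in the first conclusion point below.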

The conclusion of the paper is made of several points, some of which may not be strongly supported by the previous analysis:

1. “The parametric bootstrap distribution is a favorable starting point for importance sampling computation of Bayes posterior distributions.” [I am not so certain about this point, given that the bootstrap is based on a plug-in estimate, hence fails to account for the variability of this estimate, and may thus induce infinite variance behaviour, as in the harmonic mean estimator of Newton and Raftery (1994). Because the tails of the importance density are those of the likelihood, the heavier tails of the posterior induced by the convolution with the prior distribution are likely to lead to this fatal misbehaviour of the importance sampling estimator.]
2. “This computation is implemented by reweighting the bootstrap replications rather than by drawing observations directly from the posterior distribution as with MCMC.” [Computing the importance ratio requires the availability both of the likelihood function and of the likelihood estimator, which means a setting where Bayesian computations are not particularly hindered and do not necessarily call for advanced MCMC schemes.]
3. “The necessary weights are easily computed in exponential families for any prior, but are particularly simple starting from Jeffreys invariant prior, in which case they depend only on the deviance difference.” [Always from a computational perspective, the ease of computing the importance weights is mirrored by the ease in handling the posterior distributions.]
4. “The deviance difference depends asymptotically on the skewness of the family, having a cubic normal form.” [No relevant comment.]
5. “In our examples, Jeffreys prior yielded posterior distributions not much different than the unweighted bootstrap distribution. This may be unsatisfactory for single parameters of interest in multi-parameter families.” [The frequentist confidence properties of Jeffreys priors have already been examined in the past and found to be lacking in multidimensional settings, an assessment made from a frequentist perspective. However, the use of Jeffreys priors is not justified on this particular ground.]
6. “Better uninformative priors, such as the Welch and Peers family or reference priors, are closely related to the frequentist BCa reweighting formula.” [The paper only finds proximities in two examples, but it does not assess this relation in a wider generality. Again, this is not particularly relevant from a Bayesian viewpoint.]
7. “Because of the i.i.d. nature of bootstrap resampling, simple formulas exist for the accuracy of posterior computations as a function of the number B of bootstrap replications. Even with excessive choices of B, computation time was measured in seconds for our examples.” [This is not very surprising. It however assesses Bayesian procedures from a frequentist viewpoint, so this may be lost on both Bayesian and frequentist users...]
8. “An efficient second-level bootstrap algorithm (“bootstrap-after-bootstrap”) provides estimates for the frequentist accuracy of Bayesian inferences.” [This is completely correct and why bootstrap is such an appealing technique for frequentist inference. I spent the past two weeks teaching non-parametric bootstrap to my R class and the students are now fluent with the concept, even though they are unsure about the meaning of estimation and testing!]
9. “This can be important in assessing inferences based on formulaic priors, such as those of Jeffreys, rather than on genuine prior experience.” [Again, this is neither very surprising nor particularly appealing to Bayesian users.]

In conclusion, I found the paper quite thought-provoking and stimulating, definitely opening new vistas in a very elegant way. I however remain unconvinced by the simulation aspects from a purely Monte Carlo perspective.

## R exam

Posted in Kids, pictures, Statistics, University life on November 28, 2011 by xi'an

Following a long tradition (!) of changing the modus vivendi of each exam in our exploratory statistics with R class, we decided this year to give the students a large collection of exercises prior to the exam and to pick five among them for the exam, the students having to solve two and only two of them. (The exercises are available in French on my webpage.) This worked beyond our expectations in that the overwhelming majority of students went over all the exercises and did really (too) well at the exam! Next year, we will hopefully increase the collection of exercises and also prohibit written notes during the exam (to avoid a possible division of labour among the students).

Incidentally, we found a few (true) gems in the solutions, incl. a harmonic mean resolution of the approximation of the integral

$\int_2^\infty x^4 e^{-x}\,\text{d}x=\Gamma(5,2)$

since some students generated from the distribution with density f proportional to the integrand over [2,∞) [a truncated gamma] and then took the estimator

$\dfrac{e^{-2}}{\frac{1}{n}\,\sum_{i=1}^n y_i^{-4}}\approx\dfrac{\int_2^\infty e^{-x}\,\text{d}x}{\mathbb{E}[X^{-4}]}\quad\text{when}\quad X\sim f$

although we expected them to simulate directly from the exponential and average the fourth powers of the sample… In this specific situation, the (dreaded) harmonic mean estimator has a finite variance! To wit:

> y=rgamma(shape=5,n=10^5)
> pgamma(2,5,low=FALSE)*gamma(5)
[1] 22.73633
> integrate(f=function(x){x^4*exp(-x)},2,Inf)
22.73633 with absolute error < 0.0017
> pgamma(2,1,low=FALSE)/mean(y[y>2]^{-4})
[1] 22.92461
> z=rgamma(shape=1,n=10^5)
> mean((z>2)*z^4)
[1] 23.92876


So the harmonic mean does better than the regular Monte Carlo estimate in this case!
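For readers who want to see the variance claim rather than a single run, the sketch below (in Python rather than R, with replication counts and sample sizes of my own choosing) repeats both estimators a hundred times and compares their spread: the y⁻⁴ terms are bounded by 2⁻⁴ on (2,∞), so the harmonic mean version is far more stable.

```python
import numpy as np

rng = np.random.default_rng(7)
truth = 22.73633     # Gamma(5,2), as returned by integrate() above

def harmonic_est(n):
    # harmonic mean estimate, drawing from the truncated Gamma(5) on (2, inf) by rejection
    y = rng.gamma(5.0, size=3 * n)
    y = y[y > 2.0][:n]
    return np.exp(-2.0) / np.mean(y ** -4.0)

def direct_est(n):
    # plain Monte Carlo estimate with exponential draws
    z = rng.exponential(size=n)
    return np.mean((z > 2.0) * z ** 4.0)

reps_h = np.array([harmonic_est(2000) for _ in range(100)])
reps_d = np.array([direct_est(2000) for _ in range(100)])
print(reps_h.mean(), reps_h.std(), reps_d.mean(), reps_d.std())
```

The empirical standard deviations make the comparison concrete: the direct estimator inherits the huge variance of z⁴ beyond 2, while the harmonic version averages a bounded quantity.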