## Estimating the number of species

Posted in Statistics on November 20, 2009 by xi'an

Bayesian Analysis just published on-line a paper by Hongmei Zhang and Hal Stern on a (new) Bayesian analysis of the problem of estimating the number of unseen species within a population. This problem has always fascinated me, as it seems at first sight to be an impossible problem: how can you estimate the number of species you do not know?! The approach relates to capture-recapture models, with an extra hierarchical layer for the species. The Bayesian analysis of the model obviously makes a lot of sense, with the prior modelling being quite influential. Zhang and Stern use a hierarchical Dirichlet prior on the capture probabilities, $\theta_i$, when the captures follow a multinomial model

$y|\theta,S \sim \mathcal{M}(N, \theta_1,\ldots,\theta_S)$

where $N=\sum_i y_i$ is the total number of observed individuals,

$\mathbf{\theta}|S \sim \mathcal{D}(\alpha,\ldots,\alpha)$

and

$\pi(\alpha,S) = f(1-f)^{S-S_\text{min}} \alpha^{-3/2}$

forcing the coefficients of the Dirichlet prior towards zero. The paper also covers predictive design, analysing the capture effort corresponding to a given recovery rate of species. The overall approach is not immensely innovative in its methodology, the MCMC part being rather straightforward, but the predictive abilities of the model are nonetheless interesting.
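To get a feel for how this hierarchy produces unseen species, here is a minimal simulation sketch in Python (standard library only). The function names and the values of $f$, $S$, $N$ and $\alpha$ are illustrative assumptions, not taken from the paper, and the prior is coded only up to a normalising constant.

```python
import math
import random


def log_prior(alpha, S, f, S_min):
    """Unnormalised log of pi(alpha, S) = f (1 - f)^(S - S_min) alpha^(-3/2)."""
    if alpha <= 0 or S < S_min or not 0.0 < f < 1.0:
        return float("-inf")
    return math.log(f) + (S - S_min) * math.log(1.0 - f) - 1.5 * math.log(alpha)


def simulate_counts(S, N, alpha, rng):
    """Draw theta ~ Dirichlet(alpha, ..., alpha), then y ~ Multinomial(N, theta)."""
    # A symmetric Dirichlet is a vector of normalised Gamma(alpha, 1) draws.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(S)]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    # Multinomial sampling: allocate N individuals to species with weights theta.
    counts = [0] * S
    for species in rng.choices(range(S), weights=theta, k=N):
        counts[species] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(2009)
    S, N = 200, 500  # hypothetical true number of species and sample size
    for alpha in (0.05, 0.5, 5.0):
        y = simulate_counts(S, N, alpha, rng)
        seen = sum(1 for c in y if c > 0)
        print(f"alpha={alpha}: {seen} species seen, {S - seen} unseen")
```

Small values of $\alpha$ concentrate the Dirichlet mass on a few species, so a prior pushing $\alpha$ towards zero tends to leave many species unobserved for a given $N$.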

The previously accepted paper in Bayesian Analysis is a note by Ron Christensen about an inconsistent Bayes estimator that you may want to use in an advanced Bayesian class. For all practical purposes, it should not overly worry you, since the example involves a sampling distribution that is normal when its parameter is irrational and is Cauchy otherwise. (The prior is assumed to be absolutely continuous wrt the Lebesgue measure and it thus gives mass zero to the set of rational numbers $\mathbb{Q}$. The fact that $\mathbb{Q}$ is dense in $\mathbb{R}$ is irrelevant from a measure-theoretic viewpoint.)

## Predictive Bayes factors?!

Posted in Statistics on September 11, 2009 by xi'an

We (as in we, the Cosmology/Statistics ANR 2005-2009 Ecosstat grant team) are currently working on a Bayesian testing paper with applications to cosmology and my colleagues showed me a paper by Roberto Trotta that I found most intriguing in its introduction of a predictive Bayes factor. A Bayes factor, being a function of an observed $x$ or future $x^\prime$ dataset, can indeed be predicted (for the latter) in a Bayesian fashion, but I find it difficult to make sense of the corresponding distribution from an inferential perspective. Here are a few points in the paper to which I object:

• The Bayes factor associated with $x^\prime$ should be based on $x$ as well if it is to work as a genuine Bayes factor. Otherwise, the information contained in $x$ is ignored;
• While a Bayes factor eliminates the influence of the prior probabilities of the null and of the alternative hypotheses, the predictive distribution of $x^\prime$ does not:

$x^\prime | x \sim p(H_0) m_0(x,x^\prime) + p(H_a) m_a(x,x^\prime)$

• The most natural use of the predictive distribution of $B(x,x^\prime)$ would be to look at the mass above or below 1, thus to produce a sort of Bayesian predictive p-value, falling back into old tracks.
• If the current observation $x$ is not integrated in the future Bayes factor $B(x^\prime)$, it should be incorporated in the prior, the current posterior being then the future prior. In this case, the quantity of interest is not the predictive of $B(x^\prime)$ but of

$B(x,x^\prime) / B(x).$

It may be that the disappearance of $x$ from the Bayes factor stems from a fear of “using the data twice”, which is a recurring argument in the criticisms of predictive Bayes inference. I have difficulties with the concept in general and, in the present case, there is no difficulty with using $\pi(x^\prime| x)$ to predict the distribution of $B(x,x^\prime)$.
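The points above can be checked on a toy example. The following Python sketch assumes a deliberately simple setting that is not taken from Trotta's paper: single observations $x,x^\prime\sim\mathcal{N}(\mu,1)$, with $H_0:\mu=0$ against $H_a:\mu\sim\mathcal{N}(0,\tau^2)$. All function names and numerical values are hypothetical. It simulates $x^\prime$ from $\pi(x^\prime|x)$, whose mixture weights do involve $p(H_0)$, and evaluates $B(x,x^\prime)$ on the draws.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def bayes_factor_one(x, tau2):
    """B(x) = m_0(x) / m_a(x) for x ~ N(mu, 1), H0: mu = 0, Ha: mu ~ N(0, tau2)."""
    return normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 1.0 + tau2)


def bayes_factor_two(x, x_new, tau2):
    """B(x, x') = m_0(x, x') / m_a(x, x') for two conditionally iid observations."""
    m0 = normal_pdf(x, 0.0, 1.0) * normal_pdf(x_new, 0.0, 1.0)
    # Under Ha, (x, x') is bivariate normal: variances 1 + tau2, covariance tau2.
    det = 1.0 + 2.0 * tau2
    quad = ((1.0 + tau2) * (x * x + x_new * x_new) - 2.0 * tau2 * x * x_new) / det
    ma = math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))
    return m0 / ma


def predictive_draw(x, tau2, p0, rng):
    """One draw of x' from pi(x' | x), a mixture weighted by the posterior of H0."""
    w0 = p0 * normal_pdf(x, 0.0, 1.0)
    wa = (1.0 - p0) * normal_pdf(x, 0.0, 1.0 + tau2)
    if rng.random() < w0 / (w0 + wa):
        return rng.gauss(0.0, 1.0)
    shrink = tau2 / (1.0 + tau2)
    # Under Ha, x' | x ~ N(shrink * x, 1 + shrink).
    return rng.gauss(shrink * x, math.sqrt(1.0 + shrink))


if __name__ == "__main__":
    rng = random.Random(11)
    x, tau2, p0 = 1.5, 4.0, 0.5  # hypothetical observation and prior settings
    draws = [predictive_draw(x, tau2, p0, rng) for _ in range(10_000)]
    joint = [bayes_factor_two(x, xn, tau2) for xn in draws]
    print("B(x) =", bayes_factor_one(x, tau2))
    print("predictive mass of B(x, x') above 1:", sum(b > 1 for b in joint) / len(joint))
```

In this toy setting, the ratio $B(x,x^\prime)/B(x)$ coincides with the Bayes factor for $x^\prime$ computed with the posterior given $x$ as prior, which is the quantity singled out in the last bullet.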

I also am puzzled by the MCMC strategy suggested in the paper in the case of embedded hypotheses. Trotta argues in §3.1 that it is sufficient to sample from the full model and to derive the Bayes factor by the Savage-Dickey representation, but this does not really agree with the approach of Chen, Shao and Ibrahim, while I think the identity (14) is missing an extra term, namely

$\dfrac{p(d|M_0)p(\omega_\star|M_1)}{p(d|M_1)},$

which has the surprising feature of depending upon the value of the prior density at a specific value $\omega_\star$… (Details are in the reproduced pages of my notebook.) Overall, I find it most puzzling that simulating from a distribution over a set $\Theta$ provides information about a distribution that is concentrated over a subset $\Theta_0$ and that has measure zero against the initial measure. (I am actually suspicious of the Savage-Dickey representation itself, because it also uses the value of the prior and posterior densities at a given value $\omega_\star$, even though it has a very nice Gibbs interpretation/implementation…)
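For readers who have not met it, the Savage-Dickey representation states that, under suitable conditions on the priors, the Bayes factor equals the ratio of the posterior to the prior density of the tested parameter at $\omega_\star$, both computed under the larger model. The Python sketch below checks this numerically in the simplest conjugate case with no nuisance parameter, $x\sim\mathcal{N}(\omega,1)$ with $\omega_\star=0$ and $\omega\sim\mathcal{N}(0,\tau^2)$ under $M_1$. This setting and the function names are assumptions made for illustration; the check says nothing about identity (14) or about the extra term discussed above.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def bayes_factor_direct(x, tau2):
    """B_01 = p(x | M0) / p(x | M1) with M0: omega = 0 and M1: omega ~ N(0, tau2)."""
    return normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 1.0 + tau2)


def bayes_factor_savage_dickey(x, tau2, omega_star=0.0):
    """Posterior over prior density of omega at omega_star, both under M1."""
    # Conjugate update: omega | x ~ N(tau2 x / (1 + tau2), tau2 / (1 + tau2)).
    post_var = tau2 / (1.0 + tau2)
    post_mean = post_var * x
    return normal_pdf(omega_star, post_mean, post_var) / normal_pdf(omega_star, 0.0, tau2)


if __name__ == "__main__":
    for x in (0.0, 1.0, 2.5):  # hypothetical observations
        print(x, bayes_factor_direct(x, 4.0), bayes_factor_savage_dickey(x, 4.0))
```

Both columns agree here, and both visibly depend on the value of the prior density at the single point $\omega_\star$.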

## Influenza predictions

Posted in Statistics on July 1, 2009 by xi'an

Following yesterday’s [rather idle] post, Alessandra Iacobucci pointed out the sites of Research on Complex Systems at Northwestern and of GLEaM at Indiana University that propose projections on the epidemic. I have not had time so far to check for details on those projections (my talk for this morning session is not yet completed!).

I would have liked to see those maps in terms of chances of catching the flu rather than sheer numbers as they represent population sizes as much as the prevalence of the flu. The rudimentary division of the number of predicted cases by the population size would be a first step.

## Good size swans and turkeys

Posted in Books, Statistics on February 24, 2009 by xi'an

In connection with The Black Swan, Nassim Taleb wrote a small essay called The Fourth Quadrant on The Edge. I found it much more pleasant to read than the book because (a) it directly focuses on the difficulty of dealing with fat tail distributions and the prediction of extreme events, and (b) it is delivered in a much more serene tone than the book (imagine, just the single remark about the French!). The text contains insights on loss functions and inverse problems which, even though they are a bit vague, do mostly make sense. As for The Black Swan, I deplore (a) the underlying determinism of the author, who still seems to believe in an unknown (and possibly unachievable) model that would rule the phenomenon under study and (b) the lack of temporal perspective and of the possibility of modelling jumps as changepoints, i.e. model shifts. Time series have no reason to be stationary, the less so the more they depend on all kinds of exogenous factors. I actually agree with Taleb that, if there is no information about the form of the tails of the distribution corresponding to the phenomenon under study (assuming there does exist a distribution), estimating the shape of this tail from the raw data is impossible.

The essay is followed by a technical appendix that expands on fat tails, but not so deeply as to be extremely interesting. A surprising side note is that Taleb seems to associate stochastic volatility with mixtures of Gaussians. In my personal book of models, stochastic volatility is a noisy observation of the exponential of a random walk, something like $\nu_t=\exp(ax_{t-1}+b\epsilon_t)$, thus with much higher variation (and possibly no moments). To state that Student’s t distributions are more variable than stochastic volatility models is therefore unusual… There is also an analysis over a bazillion datasets of the insanity of computing kurtosis when the underlying distribution may not have even a second moment. I could not agree more: trying to summarise fat tail distributions by their first four moments does not make sense, even though it may sell well. The last part of the appendix shows the equal lack of stability of estimates of the tail index $\alpha$, which again is not a surprising phenomenon: if the tail bound $K$ is too low, it may be that the power law has not yet kicked in while, if it is too large, then we always end up with not enough data. The picture shows how the estimate varies widely with $K$ around its theoretical value for the log-normal and three Pareto distributions, based on a million simulations. (And this is under the same assumption of stationarity as above.) So I am not sure what the message is there. (As an aside, there seems to be a mistake in the tail expectation: it should be

$\dfrac{\int_K^\infty x x^{-\alpha} dx}{\int_K^\infty x^{-\alpha} dx} = \dfrac{K(\alpha-1)}{(\alpha-2)}$

if the density decreases in $x^{-\alpha}$… It is correct when $\alpha$ is the tail power of the cdf.)
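As a worked step, assuming a density proportional to $x^{-\alpha}$ beyond $K$ and $\alpha>2$, the two integrals in the ratio are elementary:

$\int_K^\infty x\,x^{-\alpha}\,dx = \dfrac{K^{2-\alpha}}{\alpha-2}, \qquad \int_K^\infty x^{-\alpha}\,dx = \dfrac{K^{1-\alpha}}{\alpha-1},$

and dividing the first by the second gives $K(\alpha-1)/(\alpha-2)$. If instead $\alpha$ is the tail power of the cdf, i.e. $P(X>x)\propto x^{-\alpha}$, the density decreases in $x^{-\alpha-1}$ and the same computation with $\alpha+1$ in place of $\alpha$ gives, for $\alpha>1$,

$\dfrac{\int_K^\infty x\,x^{-\alpha-1}\,dx}{\int_K^\infty x^{-\alpha-1}\,dx} = \dfrac{K\alpha}{\alpha-1}.$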