Archive for prediction

years (and years) of data science

Posted in Books, Statistics, Travel, University life on January 4, 2016 by xi'an

In preparation for the round table at the start of the MCMSkv conference this afternoon, Anto sent us a paper written by David Donoho for the Tukey Centennial workshop, held in Princeton last September. Entitled 50 Years of Data Science. And which attracted a whole round of comments, judging from the Google search results. So much so that I decided not to read any of them before going through the paper, hence almost certainly reproducing here, with my two cents, some of the previous comments.

“John Tukey’s definition of `Big Data’ was `anything that won’t fit on one device’.”

The complaint that data science is essentially statistics that does not dare to spell out statistics, as if it were a ten-letter word (p.5), is not new, if appropriate. In this paper, David Donoho disposes of the memes that supposedly separate data science from statistics, like “big data” (although I doubt non-statisticians would accept the quick rejection that easily, wondering at the ability of statisticians to develop big models), skills like parallel programming (which ineluctably leads to more rudimentary algorithms and inferential techniques), jobs requiring such a vast array of skills and experience that no graduate student sounds properly trained for them…

“A call to action, from a statistician who feels `the train is leaving the station’.” (p.12)

One point of the paper is to see John Tukey's 1962 piece, “The Future of Data Analysis”, as prophetic of the “Big Data” and “Data Science” crises. Which makes a lot of sense when considering the four driving forces advanced by Tukey (p.11):

  1. formal statistics
  2. advanced computing and graphical devices
  3. the ability to face ever-growing data flows
  4. its adoption by an ever-wider range of fields

“Science about data science will grow dramatically in significance.”

David Donoho then moves on to incorporate Leo Breiman's 2001 Two Cultures paper. Which separates machine learning and prediction from statistics and inference, leading to the “big chasm”! And he sees the combination of prediction with the “common task framework” as the “secret sauce” of machine learning, because of the possibility of objectively comparing methods on a testing dataset, as sketched below. Which does not seem to me to explain the current (real or perceived) disaffection for statistics and the correlated attraction for more computer-related solutions. A code that wins a Kaggle challenge clearly has some efficient characteristics, but this tells me nothing about the abilities of the methodology behind that code. If any. Self-learning how to play chess within 72 hours is great, but is the principle behind it able to handle go at the same level? Plus, I remain worried about the (screaming) absence of model (or models) in predictive approaches. Or at least skeptical. For the same reason, they do not help in producing a generic approach to problems, nor an approximation to the underlying mechanism. I thus see nothing but a black box in many “predictive models”, which tells me nothing about the uncertainty, imprecision or reproducibility of such tools. “Tool evaluation” cannot be reduced to a final score on a testing benchmark. The paper concludes with the prediction that the validation of scientific methodology will solely be empirical (p.37). This leaves little ground, if any, for probability and uncertainty quantification, as reflected by their absence from the paper.
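To make the point concrete, here is a minimal R sketch of what the common task framework reduces to, with made-up data and two arbitrary competitors (none of this is Donoho's own example): a shared train/test split, a single objective score, and nothing more.

  ## common task framework in miniature: one split, one score, one leaderboard
  set.seed(1)
  n <- 500
  d <- data.frame(x = runif(n, -2, 2))
  d$y <- sin(2 * d$x) + rnorm(n, sd = 0.3)
  train <- sample(n, 400)
  m1 <- lm(y ~ x, data = d[train, ])            # competitor 1: linear fit
  m2 <- lm(y ~ poly(x, 7), data = d[train, ])   # competitor 2: flexible polynomial
  rmse <- function(m) sqrt(mean((predict(m, d[-train, ]) - d$y[-train])^2))
  c(linear = rmse(m1), polynomial = rmse(m2))   # the final score, full stop

The ranking is perfectly objective, but it carries no statement about the uncertainty or stability of the winning method, nor about its behaviour on the next problem, which is precisely my complaint above.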

terrible graph of the day

Posted in Books, Kids, R, Statistics on May 12, 2015 by xi'an

A truly terrible graph in Le Monde about overweight and obesity in the EU countries (and Switzerland). The circular presentation makes no logical sense. Countries are ordered by their 2030 overweight percentages, which implies the order differs for men and women. (With a neat sexist differentiation between male and female figures.) The allocation of the (2010) grey bar to its country is unclear (left or right?). And there is no uncertainty associated with the 2030 predictions. No message comes out of the graph, not even the massive explosion in the obesity and overweight percentages across EU countries. Now, given that the data is available for women and men, ‘Og’s readers should feel free to send me alternative representations!
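As a starting point, a sketch of one alternative in R with ggplot2, where countries sit on a common axis ordered once and for all, and the 2010 and 2030 figures appear as two points per country. The numbers below are placeholders of my own, to be replaced with the actual WHO/Le Monde figures:

  ## dot plot: one country ordering for both sexes, both years visible
  library(ggplot2)
  df <- expand.grid(country = c("France", "Germany", "Greece", "Ireland"),
                    sex = c("women", "men"), year = c("2010", "2030"))
  df$overweight <- c(45, 54, 57, 60, 55, 62, 65, 70,   # illustrative values only
                     48, 58, 61, 66, 60, 67, 71, 76)
  ggplot(df, aes(x = overweight, y = country, colour = year)) +
    geom_point(size = 3) +
    facet_wrap(~ sex) +
    labs(x = "overweight and obesity (%)", y = NULL)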

Estimating the number of species

Posted in Statistics on November 20, 2009 by xi'an

Bayesian Analysis just published on-line a paper by Hongmei Zhang and Hal Stern on a (new) Bayesian analysis of the problem of estimating the number of unseen species within a population. This problem has always fascinated me, as it seems at first sight to be an impossible one: how can you estimate the number of species you do not know about?! The approach relates to capture-recapture models, with an extra hierarchical layer for the species. The Bayesian analysis of the model obviously makes a lot of sense, with the prior modelling being quite influential. Zhang and Stern use a hierarchical Dirichlet prior on the capture probabilities, \theta_i, when the captures follow a multinomial model

y|\theta,S \sim \mathcal{M}(N, \theta_1,\ldots,\theta_S)

where N=\sum_i y_i is the total number of observed individuals,

\mathbf{\theta}|S \sim \mathcal{D}(\alpha,\ldots,\alpha)

and

\pi(\alpha,S) = f(1-f)^{S-S_\text{min}} \alpha^{-3/2}

forcing the coefficients of the Dirichlet prior towards zero. The paper also covers predictive design, analysing the capture effort corresponding to a given recovery rate of species. The overall approach is not immensely innovative in its methodology, the MCMC part being rather straightforward, but the predictive abilities of the model are nonetheless interesting.
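As a minimal illustration of why the problem is not hopeless, here is a prior-predictive sketch of the capture model in R, with S, \alpha and the capture effort N fixed at assumed values (rather than drawn from the above prior, which is improper in \alpha): small values of \alpha concentrate the capture probabilities on a few species, so a sizeable fraction of the S species goes unseen at moderate effort.

  ## simulate captures and count unseen species at several capture efforts
  set.seed(42)
  S <- 50                        # true number of species (assumed)
  alpha <- 0.3                   # small Dirichlet coefficient, forcing sparsity
  g <- rgamma(S, alpha, 1)
  theta <- g / sum(g)            # one Dirichlet(alpha, ..., alpha) draw
  unseen <- function(N) sum(rmultinom(1, N, theta) == 0)
  sapply(c(200, 1000, 5000), unseen)   # unseen species shrink with effort

The last line mirrors the predictive-design question in the paper, namely which capture effort achieves a given recovery rate of species.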

The previously accepted paper in Bayesian Analysis is a note by Ron Christensen about an inconsistent Bayes estimator that you may want to use in an advanced Bayesian class. For all practical purposes, it should not overly worry you, since the example involves a sampling distribution that is normal when its parameter is irrational and is Cauchy otherwise. (The prior is assumed to be absolutely continuous wrt the Lebesgue measure and it thus gives mass zero to the set of rational numbers \mathbb{Q}. The fact that \mathbb{Q} is dense in \mathbb{R} is irrelevant from a measure-theoretic viewpoint.)

Predictive Bayes factors?!

Posted in Statistics on September 11, 2009 by xi'an

[scans of notebook pages 53 and 54]

We (as in we, the Cosmology/Statistics ANR 2005-2009 Ecosstat grant team) are currently working on a Bayesian testing paper with applications to cosmology and my colleagues showed me a paper by Roberto Trotta that I found most intriguing in its introduction of a predictive Bayes factor. A Bayes factor, being a function of an observed dataset x or of a future dataset x^\prime, can indeed be predicted (for the latter) in a Bayesian fashion, but I find it difficult to make sense of the corresponding distribution from an inferential perspective. Here are a few points in the paper to which I object:

  • The Bayes factor associated with x^\prime should be based on x as well if it is to work as a genuine Bayes factor. Otherwise, the information contained in x is ignored;
  • While a Bayes factor eliminates the influence of the prior probabilities of the null and of the alternative hypotheses, the predictive distribution of x^\prime does not:

x^\prime | x \sim p(H_0) m_0(x,x^\prime) + p(H_a) m_a(x,x^\prime)

  • The most natural use of the predictive distribution of B(x,x^\prime) would be to look at the mass above or below 1, thus to produce a sort of Bayesian predictive p-value, falling back into old tracks.
  • If the current observation x is not integrated in the future Bayes factor B(x^\prime), it should be incorporated in the prior, the current posterior being then the future prior. In this case, the quantity of interest is not the predictive of B(x^\prime) but of

B(x,x^\prime) / B(x).

It may be that the disappearance of x from the Bayes factor stems from a fear of “using the data twice“, which is a recurring argument in the criticisms of predictive Bayes inference. I have difficulties with the concept in general and, in the present case, there is no difficulty with using \pi(x^\prime| x) to predict the distribution of B(x,x^\prime).
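To make the object concrete, here is a small R sketch of the predictive distribution of B(x,x^\prime) in a setting entirely of my own choosing (not Trotta's cosmological one): testing H_0: \mu=0 against \mu\sim\mathcal{N}(0,1) for unit-variance normal data, with equal prior weights on both hypotheses, the future sample being drawn from the mixture predictive given above.

  ## predictive distribution of B(x,x') for a normal mean test
  set.seed(101)
  n <- 20; m <- 20
  x <- rnorm(n, mean = 0.3)                # observed sample (illustrative)
  bf <- function(xbar, k)                  # B01 from the sufficient statistic
    dnorm(xbar, 0, sqrt(1 / k)) / dnorm(xbar, 0, sqrt(1 + 1 / k))
  B_x <- bf(mean(x), n)
  p0 <- B_x / (1 + B_x)                    # P(H0 | x) under equal prior weights
  Bxx <- replicate(1e4, {
    if (runif(1) < p0) xp <- rnorm(m)      # future data under H0: mu = 0
    else {                                 # under H1, mu from its posterior
      mu <- rnorm(1, n * mean(x) / (n + 1), sqrt(1 / (n + 1)))
      xp <- rnorm(m, mu)
    }
    bf((n * mean(x) + sum(xp)) / (n + m), n + m)
  })
  mean(Bxx > 1)                            # mass of the predictive above one

The last line is exactly the sort of Bayesian predictive p-value alluded to in the third point above, which shows how quickly the construct falls back onto old tracks.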

I am also puzzled by the MCMC strategy suggested in the paper in the case of embedded hypotheses. Trotta argues in §3.1 that it is sufficient to sample from the full model and to derive the Bayes factor by the Savage-Dickey representation, but this does not really agree with the approach of Chen, Shao and Ibrahim, and I think the identity (14) is missing an extra term, namely

\dfrac{p(d|M_0)p(\omega_\star|M_1)}{p(d|M_1)},

which has the surprising feature of depending upon the value of the prior density at a specific value \omega_\star… (Details are in the reproduced pages of my notebook, above.) Overall, I find it most puzzling that simulating from a distribution over a set \Theta provides information about a distribution that is concentrated over a subset \Theta_0 and that has measure zero against the initial measure. (I am actually suspicious of the Savage-Dickey representation itself, because it also uses the value of the prior and posterior densities at a given value \omega_\star, even though it has a very nice Gibbs interpretation/implementation…)
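For the record, the Savage-Dickey representation in its simplest form, sketched below in R on a normal mean problem of my own choosing: B_{01} is the ratio of the posterior to the prior density of the parameter at the tested value, and the sketch makes explicit that everything hinges on estimating a density at a single point from the simulation output.

  ## Savage-Dickey ratio for H0: mu = 0, data N(mu,1), prior mu ~ N(0,1)
  set.seed(3)
  n <- 20
  x <- rnorm(n, 0.3)                        # illustrative sample
  ## exact posterior draws under the full model (standing in for MCMC output)
  mu <- rnorm(1e5, n * mean(x) / (n + 1), sqrt(1 / (n + 1)))
  d <- density(mu)                          # kernel estimate of pi(mu | x)
  post0 <- approx(d$x, d$y, xout = 0)$y     # estimated posterior density at 0
  post0 / dnorm(0)                          # Savage-Dickey estimate of B01
  ## exact Bayes factor, for comparison
  dnorm(mean(x), 0, sqrt(1 / n)) / dnorm(mean(x), 0, sqrt(1 + 1 / n))

The kernel step is where my unease lies: the whole Bayes factor rests on the quality of a pointwise density estimate at \omega_\star.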

Influenza predictions

Posted in Statistics on July 1, 2009 by xi'an

Following yesterday’s [rather idle] post, Alessandra Iacobucci pointed out the sites of Research on Complex Systems at Northwestern and of GLEaM at Indiana University, which propose projections of the epidemic. I have not had time so far to check for details on those projections (my talk for this morning’s session is not yet completed!).

I would have liked to see those maps expressed in terms of the chances of catching the flu rather than in sheer numbers, as the latter reflect population sizes as much as the prevalence of the flu. The rudimentary division of the number of predicted cases by the population size would be a first step, as sketched below.
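A one-step sketch in that direction in R, with made-up numbers standing in for the actual projections:

  ## per-capita rates from predicted case counts (placeholder figures)
  flu <- data.frame(country = c("France", "Spain", "UK"),
                    cases   = c(12000, 9000, 15000),   # hypothetical predictions
                    pop     = c(64e6, 46e6, 61e6))
  flu$rate_per_100k <- 1e5 * flu$cases / flu$pop       # closer to a "chance" scale
  flu[order(-flu$rate_per_100k), ]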

Good size swans and turkeys

Posted in Books, Statistics on February 24, 2009 by xi'an

In connection with The Black Swan, Nassim Taleb wrote a small essay called The Fourth Quadrant on The Edge. I found it much more pleasant to read than the book because (a) it directly focuses on the difficulty of dealing with fat-tailed distributions and the prediction of extreme events, and (b) it is delivered in a much more serene tone than the book (imagine, just a single remark about the French!). The text contains insights on loss functions and inverse problems which, even though they are a bit vague, do mostly make sense. As for The Black Swan, I deplore (a) the underlying determinism of the author, who still seems to believe in an unknown (and possibly unachievable) model that would rule the phenomenon under study, and (b) the lack of temporal perspective and of the possibility of modelling jumps as changepoints, i.e. model shifts. Time series have no reason to be stationary, all the less so when they depend on all kinds of exogenous factors. I actually agree with Taleb that, if there is no information about the form of the tails of the distribution corresponding to the phenomenon under study (assuming there does exist a distribution), estimating the shape of this tail from the raw data is impossible.

The essay is followed by a technical appendix that expands on fat tails, but not so deeply as to be extremely interesting. A surprising side note is that Taleb seems to associate stochastic volatility with mixtures of Gaussians. In my personal book of models, stochastic volatility is a noisy observation of the exponential of a random walk, something like

\nu_t = \exp(a x_{t-1} + b\epsilon_t),

thus with much higher variation (and possibly no moments). To state that Student's t distributions are more variable than stochastic volatility models is therefore unusual… There is also an analysis, over a bizillion datasets, of the insanity of computing kurtosis when the underlying distribution may not even have a second moment. I could not agree more: trying to summarise fat-tailed distributions by their first four moments does not make sense, even though it may sell well. The last part of the appendix shows the equal lack of stability of estimates of the tail index \alpha, which again is not a surprising phenomenon: if the tail bound K is too low, it may be that the power law has not yet kicked in while, if it is too large, we always end up with not enough data. The picture shows how the estimate widely varies with K around its theoretical value for the log-normal and three Pareto distributions, based on a million simulations. (And this is under the same assumption of stationarity as above.) So I am not sure what the message is there. (As an aside, there seems to be a mistake in the tail expectation: it should be

\dfrac{\int_K^\infty x x^{-\alpha} dx}{\int_K^\infty x^{-\alpha} dx} = \dfrac{K(\alpha-1)}{(\alpha-2)}

if the density decreases in \alpha… It is correct when \alpha is the tail power of the cdf.)

[figure: tail power estimate as a function of the threshold K]
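To see the instability for oneself, a quick sketch (my own, not Taleb's computation) of the Hill tail-index estimator in R, as the number k of upper order statistics, i.e. the implicit threshold K, varies, on simulated Pareto and log-normal samples:

  ## Hill estimator: k largest order statistics against the (k+1)-th
  set.seed(7)
  hill <- function(x, k) {
    x <- sort(x, decreasing = TRUE)
    1 / mean(log(x[1:k] / x[k + 1]))
  }
  n <- 1e6
  pareto <- runif(n)^(-1/3)                # exact power tail, alpha = 3
  lnorm  <- rlnorm(n)                      # log-normal: no power tail at all
  ks <- c(100, 1000, 10000, 100000)
  rbind(pareto    = sapply(ks, function(k) hill(pareto, k)),
        lognormal = sapply(ks, function(k) hill(lnorm, k)))

On the Pareto sample the estimate hovers around its true value of 3, while on the log-normal sample it drifts steadily with the threshold, which is the very phenomenon displayed in Taleb's picture.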
