Archive for prediction

IMS workshop [day 2]

Posted in pictures, Statistics, Travel on August 29, 2018 by xi'an

Here are the slides of my talk today on using Wasserstein distances as an intrinsic distance measure in ABC, as developed in our papers with Espen Bernton, Pierre Jacob, and Mathieu Gerber:

This morning, Gael Martin discussed the surprising aspects of ABC prediction, expanding upon her talk at ISBA, with several threads very much worth weaving into the ABC tapestry, one being that summary statistics need to be used to increase the efficiency of the prediction, as well as better-adapted measures of distance. Her talk also led me to ponder the myriad of possibilities available or not in the most generic of ABC predictions (which is not the framework of Gael’s talk). If we imagine a highly intractable setting, it may be that the marginal generation of a predicted value at time t+1 requires the generation of the entire past from time 1 till time t. Possibly because of a massive dependence on latent variables. And the absence of particle filters. If this makes any sense. Therefore, based on a generated parameter value θ, it may be that the entire series needs to be simulated to reach the last value in the series. Even when unnecessary, this may be an alternative to conditioning upon the actual series. In this latter case, comparing both predictions may act as a natural measure of distance since one prediction is a function or statistic of the actual data while the other is a function of the simulated data. Another direction I mused about is the use of (handy) auxiliary models, each producing a prediction as a new statistic, which could then be merged and weighted (or even selected) by a random forest procedure. Again, if the auxiliary models are relatively well-behaved, timewise, this would be quite straightforward to implement.

ABC forecasts

Posted in Books, pictures, Statistics on January 9, 2018 by xi'an

My friends and co-authors David Frazier, Gael Martin, Brendan McCabe, and Worapree Maneesoonthorn arXived a paper on ABC forecasting at the turn of the year. ABC prediction is a natural extension of ABC inference in that, provided the full conditional of a future observation given past data and parameters is available but the posterior is not, ABC simulations of the parameters induce an approximation of the predictive. The paper thus considers the impact of this extension on the precision of the predictions. And argues that it is possible that this approximation is preferable to running MCMC in some settings. A first interesting result is that using ABC and hence conditioning on an insufficient summary statistic has no asymptotic impact on the resulting prediction, provided Bayesian concentration of the corresponding posterior takes place as in our convergence paper under revision.
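To make the mechanism concrete, here is a minimal sketch of ABC forecasting on a toy example. Everything in it is an assumption chosen for illustration rather than taken from the paper: a Gaussian AR(1) model with unit innovation variance, a uniform prior on the autoregressive coefficient, the lag-1 autocorrelation as (insufficient) summary statistic, and the hypothetical function names `simulate_ar1`, `lag1_autocorr` and `abc_forecast`. The parameter draws come from a plain ABC rejection step, while the one-step-ahead draws use the exact conditional of the next observation given the last observed value.

```python
import random


def simulate_ar1(rho, n, rng):
    """Simulate n values of y_t = rho * y_{t-1} + e_t, e_t ~ N(0, 1), started at 0."""
    y, prev = [], 0.0
    for _ in range(n):
        prev = rho * prev + rng.gauss(0.0, 1.0)
        y.append(prev)
    return y


def lag1_autocorr(y):
    """Summary statistic eta(y): lag-1 sample autocorrelation."""
    mean = sum(y) / len(y)
    num = sum((a - mean) * (b - mean) for a, b in zip(y[:-1], y[1:]))
    den = sum((a - mean) ** 2 for a in y)
    return num / den


def abc_forecast(y_obs, n_sims=2000, keep=100, seed=1):
    """ABC rejection on rho, then one-step-ahead predictive draws."""
    rng = random.Random(seed)
    eta_obs = lag1_autocorr(y_obs)
    draws = []
    for _ in range(n_sims):
        rho = rng.uniform(-1.0, 1.0)            # draw from the (assumed) prior
        z = simulate_ar1(rho, len(y_obs), rng)  # pseudo-data of the same length
        draws.append((abs(lag1_autocorr(z) - eta_obs), rho))
    draws.sort()                                # smallest summary distances first
    accepted = [rho for _, rho in draws[:keep]]
    # Exact conditional p(y_{T+1} | y_T, rho) = N(rho * y_T, 1), using the observed y_T.
    preds = [rng.gauss(rho * y_obs[-1], 1.0) for rho in accepted]
    return accepted, preds


y_obs = simulate_ar1(0.7, 200, random.Random(0))
accepted, preds = abc_forecast(y_obs)
print(sum(accepted) / len(accepted), sum(preds) / len(preds))
```

The accepted values of rho approximate the ABC posterior, and `preds` is the induced approximation of the predictive. Swapping the summary for another one and comparing the resulting `preds` is a cheap way to check the stability of the forecast over replicates.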

“…conditioning inference about θ on η(y) rather than y makes no difference to the probabilistic statements made about [future observations]”

The above result holds both in terms of convergence in total variation and for proper scoring rules. Even though there is always a loss in accuracy in using ABC. Now, one may think this is a direct consequence of our (and others’) earlier convergence results, but numerical experiments on standard time series show the distinct feature that, while the [MCMC] posterior and ABC posterior distributions on the parameters clearly differ, the predictives are more or less identical! With a potential speed gain in using ABC, although comparing parallel ABC versus non-parallel MCMC is rather delicate. For instance, a preliminary parallel ABC could be run as a burn-in step for parallel MCMC, since all chains would then be roughly in the stationary regime. Another interesting outcome of these experiments is a case when the summary statistic produces a non-consistent ABC posterior, but still leads to a very similar predictive, as shown on this graph. This unexpected accuracy in prediction may further be exploited in state space models, towards producing particle algorithms that are greatly accelerated. Of course, an easy objection to this acceleration is that the impact of the approximation is unknown and un-assessed. However, such an acceleration leaves room for multiple implementations, possibly with different sets of summaries, to check for consistency over replicates.

machine learning and the future of realism

Posted in Books, Kids, Statistics, University life on May 4, 2017 by xi'an

Giles and Cliff Hooker arXived a paper last week with this intriguing title. (Giles Hooker is an associate professor of statistics and biology at Cornell U, with an interesting blog on the notion of models, while Cliff Hooker is a professor of philosophy at Newcastle U, Australia.)

“Our conclusion is that simplicity is too complex”

The debate in this short paper is whether or not machine learning relates to a model. Or is it concerned with sheer (“naked”) prediction? And then does it pertain to science any longer?! While it sounds obvious at first, defining why science is more than prediction of effects given causes is much less obvious, although prediction sounds more pragmatic and engineer-like than scientific. (Furthermore, prediction has a somewhat negative flavour in French, being used as a synonym of divination and opposed to prévision.) In more philosophical terms, prediction offers no ontological feature. As for a machine learning structure like a neural network being scientific or a-scientific, its black box nature makes it much more the latter than the former, in that it brings no explanation for the connection between input and output, between regressed and regressors. It further lacks the potential for universality of scientific models. For instance, as mentioned in the paper, Newton’s law of gravitation applies to any pair of massive bodies, while a neural network built on a series of observations could not be assessed or guaranteed outside the domain where those observations are taken. Plus, it would miss the simple inverse-square law established by Newton. Most fascinating questions, undoubtedly! Putting the stress on models from a totally different perspective from last week at the RSS.

As for machine learning being a challenge to realism, I am none the wiser after reading the paper. Utilising machine learning tools to produce predictions of causes given effects does not seem to modify the structure of the World and very little our understanding of it, since they do not bring explanation per se. What would lead to anti-realism is the adoption of those tools as substitutes for scientific theories and models.

years (and years) of data science

Posted in Books, Statistics, Travel, University life on January 4, 2016 by xi'an

In preparation for the round table at the start of the MCMSkv conference, this afternoon, Anto sent us a paper written by David Donoho for the Tukey Centennial workshop, held in Princeton last September. Entitled 50 years of Data Science. And which attracted a whole round of comments, judging from the Google search results. So much that I decided not to read any of them before parsing through the paper. But almost certainly reproducing here with my two cents some of the previous comments.

“John Tukey’s definition of `Big Data’ was `anything that won’t fit on one device’.”

The complaint that data science is essentially statistics that does not dare to spell out statistics as if it were a ten letter word (p.5) is not new, if appropriate. In this paper, David Donoho evacuates the memes that supposedly separate data science from statistics, like “big data” (although I doubt non-statisticians would accept the quick rejection that easily, wondering at the ability of statisticians to develop big models), skills like parallel programming (which ineluctably leads to more rudimentary algorithms and inferential techniques), jobs requiring such a vast array of skills and experience that no graduate student sounds properly trained for it…

“A call to action, from a statistician who feels `the train is leaving the station’.” (p.12)

One point of the paper is to see John Tukey’s 1962 “The Future of Data Analysis” as prophetic of the “Big Data” and “Data Science” crises. Which makes a lot of sense when considering the four driving forces advanced by Tukey (p.11):

  1. formal statistics
  2. advanced computing and graphical devices
  3. the ability to face ever-growing data flows
  4. its adoption by an ever-wider range of fields

“Science about data science will grow dramatically in significance.”

David Donoho then moves on to incorporate Leo Breiman’s 2001 Two Cultures paper. Which separates machine learning and prediction from statistics and inference, leading to the “big chasm”! And he sees the combination of prediction with the “common task framework” as the “secret sauce” of machine learning, because of the possibility of objective comparison of methods on a testing dataset. Which does not seem to me to be the explanation for the current (real or perceived) disaffection for statistics and correlated attraction for more computer-related solutions. A code that wins a Kaggle challenge clearly has some efficient characteristics, but this tells me nothing of the abilities of the methodology behind that code. If any. Self-learning how to play chess within 72 hours is great, but is the principle behind it able to handle go at the same level? Plus, I remain worried about the (screaming) absence of model (or models) in predictive approaches. Or at least skeptical. For the same reason it does not help in producing a generic approach to problems. Nor an approximation to the underlying mechanism. I thus see nothing but a black box in many “predictive models”, which tells me nothing about the uncertainty, imprecision or reproducibility of such tools. “Tool evaluation” cannot be reduced to a final score on a testing benchmark. The paper concludes with the prediction that the validation of scientific methodology will solely be empirical (p.37). This leaves little ground if any for probability and uncertainty quantification, as reflected by their absence in the paper.

terrible graph of the day

Posted in Books, Kids, R, Statistics on May 12, 2015 by xi'an

A truly terrible graph in Le Monde about overweight and obesity in the EU countries (and Switzerland). The circle presentation makes no logical sense. Countries are ordered by 2030 overweight percentages, which implies the order differs for men and women. (With a neat sexist differentiation between male and female figures.) The allocation of the (2010) grey bar to its country is unclear (left or right?). And there is no uncertainty associated with the 2030 predictions. There is no message coming out of the graph, like the massive explosion in the obesity and overweight percentages in EU countries. Now, given that the data is available for women and men, ‘Og’s readers should feel free to send me alternative representations!

Estimating the number of species

Posted in Statistics on November 20, 2009 by xi'an

Bayesian Analysis just published on-line a paper by Hongmei Zhang and Hal Stern on a (new) Bayesian analysis of the problem of estimating the number of unseen species within a population. This problem has always fascinated me, as it seems at first sight to be an impossible problem: how can you estimate the number of species you do not know?! The approach relates to capture-recapture models, with an extra hierarchical layer for the species. The Bayesian analysis of the model obviously makes a lot of sense, with the prior modelling being quite influential. Zhang and Stern use a hierarchical Dirichlet prior on the capture probabilities, \theta_i, when the captures follow a multinomial model

y|\theta,S \sim \mathcal{M}(N, \theta_1,\ldots,\theta_S)

where N=\sum_i y_i is the total number of observed individuals,

\mathbf{\theta}|S \sim \mathcal{D}(\alpha,\ldots,\alpha)

and

\pi(\alpha,S) = f(1-f)^{S-S_\text{min}} \alpha^{-3/2}

forcing the coefficients of the Dirichlet prior towards zero. The paper also covers predictive design, analysing the capture effort corresponding to a given recovery rate of species. The overall approach is not immensely innovative in its methodology, the MCMC part being rather straightforward, but the predictive abilities of the model are nonetheless interesting.
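As a rough illustration of why small Dirichlet coefficients matter here, the following sketch simulates the two lower layers of the model, \theta|S \sim \mathcal{D}(\alpha,\ldots,\alpha) followed by N multinomial captures, and counts how many of the S species are actually observed. It is only a forward simulation with S and \alpha fixed by hand (neither is drawn from the hyperprior, and no MCMC is involved), and the function names `simulate_captures` and `observed_species` are hypothetical.

```python
import random


def simulate_captures(S, alpha, N, seed=0):
    """Draw theta ~ Dirichlet(alpha, ..., alpha) over S species, then N captures."""
    rng = random.Random(seed)
    # A Dirichlet draw is a vector of independent Gamma(alpha, 1) variates, normalised.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(S)]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    counts = [0] * S
    # Multinomial(N, theta) obtained as N categorical draws.
    for species in rng.choices(range(S), weights=theta, k=N):
        counts[species] += 1
    return counts


def observed_species(counts):
    """Number of species captured at least once."""
    return sum(1 for c in counts if c > 0)


for alpha in (0.1, 5.0):
    counts = simulate_captures(S=50, alpha=alpha, N=200)
    print(alpha, observed_species(counts), "species seen out of 50")
```

With a small \alpha the capture probabilities are very uneven, so a few species absorb most of the N captures and many remain unseen, which is the situation where the number of observed species says little about S without prior input.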

The previously accepted paper in Bayesian Analysis is a note by Ron Christensen about an inconsistent Bayes estimator that you may want to use in an advanced Bayesian class. For all practical purposes, it should not overly worry you, since the example involves a sampling distribution that is normal when its parameter is irrational and is Cauchy otherwise. (The prior is assumed to be absolutely continuous wrt the Lebesgue measure and it thus gives mass zero to the set of rational numbers \mathbb{Q}. The fact that \mathbb{Q} is dense in \mathbb{R} is irrelevant from a measure-theoretic viewpoint.)

Predictive Bayes factors?!

Posted in Statistics on September 11, 2009 by xi'an

[Images: notebook pages 53 and 54, bloc 5]

We (as in we, the Cosmology/Statistics ANR 2005-2009 Ecosstat grant team) are currently working on a Bayesian testing paper with applications to cosmology and my colleagues showed me a paper by Roberto Trotta that I found most intriguing in its introduction of a predictive Bayes factor. A Bayes factor being a function of an observed x or future x^\prime dataset can indeed be predicted (for the latter) in a Bayesian fashion but I find it difficult to make sense of the corresponding distribution from an inferential perspective. Here are a few points in the paper to which I object:

  • The Bayes factor associated with x^\prime should be based on x as well if it is to work as a genuine Bayes factor. Otherwise, the information contained in x is ignored;
  • While a Bayes factor eliminates the influence of the prior probabilities of the null and of the alternative hypotheses, the predictive distribution of x^\prime does not:

x^\prime | x \sim p(H_0) m_0(x,x^\prime) + p(H_a) m_a(x,x^\prime)

  • The most natural use of the predictive distribution of B(x,x^\prime) would be to look at the mass above or below 1, thus to produce a sort of Bayesian predictive p-value, falling back into old tracks.
  • If the current observation x is not integrated in the future Bayes factor B(x^\prime), it should be incorporated in the prior, the current posterior being then the future prior. In this case, the quantity of interest is not the predictive of B(x^\prime) but of

B(x,x^\prime) / B(x).
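To spell out the second and last points, here is a short derivation under assumed notation (not necessarily the paper's): B denotes the ratio of the marginal likelihood under H_0 to the one under H_a, and m_i(x^\prime|x) stands for m_i(x,x^\prime)/m_i(x).

```latex
\pi(x^\prime|x) = \frac{p(H_0)\, m_0(x,x^\prime) + p(H_a)\, m_a(x,x^\prime)}{p(H_0)\, m_0(x) + p(H_a)\, m_a(x)}

\frac{B(x,x^\prime)}{B(x)} = \frac{m_0(x,x^\prime)/m_a(x,x^\prime)}{m_0(x)/m_a(x)} = \frac{m_0(x^\prime|x)}{m_a(x^\prime|x)}
```

The first line is the normalised version of the predictive and makes explicit that the prior weights p(H_0) and p(H_a) do not cancel. The second shows that the ratio is the Bayes factor for x^\prime once the posterior given x is used as the prior under each hypothesis.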

It may be that the disappearance of x from the Bayes factor stems from a fear of “using the data twice”, which is a recurring argument in the criticisms of predictive Bayes inference. I have difficulties with the concept in general and, in the present case, there is no difficulty with using \pi(x^\prime| x) to predict the distribution of B(x,x^\prime).

I also am puzzled by the MCMC strategy suggested in the paper in the case of embedded hypotheses. Trotta argues in §3.1 that it is sufficient to sample from the full model and to derive the Bayes factor by the Savage-Dickey representation, but this does not really agree with the approach of Chen, Shao and Ibrahim, while I think the identity (14) is missing an extra term, namely


which has the surprising feature of depending upon the value of the prior density at a specific value \omega_\star… (Details are in the reproduced pages of my notebook, above.) Overall, I find it most puzzling that simulating from a distribution over a set \Theta provides information about a distribution that is concentrated over a subset \Theta_0 and that has measure zero against the initial measure. (I am actually suspicious of the Savage-Dickey representation itself, because it also uses the value of the prior and posterior densities at a given value \omega_\star, even though it has a very nice Gibbs interpretation/implementation…)
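For readers who have not met the Savage-Dickey representation, here is a minimal sketch on a toy conjugate case, assumed purely for illustration and unrelated to the cosmological models of the paper: n observations from N(\theta,1), a point null \theta=0, and a N(0,\tau^2) prior under the alternative, with no nuisance parameter. The function names are hypothetical. In this setting the ratio of posterior to prior density at the null value can be compared with the Bayes factor computed directly from the marginal densities of the sample mean.

```python
import math


def normal_pdf(z, mean, var):
    """Density of N(mean, var) evaluated at z."""
    return math.exp(-(z - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def savage_dickey_b01(xbar, n, tau2):
    """Posterior density at theta = 0 divided by prior density at theta = 0."""
    post_var = tau2 / (1.0 + n * tau2)   # conjugate normal update
    post_mean = n * xbar * post_var
    return normal_pdf(0.0, post_mean, post_var) / normal_pdf(0.0, 0.0, tau2)


def direct_b01(xbar, n, tau2):
    """Ratio of the marginal densities of the sample mean under H_0 and H_a."""
    return normal_pdf(xbar, 0.0, 1.0 / n) / normal_pdf(xbar, 0.0, tau2 + 1.0 / n)


print(savage_dickey_b01(0.3, 20, 4.0), direct_b01(0.3, 20, 4.0))
```

Both functions return the same value in this toy case, which only shows how the representation is used; it does not address the reservation above about relying on the value of densities at a single point \omega_\star.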