## how individualistic should statistics be?

Posted in Books, pictures, Statistics with tags , , , , , , , , , , , on November 5, 2015 by xi'an

Keli Liu and Xiao-Li Meng completed a paper on the very nature of inference, to appear in The Annual Review of Statistics and Its Application. This paper or chapter is addressing a fundamental (and foundational) question on drawing inference based a sample on a new observation. That is, in making prediction. To what extent should the characteristics of the sample used for that prediction resemble those of the future observation? In his 1921 book, A Treatise on Probability, Keynes thought this similarity (or individualisation) should be pushed to its extreme, which led him to somewhat conclude on the impossibility of statistics and never to return to the field again. Certainly missing the incoming possibility of comparing models and selecting variables. And not building so much on the “all models are wrong” tenet. On the contrary, classical statistics use the entire data available and the associated model to run the prediction, including Bayesian statistics, although it is less clear how to distinguish between data and control there. Liu & Meng debate about the possibility of creating controls from the data alone. Or “alone” as the model behind always plays a capital role.

“Bayes and Frequentism are two ends of the same spectrum—a spectrum defined in terms of relevance and robustness. The nominal contrast between them (…) is a red herring.”

The paper makes for an exhilarating if definitely challenging read. With a highly witty writing style. If only because the perspective is unusual, to say the least!, and requires constant mental contortions to frame the assertions into more traditional terms.  For instance, I first thought that Bayesian procedures were in agreement with the ultimate conditioning approach, since it conditions on the observables and nothing else (except for the model!). Upon reflection, I am not so convinced that there is such a difference with the frequentist approach in the (specific) sense that they both take advantage of the entire dataset. Either from the predictive or from the plug-in distribution. It all boils down to how one defines “control”.

“Probability and randomness, so tightly yoked in our minds, are in fact distinct concepts (…) at the end of the day, probability is essentially a tool for bookkeeping, just like the abacus.”

Some sentences from the paper made me think of ABC, even though I am not trying to bring everything back to ABC!, as drawing controls is the nature of the ABC game. ABC draws samples or control from the prior predictive and only keeps those for which the relevant aspects (or the summary statistics) agree with those of the observed data. Which opens similar questions about the validity and precision of the resulting inference, as well as the loss of information due to the projection over the summary statistics. While ABC is not mentioned in the paper, it can be used as a benchmark to walk through it.

“In the words of Jack Kiefer, we need to distinguish those problems with luck data’ from those with unlucky data’.”

I liked very much recalling discussions we had with George Casella and Costas Goutis in Cornell about frequentist conditional inference, with the memory of Jack Kiefer still lingering around. However, I am not so excited about the processing of models here since, from what I understand in the paper (!), the probabilistic model behind the statistical analysis must be used to some extent in producing the control case and thus cannot be truly assessed with a critical eye. For instance, of which use is the mean square error when the model behind is unable to produce the observed data? In particular, the variability of this mean squared error is directly driven by this model. Similarly the notion of ancillaries is completely model-dependent. In the classification diagrams opposing robustness to relevance, all methods included therein are parametric. While non-parametric types of inference could provide a reference or a calibration ruler, at the very least.

Also, by continuously and maybe a wee bit heavily referring to the doctor-and-patient analogy, the paper is somewhat confusing as to which parts are analogy and which parts are methodology and to which type of statistical problem is covered by the discussion (sometimes it feels like all problems and sometimes like medical trials).

“The need to deliver individualized assessments of uncertainty are more pressing than ever.”

A final question leads us to an infinite regress: if the statistician needs to turn to individualized inference, at which level of individuality should the statistician be assessed? And who is going to provide the controls then? In any case, this challenging paper is definitely worth reading by (only mature?) statisticians to ponder about the nature of the game!

## anti-séche

Posted in Kids, pictures, University life with tags , , , , on December 21, 2014 by xi'an

## estimating a constant (not really)

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on October 12, 2012 by xi'an

Larry Wasserman wrote a blog entry on the normalizing constant paradox, where he repeats that he does not understand my earlier point…Let me try to recap here this point and the various comments I made on StackExchange (while keeping in mind all this is for intellectual fun!)

The entry is somehow paradoxical in that Larry acknowledges (in that post) that the analysis in his book, All of Statistics, is wrong. The fact that “g(x)/c is a valid density only for one value of c” (and hence cannot lead to a notion of likelihood on c) is the very reason why I stated that there can be no statistical inference nor prior distribution about c: a sample from f does not bring statistical information about c and there can be no statistical estimate of c based on this sample. (In case you did not notice, I insist upon statistical!)

To me this problem is completely different from a statistical problem, at least in the modern sense: if I need to approximate the constant c—as I do in fact when computing Bayes factors—, I can produce an arbitrarily long sample from a certain importance distribution and derive a converging (and sometimes unbiased) approximation of c. Once again, this is Monte Carlo integration, a numerical technique based on the Law of Large Numbers and the stabilisation of frequencies. (Call it a frequentist method if you wish. I completely agree that MCMC methods are inherently frequentist in that sense, And see no problem with this because they are not statistical methods. Of course, this may be the core of the disagreement with Larry and others, that they call statistics the Law of Large Numbers, and I do not. This lack of separation between both notions also shows up in a recent general public talk on Poincaré’s mistakes by Cédric Villani! All this may just mean I am irremediably Bayesian, seeing anything motivated by frequencies as non-statistical!) But that process does not mean that c can take a range of values that would index a family of densities compatible with a given sample. In this Monte Carlo integration approach, the distribution of the sample is completely under control (modulo the errors induced by pseudo-random generation). This approach is therefore outside the realm of Bayesian analysis “that puts distributions on fixed but unknown constants”, because those unknown constants parameterise the distribution of an observed sample. Ergo, c is not a parameter of the sample and the sample Larry argues about (“we have data sampled from a distribution”) contains no information whatsoever about c that is not already in the function g. (It is not “data” in this respect, but a stochastic sequence that can be used for approximation purposes.) Which gets me back to my first argument, namely that c is known (and at the same time difficult or impossible to compute)!

Let me also answer here the comments on “why is this any different from estimating the speed of light c?” “why can’t you do this with the 100th digit of π?” on the earlier post or on StackExchange. Estimating the speed of light means for me (who repeatedly flunked Physics exams after leaving high school!) that we have a physical experiment that measures the speed of light (as the original one by Rœmer at the Observatoire de Paris I visited earlier last week) and that the statistical analysis infers about c by using those measurements and the impact of the imprecision of the measuring instruments (as we do when analysing astronomical data). If, now, there exists a physical formula of the kind

$c=\int_\Xi \psi(\xi) \varphi(\xi) \text{d}\xi$

where φ is a probability density, I can imagine stochastic approximations of c based on this formula, but I do not consider it a statistical problem any longer. The case is thus clearer for the 100th digit of π: it is also a fixed number, that I can approximate by a stochastic experiment but on which I cannot attach a statistical tag. (It is 9, by the way.) Throwing darts at random as I did during my Oz tour is not a statistical procedure, but simple Monte Carlo à la Buffon…

Overall, I still do not see this as a paradox for our field (and certainly not as a critique of Bayesian analysis), because there is no reason a statistical technique should be able to address any and every numerical problem. (Once again, Persi Diaconis would almost certainly differ, as he defended a Bayesian perspective on numerical analysis in the early days of MCMC…) There may be a “Bayesian” solution to this particular problem (and that would nice) and there may be none (and that would be OK too!), but I am not even convinced I would call this solution “Bayesian”! (Again, let us remember this is mostly for intellectual fun!)

## estimating a constant

Posted in Books, Statistics with tags , , , , , , , , , on October 3, 2012 by xi'an

Paulo (a.k.a., Zen) posted a comment in StackExchange on Larry Wasserman‘s paradox about Bayesians and likelihoodists (or likelihood-wallahs, to quote Basu!) being unable to solve the problem of estimating the normalising constant c of the sample density, f, known up to a constant

$f(x) = c g(x)$

(Example 11.10, page 188, of All of Statistics)

My own comment is that, with all due respect to Larry!, I do not see much appeal in this example, esp. as a potential criticism of Bayesians and likelihood-wallahs…. The constant c is known, being equal to

$1/\int_\mathcal{X} g(x)\text{d}x$

If c is the only “unknown” in the picture, given a sample x1,…,xn, then there is no statistical issue whatsoever about the “problem” and I do not agree with the postulate that there exist estimators of c. Nor priors on c (other than the Dirac mass on the above value). This is not in the least a statistical problem but rather a numerical issue.That the sample x1,…,xn can be (re)used through a (frequentist) density estimate to provide a numerical approximation of c

$\hat c = \hat f(x_0) \big/ g(x_0)$

is a mere curiosity. Not a criticism of alternative statistical approaches: e.g., I could also use a Bayesian density estimate…

Furthermore, the estimate provided by the sample x1,…,xn is not of particular interest since its precision is imposed by the sample size n (and converging at non-parametric rates, which is not a particularly relevant issue!), while I could use importance sampling (or even numerical integration) if I was truly interested in c. I however find the discussion interesting for many reasons

1. it somehow relates to the infamous harmonic mean estimator issue, often discussed on the’Og!;
2. it brings more light on the paradoxical differences between statistics and Monte Carlo methods, in that statistics is usually constrained by the sample while Monte Carlo methods have more freedom in generating samples (up to some budget limits). It does not make sense to speak of estimators in Monte Carlo methods because there is no parameter in the picture, only “unknown” constants. Both fields rely on samples and probability theory, and share many features, but there is nothing like a “best unbiased estimator” in Monte Carlo integration, see the case of the “optimal importance function” leading to a zero variance;
3. in connection with the previous point, the fascinating Bernoulli factory problem is not a statistical problem because it requires an infinite sequence of Bernoullis to operate;
4. the discussion induced Chris Sims to contribute to StackExchange!