## best unbiased estimators

Posted in Books, Kids, pictures, Statistics, University life on January 18, 2018 by xi'an

A question that came up on X validated today kept me busy for most of the day! It relates to an earlier question on the best unbiased nature of a maximum likelihood estimator, to which I pointed out the simple case of the Normal variance, where the estimate is not unbiased (but improves the mean square error). Here, the question is whether or not the maximum likelihood estimator of a location parameter, when corrected for its bias, is the best unbiased estimator (in the sense of the minimal variance). The question is quite interesting in that it links to the mathematical statistics of the 1950’s, of Charles Stein, Erich Lehmann, Henry Scheffé, and Debabrata Basu. For instance, if there exists a complete sufficient statistic for the problem, then there exists a best unbiased estimator of the location parameter, by virtue of the Lehmann-Scheffé theorem (it is also a consequence of Basu’s theorem). And the existence is pretty limited in that outside the two exponential families with location parameter, there is no other distribution meeting this condition, I believe. However, even if there is no complete sufficient statistic, there may still exist best unbiased estimators, as shown by Bondesson. But Lehmann and Scheffé in their magisterial 1950 Sankhya paper exhibit a counter-example, namely the U(θ-1,θ+1) distribution:

since no non-constant function of θ allows for a best unbiased estimator.

Looking in particular at the location parameter of a Cauchy distribution, I realised that the Pitman best equivariant estimator is unbiased as well [for all location problems] and hence dominates the (equivariant) maximum likelihood estimator which is unbiased in this symmetric case. However, as detailed in a nice paper of Gabriela Freue on this problem, I further discovered that there is no uniformly minimal variance estimator and no uniformly minimal variance unbiased estimator! (And that the Pitman estimator enjoys a closed form expression, as opposed to the maximum likelihood estimator.) This sounds a bit paradoxical but simply means that there exist different unbiased estimators whose variance functions are not ordered and hence not comparable, whether between themselves or with the variance of the Pitman estimator.
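As a minimal sketch for readers wishing to experiment, the Pitman estimator of a Cauchy location can be approximated as the ratio of the integrals of θ∏f(xᵢ-θ) and of ∏f(xᵢ-θ), here replaced by sums over a grid. The function name `pitman_cauchy`, the unit scale, and the grid settings are my own illustrative assumptions, and the code does not rely on the closed form expression.

```r
# Pitman (best equivariant) estimator of a Cauchy location, approximated on a grid.
# Assumptions: unit scale, grid truncated at median(x) +/- half_width.
pitman_cauchy <- function(x, half_width = 200, n_grid = 40001) {
  centre <- median(x)
  theta <- centre + seq(-half_width, half_width, length.out = n_grid)
  # log-likelihood of the sample at each grid value of the location
  loglik <- sapply(theta, function(t) sum(dcauchy(x, location = t, log = TRUE)))
  w <- exp(loglik - max(loglik))   # rescaled to avoid numerical underflow
  sum(theta * w) / sum(w)          # ratio of Riemann sums, the grid step cancels
}

set.seed(101)
x <- rcauchy(7, location = 2)
c(pitman = pitman_cauchy(x), median = median(x))
```

Since the grid is centred at the sample median, shifting the whole sample by a constant shifts the output by the same constant, which is the equivariance property at play in the post.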

## 10 great ideas about chance [book preview]

Posted in Books, pictures, Statistics, University life on November 13, 2017 by xi'an

[As I happened to be a reviewer of this book by Persi Diaconis and Brian Skyrms, I had the opportunity (and privilege!) to go through its earlier version. Here are the [edited] comments I sent back to PUP and the authors about this earlier version. All in all, a terrific book!!!]

The historical introduction (“measurement”) of this book is most interesting, especially its analogy of chance with length. I would have appreciated a connection earlier than Cardano, like some of the Greek philosophers, even though I gladly discovered there that Cardano was not only responsible for the closed form solutions to the third degree equation. I would also have liked to see more comments on the vexing issue of equiprobability: we all spend (if not waste) hours in the classroom explaining to (or arguing with) students why their solution is not correct. And they sometimes never get it! [And we sometimes get it wrong as well..!] Why is such a simple concept so hard to make explicit? In short, but this is nothing but a personal choice, I would have made the chapter more conceptual and less chronologically historical.

“Coherence is again a question of consistent evaluations of a betting arrangement that can be implemented in alternative ways.” (p.46)

The second chapter, about Frank Ramsey, is interesting, if only because it puts this “man of genius” back under the spotlight when he has all but been forgotten. (At least in my circles.) And for joining probability and utility together. And for postulating that probability can be derived from expectations rather than the opposite. Even though betting or gambling has a (negative) stigma in many cultures. At least gambling for money, since most of our actions involve some degree of betting. But not in a rational or reasoned manner. (Of course, this is not a mathematical but rather a psychological objection.) Further, the justification through betting is somewhat tautological in that it assumes probabilities are true probabilities from the start. For instance, the Dutch book example on p.39 produces a gain of .2 only if the probabilities are correct.

> gain=rep(0,1e4)
> for (t in 1:1e4){
+ p=rexp(3);p=p/sum(p)
+ gain[t]=(p[1]*(1-.6)+p[2]*(1-.2)+p[3]*(.9-1))/sum(p)}
> hist(gain)

As I made it clear at the BFF4 conference last Spring, I now realise I have never really adhered to the Dutch book argument. This may be why I find the chapter somewhat unbalanced with not enough written on utilities and too much on Dutch books.

“The force of accumulating evidence made it less and less plausible to hold that subjective probability is, in general, approximate psychology.” (p.55)

A chapter on “psychology” may come as a surprise, but I feel a posteriori that it is appropriate. Most of it is about the Allais paradox. Plus entries on Ellsberg’s distinction between risk and uncertainty, with only the former being quantifiable by “objective” probabilities. And on Tversky’s and Kahneman’s distinction between heuristics, and the framing effect, i.e., how the way propositions are expressed impacts the choice of decision makers. However, it leaves me unclear about the conclusion that the fact that people behave irrationally should not prevent a reliance on utility theory. Unclear because when taking actions involving other actors their potentially irrational choices should also be taken into account. (This is mostly nitpicking.)

“This is Bernoulli’s swindle. Try to make it precise and it falls apart. The conditional probabilities go in different directions, the desired intervals are of different quantities, and the desired probabilities are different probabilities.” (p.66)

The next chapter (“frequency”) is about Bernoulli’s Law of Large Numbers and the stabilisation of frequencies, with von Mises making it the basis of his approach to probability. And Birkhoff’s extension which is capital for the development of stochastic processes. And later for MCMC. I like the notions of “disreputable twin” (p.63) and “Bernoulli’s swindle” about the idea that “chance is frequency”. The authors call the identification of probabilities as limits of frequencies Bernoulli’s swindle, because it cannot handle zero probability events. With a nice link with the testing fallacy of equating rejection of the null with acceptance of the alternative. And an interesting description as to how Venn perceived the fallacy but could not overcome it: “If Venn’s theory appears to be full of holes, it is to his credit that he saw them himself.” The description of von Mises’ Kollektiven [and the welcome intervention of Abraham Wald] clarifies my previous and partial understanding of the notion, although I am unsure it is that clear for all potential readers. I also appreciate the connection with the very notion of randomness which has not yet found, I fear, a satisfactory definition. This chapter asks more (interesting) questions than it brings answers (to those or others). But enough, this is a brilliant chapter!

“…a random variable, the notion that Kac found mysterious in early expositions of probability theory.” (p.87)

Chapter 5 (“mathematics”) is very important [from my perspective] in that it justifies the necessity to associate measure theory with probability if one wishes to evolve further than urns and dice. To entitle Kolmogorov to posit his axioms of probability. And to define properly conditional probabilities as random variables (as my third-year students fail to realise). I enjoyed very much reading this chapter, but it may prove difficult to read for readers with no or little background in measure theory (although some advanced mathematical details have vanished from the published version). Still, this chapter constitutes a strong argument for preserving measure theory courses in graduate programs. As an aside, I find it amazing that mathematicians (even Kac!) had not at first realised the connection between measure theory and probability (p.84), but maybe not so amazing given the difficulty many still have with the notion of conditional probability. (Now, I would have liked to see some description of Borel’s paradox when it is mentioned (p.89).)

“Nothing hangs on a flat prior (…) Nothing hangs on a unique quantification of ignorance.” (p.115)

The following chapter (“inverse inference”) is about Thomas Bayes and his posthumous theorem, with an introduction setting the theorem at the centre of the Hume-Price-Bayes triangle. (It is nice that the authors include a picture of the original version of the essay, as the initial title is much more explicit than the published version!) A short coverage, in tune with the fact that Bayes only contributed a twenty-plus page paper to the field. And to be logically followed by a second part [formerly another chapter] on Pierre-Simon Laplace, both parts focussing on the selection of prior distributions on the probability of a Binomial (coin tossing) distribution. Emerging into a discussion of the position of statistics within or even outside mathematics. (And the assertion that Fisher was the Einstein of Statistics on p.120 may be disputed by many readers!)

“So it is perfectly legitimate to use Bayes’ mathematics even if we believe that chance does not exist.” (p.124)

The seventh chapter is about Bruno de Finetti with his astounding representation of exchangeable sequences as being mixtures of iid sequences. Defining an implicit prior on the side. While the description sticks to binary events, it gets quickly more advanced with the notion of partial and Markov exchangeability. With the most interesting connection between those exchangeabilities and sufficiency. (I would however disagree with the statement that “Bayes was the father of parametric Bayesian analysis” [p.133] as this is extrapolating too much from the Essay.) My next remark may be nonsensical, but I would have welcomed an entry at the end of the chapter on cases where the exchangeability representation fails, for instance those cases when there is no sufficiency structure to exploit in the model. A bonus to the chapter is a description of Birkhoff’s ergodic theorem “as a generalisation of de Finetti” (p.134-136), plus half a dozen pages of appendices on more technical aspects of de Finetti’s theorem.
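To make the representation concrete, here is a small simulation sketch of a binary exchangeable pair built as a mixture of iid tosses. The Uniform(0,1) mixing distribution and the variable names are my own assumptions, picked only for illustration.

```r
# Exchangeable binary pair built as a mixture of iid Bernoulli sequences.
# Assumption: a Uniform(0,1) mixing distribution on the success probability.
set.seed(42)
n_rep <- 1e5
theta <- runif(n_rep)              # one success probability per sequence
x1 <- rbinom(n_rep, 1, theta)      # first toss
x2 <- rbinom(n_rep, 1, theta)      # second toss, iid given theta

p10 <- mean(x1 == 1 & x2 == 0)     # theoretical value E[theta(1-theta)] = 1/6
p01 <- mean(x1 == 0 & x2 == 1)     # same value: the order does not matter
p11 <- mean(x1 == 1 & x2 == 1)     # E[theta^2] = 1/3 rather than 1/4
c(p10 = p10, p01 = p01, p11 = p11)
```

The two mixed outcomes have matching frequencies (exchangeability), while the frequency of two successes departs from the product of the marginals, so the tosses are dependent once θ is integrated out.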

“We want random sequences to pass all tests of randomness, with tests being computationally implemented”. (p.151)

The eighth chapter (“algorithmic randomness”) comes (again!) as a surprise as it centres on the character of Per Martin-Löf who is little known in statistics circles. (The chapter starts with a picture of him with the iconic Oberwolfach sculpture in the background.) Martin-Löf’s work concentrates on the notion of randomness, in a mathematical rather than probabilistic sense, and on the algorithmic consequences. I like very much the section on random generators. Including a mention of our old friend RANDU, the 15 planes random generator! This chapter connects with Chapter 4 since von Mises also attempted to define a random sequence. To the point it feels slightly repetitive (for instance Jean Ville is mentioned in rather similar terms in both chapters). Martin-Löf’s central notion is computability, which forces us to visit Turing’s machine. And its role in the undecidability of some logical statements. And Church’s recursive functions. (With a link not exploited here to the notion of probabilistic programming, where one language is actually named Church, after Alonzo Church.) Back to Martin-Löf, (I do not see how his test for randomness can be implemented on a real machine as the whole test requires going through the entire sequence: since this notion connects with von Mises’ Kollektivs, I am missing the point!) And then Kolmogorov is brought back with his own notion of complexity (which is also Chaitin’s and Solomonoff’s). Overall this is a pretty hard chapter both because of the notions it introduces and because I do not feel it is completely conclusive about the notion(s) of randomness. A side remark about casino hustlers and their “exploitation” of weak random generators: I believe Jeff Rosenthal has a similar if maybe simpler story in his book about Canadian lotteries.
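As an illustration of why RANDU is such a poor generator, the sketch below implements the recursion x ← 65539·x mod 2³¹ and checks that every consecutive triple satisfies an integer linear relation, which confines the triples to a handful of parallel planes (the classical count being 15). The function name `randu` and the seed are my own choices.

```r
# RANDU: x_{k+1} = 65539 * x_k mod 2^31, returned as uniforms on (0,1).
# The products stay below 2^53, so double arithmetic is exact here.
randu <- function(n, seed = 1) {
  x <- numeric(n)
  state <- seed
  for (i in seq_len(n)) {
    state <- (65539 * state) %% 2^31
    x[i] <- state
  }
  x / 2^31
}

u <- randu(3000)
k <- 1:(length(u) - 2)
# since 65539^2 = 6 * 65539 - 9 (mod 2^31), this combination is always an integer
combo <- 9 * u[k] - 6 * u[k + 1] + u[k + 2]
table(combo)   # each distinct value corresponds to one plane
```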

“Does quantum mechanics need a different notion of probability? We think not.” (p.180)

The penultimate chapter is about Boltzmann and the notion of “physical chance”. Or statistical physics. A story that involves Zermelo and Poincaré. And Gibbs, Maxwell and the Ehrenfests. The discussion focuses on the definition of probability in a thermodynamic setting, opposing time frequencies to space frequencies. Which requires ergodicity and hence Birkhoff [no surprise, this is about ergodicity!] as well as von Neumann. This reaches a point where conjectures in the theory are yet open. What I always (if presumably naïvely) find fascinating in this topic is the fact that ergodicity operates without requiring randomness. (Dynamical systems can enjoy an ergodic theorem while being completely deterministic.) This chapter also discusses quantum mechanics, whose main tenet requires probability. Which needs to be defined, from a frequency or a subjective perspective. And the Bernoulli shift that brings us back to random generators. The authors briefly mention the Einstein-Podolsky-Rosen paradox, which sounds more metaphysical than mathematical in my opinion, although they go into great detail to explain Bell’s conclusion that quantum theory leads to a mathematical impossibility (but they lost me along the way). Except that we “are left with quantum probabilities” (p.183). And the chapter leaves me still uncertain as to why statistical mechanics carries the label statistical. As it does not seem to involve inference at all.

“If you don’t like calling these ignorance priors on the ground that they may be sharply peaked, call them nondogmatic priors or skeptical priors, because these priors are quite in the spirit of ancient skepticism.” (p.199)

And then the last chapter (“induction”) brings us back to Hume and the 18th Century, where somehow “everything” [including statistics] started! Except that Hume’s strong scepticism (or skepticism) makes induction seemingly impossible. (A perspective with which I agree to some extent, if not to Keynes’ extreme version, when considering for instance financial time series as stationary. And a reason why I do not see the criticisms contained in the Black Swan as pertinent because they savage normality while accepting stationarity.) The chapter rediscusses Bayes’ and Laplace’s contributions to inference as well, challenging Hume’s conclusion of the impossibility to infer. Even though the representation of ignorance is not unique (p.199). And the authors call again for de Finetti’s representation theorem as bypassing the issue of whether or not there is such a thing as chance. And escaping inductive scepticism. (The section about Goodman’s grue hypothesis is somewhat distracting, maybe because I have always found it quite artificial and based on a linguistic pun rather than a logical contradiction.) The part about (Richard) Jeffrey is quite new to me but ends quite abruptly! Similarly about Popper and his exclusion of induction. From this chapter, I appreciated very much the section on skeptical priors and its analysis from a meta-probabilist perspective.

There is no conclusion to the book, but to end up with a chapter on induction seems quite appropriate. (But there is an appendix as a probability tutorial, mentioning Monte Carlo resolutions. Plus notes on all chapters. And a commented bibliography.) Definitely recommended!

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE. As appropriate for a book about Chance!]

## inferential models: reasoning with uncertainty [book review]

Posted in Books, Statistics, University life on October 6, 2016 by xi'an

“the field of statistics (…) is still surprisingly underdeveloped (…) the subject lacks a solid theory for reasoning with uncertainty [and] there has been very little progress on the foundations of statistical inference” (p.xvi)

A book that starts with such massive assertions is certainly hoping to attract some degree of attention from the field and likely to induce strong reactions to this dismissal of the not inconsiderable amount of research dedicated so far to statistical inference and in particular to its foundations. Or even attracting flak for not accounting (in this introduction) for the past work of major statisticians, like Fisher, Kiefer, Lindley, Cox, Berger, Efron, Fraser and many many others…. Judging from the references and the tone of this 254-page book, it seems like the two authors, Ryan Martin and Chuanhai Liu, truly aim at single-handedly resetting the foundations of statistics to their own tune, which sounds like a new kind of fiducial inference augmented with calibrated belief functions. Be warned that five chapters of this book are built on as many papers written by the authors in the past three years. Which makes me question, if I may, the relevance of publishing a book on a brand-new approach to statistics without further backup from a wider community.

“…it is possible to calibrate our belief probabilities for a common interpretation by intelligent minds.” (p.14)

Chapter 1 contains a description of the new perspective in Section 1.4.2, which I find useful to detail here. When given an observation x from a Normal N(θ,1) model, the authors rewrite X as θ+Z, with Z~N(0,1), as in fiducial inference, and then want to find a “meaningful prediction of Z independently of X”. This seems difficult to accept given that, once X=x is observed, Z=X-θ⁰, θ⁰ being the true value of θ, which belies the independence assumption. The next step is to replace Z~N(0,1) by a random set S(Z) containing Z and to define a belief function bel() on the parameter space Θ by

bel(A|X) = P(X-S(Z)⊆A)

which induces a pseudo-measure on Θ derived from the distribution of an independent Z, since X is already observed. When Z~N(0,1), this distribution does not depend on θ⁰ the true value of θ… The next step is to choose the belief function towards a proper frequentist coverage, in the approximate sense that the probability that bel(A|X) be more than 1-α is less than α when the [arbitrary] parameter θ is not in A. And conversely. This property (satisfied when bel(A|X) is uniform) is called validity or exact inference by the authors: in my opinion, restricted frequentist calibration would certainly sound more adequate.
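A small numerical sketch may help in checking the calibration property. It assumes the random set S(Z) = {z : |z| ≤ |Z|}, which is my choice for illustration and not necessarily the one retained in the book, and it considers the assertion A = (-∞, a]. Under that assumption, X - S(Z) is the interval [X-|Z|, X+|Z|], so that bel(A|X) = max(0, 2Φ(a-X) - 1). The names `bel_left`, `theta0`, `a` and `alpha` are hypothetical.

```r
# Belief in A = (-Inf, a] when X = theta + Z with Z ~ N(0,1).
# Assumption: random set S(Z) = {z : |z| <= |Z|}, so that
# bel(A|x) = P(x + |Z| <= a) = max(0, 2 * pnorm(a - x) - 1).
bel_left <- function(x, a) pmax(0, 2 * pnorm(a - x) - 1)

set.seed(7)
theta0 <- 0      # true parameter value, lying outside A
a <- -0.5
alpha <- 0.1
x <- rnorm(1e4, mean = theta0)
# frequency with which a false assertion receives belief of at least 1 - alpha
freq <- mean(bel_left(x, a) >= 1 - alpha)
freq
```

In this setting the frequency stays below α, in line with the validity property described above.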

“When there is no prior information available, [the philosophical justifications for Bayesian analysis] are less than fully convincing.” (p.30)

“Is it logical that an improper “ignorance” prior turns into a proper “non-ignorance” prior when combined with some incomplete information on the whereabouts of θ?” (p.44)

## Approximate Bayesian computation via sufficient dimension reduction

Posted in Statistics, University life on August 26, 2016 by xi'an

“One of our contribution comes from the mathematical analysis of the consequence of conditioning the parameters of interest on consistent statistics and intrinsically inconsistent statistics”

Xiaolong Zhong and Malay Ghosh have just arXived an ABC paper focussing on the convergence of the method. And on the use of sufficient dimension reduction techniques for the construction of summary statistics. I had not heard of this approach before so read the paper with interest. I however regret that the paper does not link with the recent consistency results of Liu and Fearnhead and of Daniel Frazier, Gael Martin, Judith Rousseau and myself. When conditioning upon the MLE [or the posterior mean] as the summary statistic, Theorem 1 states that the Bernstein-von Mises theorem holds, missing a limit in the tolerance ε. And apparently missing conditions on the speed of convergence of this tolerance to zero although the conditioning event involves the true value of the parameter. This makes me wonder at the relevance of the result. The part about partial posteriors and the characterisation of limiting posterior distributions starts with the natural remark that the mean of the summary statistic must identify the whole parameter θ to achieve consistency, a point central to our 2014 JRSS B paper. The authors suggest using a support vector machine to derive the summary statistics, an idea already exploited by Heiko Strathmann et al. There is no consistency result of relevance for ABC in that second and final part, which ends rather abruptly. Overall, while the paper contributes to the current reflection on the convergence properties of ABC, the lack of scaling of the tolerance ε calls for further investigations.
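For readers less familiar with the roles of the summary statistic and of the tolerance ε, here is a minimal ABC rejection sketch on a toy Normal mean problem. The model, the uniform prior, the sample mean as summary statistic and the value of ε are all my own assumptions and are not taken from the paper.

```r
# ABC rejection sampler for a Normal mean, with the sample mean as summary statistic.
# Assumptions: N(theta, 1) model, Uniform(-5, 5) prior, fixed tolerance eps.
set.seed(2016)
n <- 50
y_obs <- rnorm(n, mean = 1)          # pseudo-observed sample
s_obs <- mean(y_obs)                 # observed summary statistic

n_sim <- 20000
eps <- 0.05
theta <- runif(n_sim, -5, 5)         # draws from the prior
# one simulated dataset per column, column j generated under theta[j]
sims <- matrix(rnorm(n_sim * n, mean = rep(theta, each = n)), nrow = n)
s_sim <- colMeans(sims)
accepted <- theta[abs(s_sim - s_obs) < eps]   # keep draws whose summary is eps-close

c(n_accepted = length(accepted), abc_mean = mean(accepted), s_obs = s_obs)
```

Here the mean of the summary statistic is θ itself, hence identifies the parameter, and the accepted draws concentrate around the observed summary; how fast ε must decrease with n for such concentration to hold is the kind of scaling question raised above.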

## Inference for stochastic simulation models by ABC

Posted in Books, Statistics, University life on February 13, 2015 by xi'an

Hartig et al. published a while ago (2011) a paper in Ecology Letters entitled “Statistical inference for stochastic simulation models – theory and application”, which is mostly about ABC. (Florian Hartig pointed out the paper to me in a recent blog comment about my discussion of the early parts of Gutmann and Corander’s paper.) The paper is largely a tutorial and it reminds the reader about related methods like indirect inference and methods of moments. The authors also insist on presenting ABC as a particular case of likelihood approximation, whether non-parametric or parametric. Making connections with pseudo-likelihood and pseudo-marginal approaches. And including a discussion of the possible misfit of the assumed model, handled by an external error model. And also introducing the notion of informal likelihood (which could have been nicely linked with empirical likelihood). A last class of approximations presented therein is called rejection filters and reminds me very much of Oliver Ratmann’s papers.
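One common reading of a parametric likelihood approximation is to fit a Gaussian distribution to summary statistics simulated at a given parameter value and to evaluate the observed summary under that fit. The sketch below follows this reading on a toy Normal model with the sample median as summary; the function `synth_loglik`, the model and the simulation sizes are hypothetical choices of mine and not the implementation of the paper.

```r
# Parametric (Gaussian) approximation of the likelihood of a summary statistic.
# Assumptions: N(theta, 1) model, sample median as summary, n_sim simulations per theta.
synth_loglik <- function(theta, s_obs, n = 50, n_sim = 500) {
  sims <- matrix(rnorm(n_sim * n, mean = theta), nrow = n)  # one dataset per column
  s <- apply(sims, 2, median)                               # simulated summaries
  dnorm(s_obs, mean = mean(s), sd = sd(s), log = TRUE)      # Gaussian fit at s_obs
}

set.seed(11)
s_obs <- median(rnorm(50, mean = 1))   # pseudo-observed summary
ll_near <- synth_loglik(1, s_obs)      # parameter value close to the truth
ll_far <- synth_loglik(3, s_obs)       # parameter value far from the truth
c(near = ll_near, far = ll_far)
```

The approximation is noisy, since it depends on the simulated summaries, which is precisely where the questions of bias and extra variation discussed in this post enter.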

“Our general aim is to find sufficient statistics that are as close to minimal sufficiency as possible.” (p.819)

As in other ABC papers, and as often reported on this blog, I find the stress on sufficiency a wee bit too heavy as those models calling for approximation almost invariably do not allow for any form of useful sufficiency. Hence the mathematical statistics notion of sufficiency is mostly useless in such settings.

“A basic requirement is that the expectation value of the point-wise approximation of p(Sobs|φ) must be unbiased” (p.823)

As stated above the paper is mostly in tutorial mode, for instance explaining what MCMC and SMC methods are. As illustrated by the above figure. There is however a final and interesting discussion section on the impact of estimating the likelihood function at different values of the parameter. However, the authors seem to focus solely on pseudo-marginal results to validate this approximation, hence on unbiasedness, which does not work for most ABC approaches that I know. And for the approximations listed in the survey. Actually, it would be quite beneficial to devise a cheap tool to assess the bias or extra-variation due to the use of approximative techniques like ABC… A sort of 21st Century bootstrap?!

## that the median cannot be a sufficient statistic

Posted in Kids, Statistics, University life on November 14, 2014 by xi'an

When reading an entry on The Chemical Statistician stating that a sample median could often be a choice for a sufficient statistic, my attention was caught as I had never thought a median could be sufficient. After thinking a wee bit more about it, and even posting a question on Cross Validated, but getting no immediate answer, I came to the conclusion that medians (and other quantiles) cannot be sufficient statistics for arbitrary (large enough) sample sizes (a condition that excludes the obvious cases of one & two observations where the sample median equals the sample mean).

In the case when the support of the distribution does not depend on the unknown parameter θ, we can invoke the Darmois-Pitman-Koopman theorem, namely that the density of the observations is necessarily of the exponential family form,

$\exp\{ \theta T(x) - \psi(\theta) \}h(x)$

to conclude that, if the natural sufficient statistic

$S=\sum_{i=1}^n T(x_i)$

is minimal sufficient, then the median is a function of S, which is impossible since modifying an extreme in the n>2 observations modifies S but not the median.
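The argument can be checked numerically on a toy case. The sketch below assumes a N(θ,1) model, for which T(x)=x, and uses two made-up samples that share the same median but differ in their largest observation.

```r
# Two samples with the same median but different natural sufficient statistics.
# Assumption: N(theta, 1) model, so that T(x) = x and S is the sum of the observations.
x <- c(-1.2, -0.3, 0.1, 0.8, 1.5)
y <- x
y[which.max(y)] <- 4.0            # move the largest observation further out

c(median(x), median(y))           # identical medians
c(sum(x), sum(y))                 # different values of S

# log-likelihood ratio between theta = 1 and theta = 0, a function of S only
llr <- function(z) sum(dnorm(z, mean = 1, log = TRUE) - dnorm(z, mean = 0, log = TRUE))
c(llr(x), llr(y))
```

The likelihood ratio differs between the two samples even though the medians agree, so the median cannot carry all the information about θ.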

In the other case when the support does depend on the unknown parameter θ, we can consider the case when

$f(x|\theta) = h(x) \mathbb{I}_{A_\theta}(x) \tau(\theta)$

where the set indexed by θ is the support of f. In that case, the factorisation theorem implies that

$\prod_{i=1}^n \mathbb{I}_{A_\theta}(x_i)$

is a 0-1 function of the sample median. Adding a further observation y⁰ which does not modify the median then leads to a contradiction since it may be in or outside the support set.

Incidentally, as an aside, when looking for examples, I played with the distribution

$\dfrac{1}{2}\mathfrak{U}(0,\theta)+\dfrac{1}{2}\mathfrak{U}(\theta,1)$

which has θ as its theoretical median if not mean. In this example, not only is the sample median not sufficient (the only sufficient statistic is the order statistic and rightly so since the support is fixed and the distributions not in an exponential family), but the MLE is also different from the sample median. Here is an example with n=30 observations, the sienna bar being the sample median:
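A comparison of this kind can be reproduced along the following lines. This is a sketch under my own choices (θ=0.3, the seed, the helper names `rmix` and `loglik`), not the code behind the original picture. Since the log-likelihood is convex in θ between two consecutive order statistics, its supremum is approached next to an observation, which is why the candidates are taken just on either side of each data point.

```r
# Simulation from (1/2) U(0,theta) + (1/2) U(theta,1) and search for the MLE.
rmix <- function(n, theta) {
  left <- runif(n) < 0.5
  ifelse(left, runif(n, 0, theta), runif(n, theta, 1))
}

# log-likelihood: density 1/(2 theta) below theta and 1/(2 (1 - theta)) above
loglik <- function(theta, x) {
  k <- sum(x < theta)
  -k * log(2 * theta) - (length(x) - k) * log(2 * (1 - theta))
}

set.seed(30)
x <- rmix(30, theta = 0.3)
eps <- 1e-9
cand <- c(x - eps, x + eps)            # candidates just beside each observation
cand <- cand[cand > 0 & cand < 1]
ll <- sapply(cand, loglik, x = x)
mle <- cand[which.max(ll)]
c(mle = mle, median = median(x))
```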

## improved approximate-Bayesian model-choice method for estimating shared evolutionary history [reply from the author]

Posted in Books, Statistics, University life on June 3, 2014 by xi'an

[Here is a very kind and detailed reply from Jamie Oaks to the comments I made on his ABC paper a few days ago:]

First of all, many thanks for your thorough review of my pre-print! It is very helpful and much appreciated. I just wanted to comment on a few things you address in your post.

I am a little confused about how my replacement of continuous uniform probability distributions with gamma distributions for priors on several parameters introduces a potentially crippling number of hyperparameters. Both uniform and gamma distributions have two parameters. So, the new model only has one additional hyperparameter compared to the original msBayes model: the concentration parameter on the Dirichlet process prior on divergence models. Also, the new model offers a uniform prior over divergence models (though I don’t recommend it).

Your comment about there being no new ABC technique is 100% correct. The model is new, the ABC numerical machinery is not. Also, your intuition is correct, I do not use the divergence times to calculate summary statistics. I mention the divergence times in the description of the ABC algorithm with the hope of making it clear that the times are scaled (see Equation (12)) prior to the simulation of the data (from which the summary statistics are calculated). This scaling is simply to go from units proportional to time, to units that are proportional to the expected number of mutations. Clearly, my attempt at clarity only created unnecessary opacity. I’ll have to make some edits.

Regarding the reshuffling of the summary statistics calculated from different alignments of sequences, the statistics are not exchangeable. So, reshuffling them in a manner that is not consistent across all simulations and the observed data is not mathematically valid. Also, if elements are exchangeable, their order will not affect the likelihood (or the posterior, barring sampling error). Thus, if our goal is to approximate the likelihood, I would hope the reshuffling would also have little effect on the approximate posterior (otherwise my approximation is not so good?).

You are correct that my use of “bias” was not well defined in reference to the identity line of my plots of the estimated vs true probability of the one-divergence model. I think we can agree that, ideally (all assumptions are met), the estimated posterior probability of a model should estimate the probability that the model is correct. For large numbers of simulation replicates, the proportion of the replicates for which the one-divergence model is true will approximate the probability that the one-divergence model is correct. Thus, if the method has the desirable (albeit “frequentist”) behavior such that the estimated posterior probability of the one-divergence model is an unbiased estimate of the probability that the one-divergence model is correct, the points should fall near the identity line. For example, let us say the method estimates a posterior probability of 0.90 for the one-divergence model for 1000 simulated datasets. If the method is accurately estimating the probability that the one-divergence model is the correct model, then the one-divergence model should be the true model for approximately 900 of the 1000 datasets. Any trend away from the identity line indicates the method is biased in the (frequentist) sense that it is not correctly estimating the probability that the one-divergence model is the correct model. I agree this measure of “bias” is frequentist in nature. However, it seems like a worthwhile goal for Bayesian model-choice methods to have good frequentist properties. If a method strongly deviates from the identity line, it is much more difficult to interpret the posterior probabilities that it estimates. Going back to my example of the posterior probability of 0.90 for 1000 replicates, I would be alarmed if the model was true in only 100 of the replicates.
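The calibration check described in this paragraph can be sketched as follows. This is an idealised illustration where the estimated probabilities are calibrated by construction, so the binned points fall on the identity line; in an actual study `post_prob` would hold the ABC estimates and `truth` the indicators of the true model across simulated datasets. All names and settings here are hypothetical.

```r
# Binned comparison of estimated posterior probabilities with observed frequencies.
# Assumption: calibrated estimates by construction, for illustration only.
set.seed(2014)
n_rep <- 1e5
post_prob <- runif(n_rep)                  # estimated posterior probabilities
truth <- rbinom(n_rep, 1, post_prob)       # 1 when the model is the true one

bins <- cut(post_prob, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
est <- tapply(post_prob, bins, mean)       # average estimate within each bin
obs <- tapply(truth, bins, mean)           # proportion of replicates where the model is true
max_gap <- max(abs(est - obs))             # largest departure from the identity line
round(cbind(est, obs), 3)
```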

My apologies if my citation of your PNAS paper seemed misleading. The citation was intended to be limited to the context of ABC methods that use summary statistics that are insufficient across the models under comparison (like msBayes and the method I present in the paper). I will definitely expand on this sentence to make this clearer in revisions. Thanks!

Lastly, my concluding remarks in the paper about full-likelihood methods in this domain are not as lofty as you might think. The likelihood function of the msBayes model is tractable, and, in fact, has already been derived and implemented via reversible-jump MCMC (albeit, not readily available yet). Also, there are plenty of examples of rich, Kingman-coalescent models implemented in full-likelihood Bayesian frameworks. Too many to list, but a lot of them are implemented in the BEAST software package. One noteworthy example is the work of Bryant et al. (2012, Molecular Biology and Evolution, 29(8), 1917–32) that analytically integrates over all gene trees for biallelic markers under the coalescent.