Jeffreys priors for mixtures [or not]

Posted in Books, Statistics, University life on July 25, 2017 by xi'an

Clara Grazian and I have just arXived [and submitted] a paper on the properties of Jeffreys priors for mixtures of distributions. (An earlier version had not been deemed of sufficient interest by Bayesian Analysis.) In this paper, we consider the formal Jeffreys prior for a mixture of Gaussian distributions and examine whether or not it leads to a proper posterior with a sufficient number of observations. In general, it does not and hence cannot be used as a reference prior. While this is a negative result (and this is why Bayesian Analysis did not deem it of sufficient importance), I find it definitely relevant because it shows that the default reference prior [in the sense that the Jeffreys prior is the primary choice in unidimensional settings] does not operate in this wide class of distributions. What is surprising is that the use of a Jeffreys-like prior on a global location-scale parameter (as in our 1996 paper with Kerrie Mengersen or our recent work with Kaniav Kamary and Kate Lee) remains legit if proper priors are used on all the other parameters. (This may be yet another illustration of the tequila-like toxicity of mixtures!)

Francisco Rubio and Mark Steel already exhibited this difficulty of the Jeffreys prior for mixtures of densities with disjoint supports [which reveals the mixture latent variable and hence turns the problem into something different]. Which relates to another point of interest in the paper, derived from a 1988 [València Conference!] paper by José Bernardo and Javier Girón, where they show the posterior associated with a Jeffreys prior on a mixture is proper when (a) only estimating the weights p and (b) using densities with disjoint supports. José and Javier use in this paper an astounding argument that I had not seen before and which took me a while to ingest and accept. Namely, the Jeffreys prior on an observed model with latent variables is bounded from above by the Jeffreys prior on the corresponding completed model. Hence if the latter leads to a proper posterior for the observed data, so does the former. Very smooth, indeed!!!
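
A sketch of why such a bound can hold, in notation that is assumed here rather than taken from the 1988 paper: write x for the observed data, z for the latent variables, and $I_{\text{obs}}(\theta)$ and $I_{\text{comp}}(\theta)$ for the Fisher information matrices of the observed and completed models, both assumed to exist under the usual regularity conditions. The missing information identity reads

$I_{\text{comp}}(\theta) = I_{\text{obs}}(\theta) + \mathbb{E}_\theta\left[ I_{z|x}(\theta) \right]$

where $I_{z|x}(\theta)$ is the Fisher information of the conditional distribution of z given x, hence a nonnegative definite matrix. Since the determinant is monotone over nonnegative definite matrices,

$\pi^J_{\text{obs}}(\theta) = |I_{\text{obs}}(\theta)|^{1/2} \le |I_{\text{comp}}(\theta)|^{1/2} = \pi^J_{\text{comp}}(\theta)$

and, for discrete z, the observed likelihood being the sum over z of the completed likelihoods,

$\int \pi^J_{\text{obs}}(\theta) L(\theta|x)\,\text{d}\theta \le \sum_z \int \pi^J_{\text{comp}}(\theta) L(\theta|x,z)\,\text{d}\theta$

which is finite whenever every integral on the right-hand side is finite. Whether that last condition holds has to be checked model by model.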

Actually, we still support the use of the Jeffreys prior but only for the mixture weights, because it has the property supported by Judith and Kerrie of a conservative prior about the number of components. Obviously, we cannot advocate its use over all the parameters of the mixture since it then leads to an improper posterior.

what makes variables random [book review]

Posted in Books, Mountains, Statistics on July 19, 2017 by xi'an

When the goal of a book is to make measure theoretic probability available to applied researchers for conducting their research, I cannot but applaud! Peter Veazie’s goal of writing “a brief text that provides a basic conceptual introduction to measure theory” (p.4) is hence most commendable. Before reading What makes variables random, I was uncertain how this could be achieved with a limited calculus background, given the difficulties met by our third year maths students. After reading the book, I am even less certain this is feasible!

“…it is the data generating process that makes the variables random and not the data.”

Chapter 2 is about basic notions of set theory. Chapter 3 defines measurable sets and measurable functions and integrals against a given measure μ as

$\sup_\pi \sum_{A\in\pi}\inf_{\omega\in A} f(\omega)\mu(A)$

which I find particularly unnatural compared with the definition through simple functions (esp. because it does not tell how to handle 0×∞). The ensuing discussion shows the limitation of the exercise in that the definition is only explained for finite sets (since the notion of a partition achieving the supremum on page 29 is otherwise meaningless). This is a generic problem with the book, in that most examples in the probability section relate to discrete settings (see the discussion of the power set p.66). I also did not see a justification as to why measurable functions enjoy well-defined integrals in the above sense. All in all, to see fewer than ten pages allocated to measure theory per se is rather staggering! For instance,

$\int_A f\text{d}\mu$

does not appear to be defined at all.
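
To make the supremum-over-partitions definition above concrete in the finite case, here is a minimal R sketch. The three-point set, the values of f and the masses of μ are all made up for illustration and do not come from the book; the power set is assumed as σ-algebra, so every partition of the set is admissible.

```r
# Hypothetical finite example: Omega = {1, 2, 3} with the power set as sigma-algebra
f  <- c(2, 5, 1)        # values f(omega), chosen arbitrarily (nonnegative)
mu <- c(0.5, 0.2, 0.3)  # masses mu({omega}), chosen arbitrarily

# The five partitions of a three-point set, each block given as a vector of indices
partitions <- list(
  list(1:3),
  list(1, 2:3),
  list(2, c(1, 3)),
  list(3, 1:2),
  list(1, 2, 3)
)

# Lower sum attached to one partition: sum over blocks A of inf_A f times mu(A)
lower_sum <- function(p) {
  sum(sapply(p, function(A) min(f[A]) * sum(mu[A])))
}

sums <- sapply(partitions, lower_sum)
integral <- max(sums)   # supremum over all partitions
```

In this finite setting the supremum is reached by the finest partition (all singletons), so that `integral` coincides with `sum(f * mu)`. This is precisely the feature that has no counterpart on a continuous space.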

“…the mathematical probability theory underlying our analyses is just mathematics…”

Chapter 4 moves to probability measures. It distinguishes between objective (or frequentist) and subjective measures, which is of course open to diverse interpretations. And the definition of a conditional measure is the traditional one, conditional on a set rather than on a σ-algebra. Surprisingly as this is in my opinion one major reason for using measures in probability theory. And avoids unpleasant issues such as the Borel-Kolmogorov paradox. While random variables are defined in the standard sense of real valued measurable functions, I did not see a definition of a continuous random variable or of the Lebesgue measure. And there are only a few lines (p.48) about the notion of expectation, which is so central to measure-theoretic probability as to provide a way of entry into measure theory! Progressing further, the σ-algebra induced by a random variable is defined as a partition (p.52), a particularly obscure notion for continuous rv’s. When the conditional density of one random variable given the realisation of another is finally introduced (p.63), as an expectation reconciling with the set-wise definition of conditional probabilities, it is in a fairly convoluted way that I fear will scare newcomers out of their wits. Since it relies on a sequence of nested sets with positive measure, implying an underlying topology and the like, which somewhat shows the impossibility of the overall task…

“In the Bayesian analysis, the likelihood provides meaning to the posterior.”

Statistics is hurriedly introduced in a short section at the end of Chapter 4, assuming the notion of likelihood is already known by the readers. But nitpicking (p.65) at the representation of the terms in the log-likelihood as depending on an unspecified parameter value θ [not to be confused with the data-generating value of θ, which does not appear clearly in this section]. Section that manages to include arcane remarks distinguishing maximum likelihood estimation from Bayesian analysis, all this within a page! (Nowhere is the Bayesian perspective clearly defined.)

“We should no more perform an analysis clustered by state than we would cluster by age, income, or other random variable.”

The last part of the book is about probabilistic models, drawing a distinction between data generating process models and data models (p.89), by which the author means the hypothesised probabilistic model versus the empirical or bootstrap distribution. An interesting way to relate to the main thread, except that the convergence of the data distribution to the data generating process model cannot be established at this level. And hence that the very nature of bootstrap may be lost on the reader. A second and final chapter covers some common or vexing problems and the author’s approach to them. Revolving around standard errors, fixed and random effects. The distinction between standard deviation (“a mathematical property of a probability distribution”) and standard error (“representation of variation due to a data generating process”) that is followed for several pages seems to boil down to a possible (and likely) model mis-specification. The chapter also contains an extensive discussion of notations, like indexes (or indicators), which seems a strange focus esp. at this location in the book. Over 15 pages! (Furthermore, I find it quite confusing that a set of indices is denoted there by the double barred I, usually employed for the indicator function.)

“…the reader will probably observe the conspicuous absence of a time-honoured topic in calculus courses, the “Riemann integral”… Only the stubborn conservatism of academic tradition could freeze it into a regular part of the curriculum, long after it had outlived its historical importance.” Jean Dieudonné, Foundations of Modern Analysis

In conclusion, I do not see the point of this book, from its insistence on measure theory that never concretises for lack of mathematical material to an absence of convincing examples as to why this is useful for the applied researcher, to the intended audience which is expected to already know quite a lot about probability and statistics, to a final meandering around linear models that seems at odds with the remainder of What makes variables random, without providing an answer to this question. Or to the more relevant one of why Lebesgue integration is preferable to Riemann integration. (Not that there do not exist convincing replies to this question!)

Fourth Bayesian, Fiducial, and Frequentist Conference

Posted in Books, pictures, Statistics, Travel, University life, Wines on March 29, 2017 by xi'an

Next May 1-3, I will attend the 4th Bayesian, Fiducial and Frequentist Conference at Harvard University (hopefully not under snow at that time of year), which is a meeting between philosophers and statisticians about foundational thinking in statistics and inference under uncertainty. This should be fun! (Registration is now open.)

X-Outline of a Theory of Statistical Estimation

Posted in Books, Statistics, University life on March 23, 2017 by xi'an

While visiting Warwick last week, Jean-Michel Marin pointed out and forwarded me this remarkable paper of Jerzy Neyman, published in 1937, and presented to the Royal Society by Harold Jeffreys.

“Leaving apart on one side the practical difficulty of achieving randomness and the meaning of this word when applied to actual experiments…”

“It may be useful to point out that although we are frequently witnessing controversies in which authors try to defend one or another system of the theory of probability as the only legitimate, I am of the opinion that several such theories may be and actually are legitimate, in spite of their occasionally contradicting one another. Each of these theories is based on some system of postulates, and so long as the postulates forming one particular system do not contradict each other and are sufficient to construct a theory, this is as legitimate as any other.”

This paper is fairly long in part because Neyman starts by setting Kolmogorov’s axioms of probability. This is of historical interest but also needed for Neyman to oppose his notion of probability to Jeffreys’ (which is the same from a formal perspective, I believe!). He actually spends a fair chunk on explaining why constants cannot have anything but trivial probability measures. Getting ready to state that an a priori distribution has no meaning (p.343) and that in the rare cases it does it is mostly unknown. While reading the paper, I thought that the distinction was more in terms of frequentist or conditional properties of the estimators, Neyman’s arguments paving the way to his definition of a confidence interval. Assuming repeatability of the experiment under the same conditions and therefore same parameter value (p.344).

“The advantage of the unbiassed [sic] estimates and the justification of their use lies in the fact that in cases frequently met the probability of their differing very much from the estimated parameters is small.”

“…the maximum likelihood estimates appear to be what could be called the best “almost unbiassed [sic]” estimates.”

It is also quite interesting to read that the principle for insisting on unbiasedness is one of producing small errors, because this is not that often the case, as shown by the complete class theorems of Wald (ten years later). And that maximum likelihood is somewhat relegated to a secondary rank, almost unbiased being understood as consistent. A most amusing part of the paper is when Neyman inverts the credible set into a confidence set, that is, turning what is random into a constant and vice-versa. With a justification that the credible interval has zero or one coverage, while the confidence interval has a long-run validity of returning the correct rate of success. What is equally amusing is that the boundaries of a credible interval turn into functions of the sample, hence could be evaluated on a frequentist basis, as done later by Dennis Lindley and others like Welch and Peers, but that Neyman fails to see this and turns the bounds into hard values. For a given sample.

“This, however, is not always the case, and in general there are two or more systems of confidence intervals possible corresponding to the same confidence coefficient α, such that for certain sample points, E’, the intervals in one system are shorter than those in the other, while for some other sample points, E”, the reverse is true.”

The resulting construction of a confidence interval is then awfully convoluted when compared with the derivation of an HPD region, going through regions of acceptance that are the dual of a confidence interval (in the sampling space), while apparently [from my hasty read] missing a rule to order them. And rejecting the notion of a confidence interval being possibly empty, which, while being of practical interest, clashes with its frequentist backup.

round-table on Bayes[ian[ism]]

Posted in Books, pictures, Statistics, University life on March 7, 2017 by xi'an

In a [sort of] coincidence, shortly after writing my review on Le bayésianisme aujourd’hui, I got invited by the book editor, Isabelle Drouet, to take part in a round-table on Bayesianism in La Sorbonne. Which constituted the first seminar in the monthly series of the séminaire “Probabilités, Décision, Incertitude”. Invitation that I accepted and honoured by taking part in this public debate (if not dispute) on all [or most] things Bayes. Along with Paul Egré (CNRS, Institut Jean Nicod) and Pascal Pernot (CNRS, Laboratoire de chimie physique). And without a neuroscientist, who could not or would not attend.

While nothing earthshaking came out of the seminar, and certainly not from me!, it was interesting to hear of the perspectives of my philosophy+psychology and chemistry colleagues, the former explaining his path from classical to Bayesian testing (while mentioning trying to read the book Statistical rethinking reviewed a few months ago) and the latter the difficulty of teaching both colleagues and students the need for an assessment of uncertainty in measurements. And alluding to GUM, developed by the Bureau International des Poids et Mesures I visited last year. I tried to present my relativity viewpoints on the [relative] nature of the prior, to avoid the usual morass of debates on the nature and subjectivity of the prior, tried to explain Bayesian posteriors via ABC, mentioned examples from The Theory That Would Not Die, as yet untranslated into French, and expressed reservations about the glorious future of Bayesian statistics as we know it. This seminar was fairly enjoyable, with none of the stress induced by the constraints of a radio-show. Just too bad it did not attract a wider audience!

le bayésianisme aujourd’hui [book review]

Posted in Books, pictures, Statistics, University life on March 4, 2017 by xi'an

It is quite rare to see a book published in French about Bayesian statistics and even rarer to find one that connects philosophy of science, foundations of probability, statistics, and applications in neurosciences and artificial intelligence. Le bayésianisme aujourd’hui (Bayesianism today) was edited by Isabelle Drouet, a Reader in Philosophy at La Sorbonne. And includes a chapter of mine on the basics of Bayesian inference (à la Bayesian Choice), written in French like the rest of the book.

The title of the book is rather surprising (to me) as I had never heard the term Bayesianism mentioned before. As shown by this link, the term apparently exists. (Even though I dislike the sound of it!) The notion is one of a probabilistic structure of knowledge and learning, à la Poincaré. As described in the beginning of the book. But I fear the arguments minimising the subjectivity of the Bayesian approach should not be advanced, following my new stance on the relativity of probabilistic statements, if only because they are defensive and open the path all too easily to counterarguments. Similarly, the argument according to which the “Big Data” era makes the impact of the prior negligible and paradoxically justifies the use of Bayesian methods is limited to the case of little Big Data, i.e., when the observations are more or less iid with a limited number of parameters. Not when the number of parameters explodes. Another set of arguments that I find both more modern and compelling [for being modern is not necessarily a plus!] is the ease with which the Bayesian framework allows for integrative and cooperative learning. Along with its ultimate modularity, since each component of the learning mechanism can be extracted and replaced with an alternative.

another wrong entry

Posted in Books, Kids, R, Statistics, University life on June 27, 2016 by xi'an

Quite a coincidence! I just came across another bug in Lynch’s (2007) book, Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. Already discussed here and on X validated. While working with one participant in the post-ISBA softshop, we were looking for efficient approaches to simulating correlation matrices and came [by Google] across the above R code associated with a 3×3 correlation matrix, which misses the additional constraint that the determinant must be positive. As shown e.g. by the example

> eigen(matrix(c(1,-.8,.7,-.8,1,.6,.7,.6,1),ncol=3))
$values
[1] 1.8169834 1.5861960 -0.4031794

having all correlations between -1 and 1 is not enough. Just. Not. Enough.
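
For a 3×3 matrix with unit diagonal, the missing condition can be written out explicitly: the determinant is 1 + 2·r12·r13·r23 − r12² − r13² − r23², and by Sylvester's criterion its positivity, together with all correlations being strictly between -1 and 1, is enough for positive definiteness. A minimal sketch follows. The function names `det3`, `is_valid_corr3` and `rcorr3` are made up for illustration, and the rejection step is just one simple option, not the code from Lynch's book nor necessarily the most efficient approach.

```r
# Determinant of a 3x3 correlation matrix with off-diagonal entries r12, r13, r23
det3 <- function(r12, r13, r23) {
  1 + 2 * r12 * r13 * r23 - r12^2 - r13^2 - r23^2
}

# TRUE when the three correlations define a positive definite matrix
is_valid_corr3 <- function(r12, r13, r23) {
  all(abs(c(r12, r13, r23)) < 1) && det3(r12, r13, r23) > 0
}

# Rejection sampler (an assumed scheme): propose the three correlations
# uniformly on (-1, 1) and keep the first triplet with a positive determinant
rcorr3 <- function() {
  repeat {
    r <- runif(3, -1, 1)
    if (is_valid_corr3(r[1], r[2], r[3])) break
  }
  matrix(c(1,    r[1], r[2],
           r[1], 1,    r[3],
           r[2], r[3], 1), ncol = 3)
}

# The counter-example from the post: correlations in (-1, 1), negative determinant
is_valid_corr3(-.8, .7, .6)
```

The last call returns `FALSE`, since the determinant of the matrix in the example equals -1.162, in agreement with the product of the three eigenvalues printed above.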