## sufficient statistics for machine learning

Posted in Books, Running, Statistics, Travel on April 26, 2022 by xi'an

By chance, I came across this ICML 2019 paper by Milan Cvitkovic and Günther Koliander, Minimal Achievable Sufficient Statistic Learning, on a form of sufficiency for machine learning. The paper starts with “our” standard notion of sufficiency, albeit in a predictive sense, namely that Z=T(X) is sufficient for predicting Y if the conditional distribution of Y given Z is the same as the conditional distribution of Y given X. It also acknowledges that minimal sufficiency may be out of reach. However, and without pursuing this question into the depths of said paper, I am surprised that any type of sufficiency can be achieved there since the model stands outside exponential families, which the Darmois-Pitman-Koopman lemma singles out as the only families (with constant support) allowing for a fixed dimension sufficient statistic. Obviously, this is not a sufficiency notion in the statistical sense, since there is no likelihood (albeit there are parameters involved in the deep learning network). And Y is a discrete variate, which means that

$\mathbb P(Y=1|x),\ \mathbb P(Y=2|x),\ldots$

is a sufficient “statistic” for a fixed conditional, but I am lost at how the solution proposed in the paper could be minimal when the dimension and structure of T(x) are chosen from the start. A very different notion, for sure!
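As a minimal sketch of the predictive notion above, the following toy check takes a discrete X and a binary Y and verifies that the vector of conditional class probabilities, used as T(x), gives the same conditional for Y given T(X) as for Y given X. The joint distribution and the names `p_x`, `p_y_given_x`, `T` and `conditional_given_T` are assumptions made for illustration, not taken from the paper.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical toy model: marginal of X and conditional of Y given X (Y in {0, 1}).
p_x = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
p_y_given_x = {
    0: (F(4, 5), F(1, 5)),
    1: (F(4, 5), F(1, 5)),
    2: (F(3, 10), F(7, 10)),
    3: (F(3, 10), F(7, 10)),
}


def T(x):
    """Candidate statistic: the vector (P(Y=0|x), P(Y=1|x))."""
    return p_y_given_x[x]


def conditional_given_T():
    """Compute P(Y = y | T(X) = z) by summing the joint over {x : T(x) = z}."""
    joint = defaultdict(lambda: [F(0), F(0)])
    for x, px in p_x.items():
        for y, pyx in enumerate(p_y_given_x[x]):
            joint[T(x)][y] += px * pyx
    return {z: tuple(w / sum(ws) for w in ws) for z, ws in joint.items()}


p_y_given_z = conditional_given_T()

# Predictive sufficiency: P(Y | T(X) = T(x)) equals P(Y | X = x) for every x.
is_sufficient = all(p_y_given_z[T(x)] == p_y_given_x[x] for x in p_x)
print(len(p_y_given_z), is_sufficient)
```

In this toy case T takes only two distinct values while X takes four, so the statistic does compress X. The check says nothing about minimality in the sense of the paper.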

## conjugate priors and sufficient statistics

Posted in Statistics on March 29, 2021 by xi'an

An X validated question rekindled my interest in the connection between sufficiency and conjugacy, by asking whether or not there was an equivalence between the existence of a (finite dimension) conjugate family of priors and the existence of a fixed (in n, the sample size) dimension sufficient statistic. This is outside exponential families, meaning that the support of the sampling distribution needs to vary with the parameter.

While the existence of a sufficient statistic T of fixed dimension d whatever the (large enough) sample size n seems to clearly imply the existence of a (finite dimension) conjugate family of priors, or rather of a family associated with each possible dominating (prior) measure,

$\mathfrak F=\{ \tilde \pi(\theta)\propto \tilde {f_n}(t_n(x_{1:n})|\theta) \pi_0(\theta)\,;\ n\in \mathbb N, x_{1:n}\in\mathfrak X^n\}$

the reverse statement is a wee bit more delicate to prove, due to the varying supports of the sampling or prior distributions. Unless some conjugate prior in the assumed family has an unrestricted support, the argument seems to limit sufficiency to a particular subset of the parameter set. I think that the result remains correct in general but could not rigorously wrap up the proof.
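The easy direction (from a fixed dimension sufficient statistic to a conjugate family) can be illustrated outside exponential families with a standard textbook case that is not part of the original question: an i.i.d. U(0,θ) sample, whose support varies with θ, with sufficient statistic (n, max x) and a Pareto(α, β) prior. The sketch below checks numerically, on a grid, that the unnormalised posterior is proportional to the Pareto(α+n, max(β, max x)) density. The sample, hyperparameters and function names are all hypothetical.

```python
def pareto_pdf(theta, alpha, beta):
    """Pareto(alpha, beta) density: alpha * beta**alpha / theta**(alpha + 1) on theta >= beta."""
    return alpha * beta**alpha / theta ** (alpha + 1) if theta >= beta else 0.0


def uniform_likelihood(theta, xs):
    """Likelihood of an i.i.d. U(0, theta) sample: theta**(-n) on theta >= max(xs)."""
    return theta ** (-len(xs)) if theta >= max(xs) else 0.0


def conjugate_update(alpha, beta, xs):
    """Posterior hyperparameters; the data enter only through (n, max(xs))."""
    return alpha + len(xs), max(beta, max(xs))


xs = [0.8, 2.3, 1.1, 1.9]   # hypothetical sample
alpha0, beta0 = 2.0, 1.5    # hypothetical prior hyperparameters
alpha1, beta1 = conjugate_update(alpha0, beta0, xs)

# Compare likelihood x prior with the claimed Pareto posterior on a grid of theta values.
grid = [0.5 + 0.25 * k for k in range(40)]
ratios = []
for theta in grid:
    unnorm = uniform_likelihood(theta, xs) * pareto_pdf(theta, alpha0, beta0)
    post = pareto_pdf(theta, alpha1, beta1)
    if post == 0.0:
        assert unnorm == 0.0  # both vanish below max(beta0, max(xs))
    else:
        ratios.append(unnorm / post)

# A constant ratio means the two functions of theta are proportional.
constant_ratio = max(ratios) - min(ratios) < 1e-12 * max(ratios)
print(alpha1, beta1, constant_ratio)
```

The posterior support [max(β, max x), ∞) depends on the data, which is the kind of varying support that makes the reverse implication delicate.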

## principles of uncertainty (second edition)

Posted in Books, Statistics, Travel, University life on July 21, 2020 by xi'an

A new edition of Principles of Uncertainty is about to appear. I was asked by CRC Press to review the new book and here are some (raw) extracts from my review. (Some comments may not apply to the final and published version, mind.)

In Chapter 6, the proof of the Central Limit Theorem utilises the “smudge” technique, which is to add an independent noise to both the sequence of rvs and its limit. This is most effective and reminds me of quite a similar proof Jacques Neveu used in his probability notes at Polytechnique. Which went under the more formal denomination of convolution, with the same (commendable) purpose of avoiding Fourier transforms. If anything, I would have favoured a slightly more condensed presentation in less than 8 pages. Is Corollary 6.5.8 useful or even correct??? I do not think so because the non-centred average rescaled by √n diverges almost surely. For the same reason, I object to the very first sentence of Section 6.5 (p.246).
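To spell out the divergence claim under assumptions that may differ from the exact setting of the book (i.i.d. rvs with mean μ≠0 and finite variance σ²), one can decompose the rescaled non-centred average as

$\sqrt{n}\,\bar X_n = \sqrt{n}\,(\bar X_n-\mu) + \sqrt{n}\,\mu$

where the first term converges in distribution to a N(0,σ²) variate by the Central Limit Theorem, while the second term grows like √n. Since the strong law of large numbers gives that the average converges to μ almost surely, it follows that

$|\sqrt{n}\,\bar X_n| \longrightarrow \infty\quad\text{almost surely}$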

In Chapter 7, I found a nice mention of (Hermann) Rubin’s insistence on not separating probability and utility as only the product matters. And another fascinating quote from Keynes, not from his early years as a statistician, but from 1937, as an established economist:

“The sense in which I am using the term uncertain is that in which the prospect of a European war is uncertain, or the price of copper and the rate of interest twenty years hence, or the obsolescence of a new invention, or the position of private wealth-owners in the social system in 1970. About these matters there is no scientific basis on which to form any calculable probability whatever. We simply do not know. Nevertheless, the necessity for action and for decision compels us as practical men to do our best to overlook this awkward fact and to behave exactly as we should if we had behind us a good Benthamite calculation of a series of prospective advantages and disadvantages, each multiplied by its appropriate probability, waiting to the summed.”

(is the last sentence correct? I would have expected, pardon my French!, “to be summed”). Further interesting trivia on the criticisms of utility theory, including de Finetti’s role and his own lack of connection with subjective probability principles.

In Chapter 8, a major remark (iii), found on p.293, about the fact that a conjugate family requires a dominating measure (although this is expressed differently since the book shies away from introducing measure theory), reminds me of a conversation I had with Jay when I visited Carnegie Mellon in 2013 (?). Which exposes the futility of seeing conjugate priors as default priors. It is somewhat surprising that a notion like admissibility appears as a side quantity when discussing Stein’s paradox in 8.2.1 [and then later in Section 9.1.3] while it seems to me to be central to Bayesian decision theory, much more than the epiphenomenon that Stein’s paradox represents in the big picture. But the book dismisses minimaxity even faster in Section 9.1.4:

As many who suffer from paranoia have discovered, one can always dream-up an even worse possibility to guard against. Thus, the minimax framework is unstable. (p.336)

Interesting introduction of the Wishart distribution to kindly handle random matrices and matrix Jacobians, with the original space being the p(p+1)/2 real space (implicitly endowed with the Lebesgue measure). Rather than a more structured matricial space. A font error makes Corollary 8.7.2 abort abruptly. The space of positive definite matrices is mentioned in Section 8.7.5 but still (implicitly) corresponds to the common p(p+1)/2 real Euclidean space. Another typo in Theorem 8.9.2 with a Frenchised version of Dirichlet, Dirichelet. Followed by a Dirchlet at the end of the proof (p.322). Again and again on p.324 and on following pages. I would object to the singular in the title of Section 8.10 as there are exponential families rather than a single one. With no mention made of the Pitman-Koopman lemma and its consequences, namely that the existence of conjugacy remains an epiphenomenon. Hence making the number of pages dedicated to gamma, Dirichlet and Wishart distributions somewhat excessive.

In Chapter 9, I noticed (p.334) a Scheffe that should be Scheffé (and again at least on p.444). (I love it that Jay also uses my favorite admissible (non-)estimator, namely the constant value estimator with value 3.) I wonder at the worth of a ten-line section like 9.3, when there are delicate issues in handling meta-analysis, even in a Bayesian mood (or mode). In the model uncertainty section, Jay discusses the (im)pertinence of both selecting one of the models and setting independent priors on their respective parameters, with which I disagree on both levels. Although this is followed by a more reasonable (!) perspective on utility. Nice to see a section on causation, although I would have welcomed an insert on the recent and somewhat outrageous stand of Pearl (and MacKenzie) on statisticians missing the point on causation and counterfactuals by miles. Nonparametric Bayes is a new section, inspired by Ghahramani (2005). But while it mentions Gaussian and Dirichlet [invariably misspelled!] processes, I fear it falls short of enticing the reader to truly grasp the meaning of a prior on functions. Besides mentioning it exists, I am unsure of the utility of this section. This is one of the rare instances where measure theory is discussed, only to state this is beyond the scope of the book (p.349).

## p-values, Bayes factors, and sufficiency

Posted in Books, pictures, Statistics on April 15, 2019 by xi'an

Among the many papers published in this special issue of TAS on statistical significance or lack thereof, there is a paper I had already read (besides ours!), namely the paper by Jonty Rougier (U of Bristol, hence the picture) on connecting p-values, likelihood ratio, and Bayes factors. Jonty starts from the notion that the p-value is induced by a transform, summary, statistic of the sample, t(x) (the larger this t(x), the less likely the null hypothesis, with density f⁰(x)), to create an embedding model by exponential tilting, namely the exponential family with dominating measure f⁰, natural statistic t(x), and a positive parameter θ. In this embedding model, a Bayes factor can be derived from any prior on θ and the p-value satisfies an interesting double inequality, namely that it is less than the likelihood ratio, itself lower than any (other) Bayes factor. One novel aspect from my perspective is that I had thought up to now that this inequality only holds for one-dimensional problems, but there is no constraint here on the dimension of the data x. A remark I presumably made to Jonty on the first version of the paper is that the p-value itself remains invariant under a bijective increasing transform of the summary t(.). This means that there exists an infinity of such embedding families and that the bound remains true over all such families, although the value of this minimum is beyond my reach (could it be the p-value itself?!). This point is also clear in the justification of the analysis thanks to the Pitman-Koopman lemma. Another remark is that the perspective can be inverted in a more realistic setting when a genuine alternative model M¹ is considered and a genuine likelihood ratio is available. In that case the Bayes factor remains smaller than the likelihood ratio, itself larger than the p-value induced by the likelihood ratio statistic. Or its log. The induced embedded exponential tilting is then a geometric mixture of the null and of the locally optimal member of the alternative. I wonder if there is a parameterisation of this likelihood ratio into a p-value that would turn it into a uniform variate (under the null). Presumably not. While the approach remains firmly entrenched within the realm of p-values and Bayes factors, this exploration of a natural embedding of the original p-value is definitely worth mentioning in a class on the topic! (One typo though, namely that the Bayes factor is mentioned to be lower than one, which is incorrect.)
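The invariance remark can be checked on a small exact example, given here as a sketch under assumptions of my own choosing rather than as the construction of the paper: a hypothetical Binomial(10, 1/2) null, the summary t(x)=x, and the increasing transform g(t)=t³. The p-value is unchanged by the transform while the exponentially tilted family, with density proportional to f⁰(x)exp{θ t(x)}, is not. All function names are illustrative.

```python
from math import comb, exp


def null_pmf(x, n=10):
    """Hypothetical null f0: Binomial(n, 1/2) probability mass at x."""
    return comb(n, x) / 2**n


def p_value(stat, x_obs, n=10):
    """P0(stat(X) >= stat(x_obs)), computed exactly by enumeration."""
    return sum(null_pmf(x, n) for x in range(n + 1) if stat(x) >= stat(x_obs))


def tilted_pmf(stat, theta, n=10):
    """Exponentially tilted family: f(x; theta) proportional to f0(x) * exp(theta * stat(x))."""
    weights = [null_pmf(x, n) * exp(theta * stat(x)) for x in range(n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def t(x):
    """Original summary statistic."""
    return x


def g_of_t(x):
    """Increasing transform of the summary (cubing is increasing on x >= 0)."""
    return x**3


x_obs = 8
p_t = p_value(t, x_obs)
p_g = p_value(g_of_t, x_obs)

same_p_value = p_t == p_g
different_family = tilted_pmf(t, 0.1) != tilted_pmf(g_of_t, 0.1)
print(p_t, same_p_value, different_family)
```

Each increasing transform of t(.) thus yields a different embedding family for the same p-value, which is why the bound has to hold across all of them.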

## Darmois, Koopman, and Pitman

Posted in Books, Statistics on November 15, 2017 by xi'an

When [X’ed] seeking a simple proof of the Pitman-Koopman-Darmois lemma [that exponential families are the only types of distributions with constant support allowing for a fixed dimension sufficient statistic], I came across a 1962 Stanford technical report by Don Fraser containing a short proof of the result. A proof that I do not fully understand, as it relies on the notion that the likelihood function itself is a minimal sufficient statistic.