## Archive for the Statistics Category

## hue & cry [book review]

Posted in Statistics with tags book review, Church of Scotland, Edinburgh, hue & cry, Scotland, St. Andrew, witchery on December 8, 2018 by xi'an

While visiting the Blackwell’s bookstore by the University of Edinburgh last June, I spotted this historical whodunit in the local interest section: Hue & Cry by Shirley McKay. It stayed on a to-read pile by my bed until a few weeks ago, when I started reading it and got more and more engrossed in the story. While the style is not always at its best and the crime aspects are somewhat thin, I found the description of the Scottish society of the time (1570s) fascinating (and hopefully accurate), especially the absolute dominion of the local Church (Kirk) over every aspect of life and the helplessness of women, always under the threat of witchcraft accusations, which could end with the death penalty, as in thousands of cases. The book reminds me to some extent of the early Susanna Gregory books, in that it also involves scholars teaching well-off students with limited intellectual abilities, while bright but poorer students have to work for the college to make up for their lack of funds. As indicated above, the criminal part is less interesting, as the main investigator unfolds the complicated plot without much of a hint, and convinces the juries rather too easily in my opinion. An overall fine novel, nonetheless!

## selected parameters from observations

Posted in Books, Statistics with tags censored data, FDR, joint dis, Journal of the Royal Statistical Society, random effects, ranking and selection, Stephen Senn, truncated normal on December 7, 2018 by xi'an

I recently read a fairly interesting paper by Daniel Yekutieli on a Bayesian perspective for parameters selected after viewing the data, published in Series B in 2012. (Disclaimer: I was not involved in processing this paper!)

The first example differentiates the Normal-Normal mean posterior, when θ is N(0,1) and x is N(θ,1), from the restricted posterior when θ is N(0,1) and x is N(θ,1) truncated to (0,∞), by restating the latter as repeated generation from the joint until x>0. This does not sound particularly controversial, except for the notion of *selecting the parameter after viewing the data*. That the posterior support may depend on the data is not that surprising..!
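
The restatement of the truncated model as repeated generation from the joint is easy to check by simulation; a minimal sketch (sample size and seed are my own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Repeatedly draw (theta, x) from the joint theta ~ N(0,1), x | theta ~ N(theta,1)
# and keep only the pairs with x > 0: the retained thetas then follow the
# selection-adjusted (restricted) posterior marginal, not the plain N(0,1) prior.
n_keep = 100_000
thetas = []
while len(thetas) < n_keep:
    t = rng.standard_normal(n_keep)
    x = t + rng.standard_normal(n_keep)   # x = theta + noise, i.e. x | theta ~ N(theta,1)
    thetas.extend(t[x > 0])               # accept-reject on the selection event x > 0
thetas = np.array(thetas[:n_keep])

# Under the selection x > 0, theta is shifted towards positive values on average
print(thetas.mean())
```

The shift of the retained θ's away from the N(0,1) prior mean of zero is precisely the impact of conditioning on the selection event.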

“The observation that selection affects Bayesian inference carries the important implication that in Bayesian analysis of large data sets, for each potential parameter, it is necessary to explicitly specify a selection rule that determines when inference is provided for the parameter and provide inference that is based on the selection-adjusted posterior distribution of the parameter.” (p.31)

The more interesting distinction is between “fixed” and “random” parameters (Section 2.1), which separates cases where the data is from a truncated distribution (given the parameter) from cases where the joint distribution is truncated but misses the normalising constant (a function of θ) for the truncated sampling distribution. The “mixed” case introduces a hyperparameter λ and the normalising constant integrates out θ and depends on λ, which amounts to switching to another (marginal) prior on θ. This is quite interesting, even though one can debate the status as true parameters of the “random” and “mixed” “parameters”, which are those where the posterior most often changes. Take for instance Stephen Senn’s example (p.6) of the mean associated with the largest observation in a sample of Normal variates with distinct means. When accounting for the distribution of the largest variate, this random variable is no longer a Normal variate with a single unknown mean but instead depends on all the means of the sample. Speaking of the largest observation mean is therefore misleading, in that it is neither the mean of the largest observation, nor a parameter *per se*, since the index [of the largest observation] is a random variable induced by the observed sample.
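
Senn’s point is easy to verify by simulation; a sketch with three hypothetical distinct means (0, 0.5, 1) and unit variances, showing that the index of the largest observation is genuinely random:

```python
import numpy as np

rng = np.random.default_rng(1)
mus = np.array([0.0, 0.5, 1.0])            # hypothetical distinct means

n_rep = 100_000
x = mus + rng.standard_normal((n_rep, 3))  # X_i ~ N(mu_i, 1), independent
idx = x.argmax(axis=1)                     # index of the largest observation

# Every component is selected with positive frequency, so "the mean of the
# largest observation" is a random mixture over all three means, not a fixed
# parameter attached to a single Normal variate.
freq = np.bincount(idx, minlength=3) / n_rep
print(freq)
```

Even the component with the smallest mean produces the maximum a non-negligible fraction of the time, so conditioning on the index of the maximum involves all three means at once.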

In conclusion, a very original article, if difficult to assess as it can be argued that selection models other than the “random” case result from an intentional modelling choice of the joint distribution.

## noninformative Bayesian prior with a finite support

Posted in Statistics, University life with tags Bayesian nonparametrics, data dependent priors, minimal description length principle, minimaxity, noninformative priors, objective Bayes, PNAS on December 4, 2018 by xi'an

A few days ago, Pierre Jacob pointed me to a PNAS paper published earlier this year on a form of noninformative Bayesian analysis by Henry Mattingly and coauthors. They consider a prior that “maximizes the mutual information between parameters and predictions”, which sounds very much like José Bernardo’s notion of reference priors. With the rather strange twist of having the prior depend on the data size m, even though they work under an iid assumption. Here information is defined as the difference between the entropy of the prior and the conditional entropy, which is not precisely defined in the paper but looks like the expected [in the data x] Kullback-Leibler divergence between prior and posterior. (I have general issues with the paper in that I often find it hard to read, for a lack of precision and of definition of the main notions.)

One highly specific (and puzzling to me) feature of the proposed priors is that they are supported by a finite number of atoms, which reminds me very much of the (minimax) least favourable priors over compact parameter spaces, as for instance in the iconic paper by Casella and Strawderman (1984), for the same mathematical reason that non-constant analytic functions must have separated maxima. This is conducted under the assumption and restriction of a compact parameter space, which must in most cases be chosen somewhat arbitrarily and not without consequences. I can somehow relate to the notion that a finite support prior translates the limited precision in the estimation brought by a finite sample. In other words, given a sample size of m, there is a maximal precision one can hope for, and producing further decimals would be silly. Still, the fact that the support of the prior is fixed *a priori*, completely independently of the data, is both unavoidable (for the prior to be *prior*!) and very dependent on the choice of the compact set. I would certainly prefer to see a maximal degree of precision expressed *a posteriori*, meaning that the support would then depend on the data. And handling finite support posteriors is rather awkward, in that many notions like confidence intervals do not make much sense in that setup. (Similarly, one could argue that Bayesian non-parametric procedures lead to estimates with a finite number of support points, but these are determined based on the data, not *a priori*.)

Interestingly, the derivation of the “optimal” prior is operated by iterations where the next prior is the renormalised version of the current prior times the exponentiated Kullback-Leibler divergence, which is “guaranteed to converge to the global maximum” for a discretised parameter space. The authors acknowledge that the resolution is poorly suited to multidimensional settings and hence to complex models, and indeed the paper only covers a few toy examples of moderate and even humble dimensions.
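
This multiplicative update, renormalising the current prior times the exponentiated Kullback-Leibler divergence between likelihood and current marginal, is essentially the Blahut–Arimoto iteration from information theory. A minimal sketch on a toy binomial model over a discretised compact parameter space (the grid, sample size, and iteration count are my own choices, not the paper’s):

```python
import numpy as np
from scipy.stats import binom

# Toy model: x | theta ~ Binomial(m, theta), theta on a discretised grid of [0,1]
m = 10
thetas = np.linspace(0.0, 1.0, 201)                # discretised compact parameter space
xs = np.arange(m + 1)
lik = binom.pmf(xs[None, :], m, thetas[:, None])   # p(x | theta), shape (201, m+1)

prior = np.full(len(thetas), 1.0 / len(thetas))    # uniform starting prior
for _ in range(2000):
    marg = prior @ lik                             # marginal p(x) under the current prior
    # KL( p(.|theta) || p(.) ) for each grid point, with 0*log(0) treated as 0
    with np.errstate(divide="ignore", invalid="ignore"):
        kl = np.nansum(lik * np.log(lik / marg), axis=1)
    prior = prior * np.exp(kl)                     # exponentiated-KL multiplicative update
    prior /= prior.sum()                           # renormalise

# The limiting prior concentrates its mass on a small number of atoms,
# far fewer than the 201 grid points it started from.
n_atoms = int((prior > 1e-3).sum())
print(n_atoms)
```

At a fixed point, π(θ) is left unchanged by the update, which is the mutual-information-maximising condition; the concentration on a few separated atoms matches the finite-support behaviour discussed above.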

Another difficulty with the paper is the absence of temporal consistency: since the prior depends on the sample size, the posterior for n i.i.d. observations is no longer the prior for the (n+1)th observation.

“Because it weights the irrelevant parameter volume, the Jeffreys prior has strong dependence on microscopic effects invisible to experiment”

I simply do not understand the above sentence, which apparently counts as a criticism of Jeffreys (1939). And would appreciate anyone enlightening me! The paper goes on to compare priors through Bayes factors, which ignores the main difficulty with an automated solution such as the Jeffreys prior, namely its inability to handle infinite parameter spaces by being almost invariably improper.