“Despite the pursuit of the holy grail of sufficient statistics, most applications will have to settle for the weakest concept of optimal statistics.”

Quiz #1: How does Bayes sufficiency [which preserves the posterior density] differ from sufficiency [which preserves the likelihood function]?

Quiz #2: How does Fisher-information sufficiency [which preserves the information matrix] differ from standard sufficiency [which preserves the likelihood function]?
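For the record, here is a minimal sketch of the three notions in standard notation, with t(X) a statistic; this is textbook material in my own writing, not taken from the paper:

```latex
% classical sufficiency (Fisher-Neyman factorisation): the likelihood depends
% on the data only through t(x), up to a theta-free factor
f(x \mid \theta) = g\{t(x), \theta\}\, h(x)

% Bayes sufficiency (relative to a prior \pi): conditioning on t(x) returns
% the same posterior as conditioning on the full data x
\pi(\theta \mid x) = \pi\{\theta \mid t(x)\} \quad \text{a.s.}

% Fisher-information sufficiency: t(X) retains the full Fisher information
\mathcal{I}_{t(X)}(\theta) = \mathcal{I}_{X}(\theta) \quad \text{for all } \theta
```

Classical sufficiency implies the other two; whether and when the converses hold is precisely what the quizzes are after.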

Read a recent arXival by Till Hoffmann and Jukka-Pekka Onnela that I frankly found most puzzling… Maybe because the Norman train on which I was travelling was particularly noisy.

The argument in the paper is to find a summary statistic that minimises the [empirical] expected posterior entropy, which equivalently means minimising the expected Kullback-Leibler divergence to the full posterior. And maximising the mutual information between parameters θ and summaries t(.). And maximising the expected surprise. Which obviously requires breaking the sample into iid components and hence considering the gain brought by a specific transform of a *single* observation. The paper also contains a long comparison with other criteria for choosing summaries.
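Spelling out these equivalences in my own notation, with H{θ|t(X)} the expected entropy of the summary-based posterior, they hinge on two identities whose t-free terms are fixed:

```latex
% expected KL divergence from the full posterior to the summary-based posterior
\mathbb{E}_X\big[\,\mathrm{KL}\{\pi(\theta \mid X) \,\|\, \pi(\theta \mid t(X))\}\,\big]
  = H\{\theta \mid t(X)\} - H(\theta \mid X)

% mutual information between the parameter and the summary
I\{\theta ; t(X)\} = H(\theta) - H\{\theta \mid t(X)\}
```

Since neither the prior entropy H(θ) nor the full-data expected posterior entropy H(θ|X) depends on the choice of t, minimising the expected posterior entropy H{θ|t(X)}, minimising the expected Kullback-Leibler divergence, and maximising the mutual information all select the same summary.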

“Minimizing the posterior entropy would discard the sufficient statistic t such that the posterior is equal to the prior–we have not learned anything from the data.”

Furthermore, the *expected* aspect of the criterion takes us away from a proper Bayes analysis (and leads to artefacts such as the one quoted above), which somehow makes me question the relevance of comparing entropies under different distributions. It took me a long while to realise that the collection of summaries was set by the user and quite limited. Like a neural network representation of the posterior mean. And the *intractable* posterior is further *approximated* by a closed-form function of the parameter θ and of the summary t(.). Using there a neural density estimator. Or a mixture density network.
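As a rough illustration of that last point, here is a minimal sketch, assuming PyTorch, of a mixture density network that turns a fixed, user-chosen summary into a closed-form conditional density for θ; the toy Normal model, the sample-mean summary, and all names are mine, not the authors’:

```python
# Minimal mixture density network: maps a summary t(x) to a Gaussian mixture
# over theta, fitted by minimising the average negative log density of
# simulated (theta, t) pairs, i.e. a Monte Carlo estimate of the cross-entropy,
# which upper-bounds (and at the optimum estimates) the expected posterior entropy.
import torch
import torch.nn as nn

class SummaryMDN(nn.Module):
    def __init__(self, summary_dim=1, n_components=5, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(summary_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.logits = nn.Linear(hidden, n_components)      # mixture weights
        self.means = nn.Linear(hidden, n_components)       # component means
        self.log_scales = nn.Linear(hidden, n_components)  # component log-sds

    def log_prob(self, summary, theta):
        h = self.trunk(summary)
        log_w = torch.log_softmax(self.logits(h), dim=-1)
        comp = torch.distributions.Normal(self.means(h), self.log_scales(h).exp())
        # log q(theta | t(x)) for a scalar theta under the Gaussian mixture
        return torch.logsumexp(log_w + comp.log_prob(theta.unsqueeze(-1)), dim=-1)

def simulate(batch, n=10):
    # toy model: theta ~ N(0,1), x_1,...,x_n ~ N(theta,1), summary = sample mean
    theta = torch.randn(batch)
    x = theta.unsqueeze(-1) + torch.randn(batch, n)
    return x.mean(dim=-1, keepdim=True), theta

mdn = SummaryMDN()
opt = torch.optim.Adam(mdn.parameters(), lr=1e-3)
for step in range(2000):
    t, theta = simulate(256)
    loss = -mdn.log_prob(t, theta).mean()  # estimated expected posterior entropy
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The converged loss then serves as the entropy score attached to that particular summary, to be compared across the user-supplied candidates.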