## differences between Bayes factors and normalised maximum likelihood

Posted in Books, Kids, Statistics, University life on November 19, 2014 by xi'an

A recent arXival by Heck, Wagenmakers and Morey attracted my attention: Three Qualitative Differences Between Bayes Factors and Normalized Maximum Likelihood, as it provides an analysis of the differences between Bayesian analysis and Rissanen’s Optimal Estimation of Parameters, which I reviewed a while ago. As detailed in this review, I had difficulties with considering the normalised likelihood

$p(x|\hat\theta_x) \big/ \int_\mathcal{X} p(y|\hat\theta_y)\,\text{d}y$

as the relevant quantity. One reason is that the distribution does not make experimental sense: for instance, how can one simulate from this distribution? [I mean, when considering only the original distribution.] Working with the simple binomial B(n,θ) model, the authors show that the quantity corresponding to the posterior probability may be constant for most of the data values, produces a different upper bound and hence a different penalty of model complexity, and may differ in conclusion for some observations. Which means that the apparent proximity between using a Jeffreys prior and Rissanen’s alternative does not go all the way. While it is a short note, only focussed on producing an illustration in the binomial case, I find it interesting that researchers investigate the Bayesian nature (vs. artifice!) of this approach…
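To make the comparison concrete, here is a minimal sketch (mine, not taken from the paper) that tabulates the normalised maximum likelihood for the binomial B(n,θ) model next to the marginal likelihood under the Jeffreys Beta(1/2,1/2) prior. The function names and the choice n=10 are assumptions made for illustration. Since the sample space is finite, the normalised likelihood can be enumerated and sampled directly, which sidesteps rather than answers the simulation question raised above.

```python
from math import exp, lgamma, log
import random


def log_choose(n, y):
    """log of the binomial coefficient C(n, y)."""
    return lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)


def max_likelihood(n, y):
    """p(y | theta_hat_y) for the B(n, theta) model, with theta_hat_y = y / n."""
    if y == 0 or y == n:
        return 1.0  # theta_hat is 0 or 1, so the observed y has probability one
    t = y / n
    return exp(log_choose(n, y) + y * log(t) + (n - y) * log(1 - t))


def nml_pmf(n):
    """Normalised maximum likelihood over y = 0, ..., n."""
    weights = [max_likelihood(n, y) for y in range(n + 1)]
    total = sum(weights)  # the normalising constant of the display above
    return [w / total for w in weights]


def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def jeffreys_marginal_pmf(n):
    """Marginal pmf of y when theta ~ Beta(1/2, 1/2) (a beta-binomial)."""
    return [
        exp(log_choose(n, y) + log_beta(y + 0.5, n - y + 0.5) - log_beta(0.5, 0.5))
        for y in range(n + 1)
    ]


if __name__ == "__main__":
    n = 10  # illustrative sample size
    nml, jeff = nml_pmf(n), jeffreys_marginal_pmf(n)
    for y in range(n + 1):
        print(f"y={y:2d}  NML={nml[y]:.4f}  Jeffreys marginal={jeff[y]:.4f}")
    # Sampling by enumeration, only feasible because the sample space is finite.
    rng = random.Random(0)
    print(rng.choices(range(n + 1), weights=nml, k=5))
```

Both columns sum to one and are symmetric in y, but they do not coincide, which is one elementary way of seeing that the two approaches are close without being identical.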

Posted in Books, Statistics, University life on September 13, 2013 by xi'an

“In the asymptotic limit, the Bayesian cannot justify the strictly positive probability of H0 as an approximation to testing the hypothesis that the parameter value is close to θ0 — which is the hypothesis of real scientific interest”

While revising my Jeffreys-Lindley’s paradox paper for Philosophy of Science, it was suggested (to me) that I read the forthcoming paper by Jan Sprenger on this paradox. The paper is entitled Testing a Precise Null Hypothesis: The Case of Lindley’s Paradox and it defends the thesis that the regular Bayesian approach (hence the Bayes factor used in the Jeffreys-Lindley’s paradox) is forced to put a prior on the (point) null hypothesis when all that really matters is the vicinity of the null. (I think Andrew would agree there as he positively hates point null hypotheses. See also Rissanen’s perspective about the maximal precision allowed by a given sample.) Sprenger then advocates the use of the log score for comparing the full model with the point-null sub-model, i.e. the posterior expectation of the Kullback-Leibler distance between both models:

$\mathbb{E}^\pi\left[\mathbb{E}_\theta\{\log f(X|\theta)/ f(X|\theta_0)\}|x\right],$

rejoining José Bernardo and Phil Dawid on this ground.
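As a worked illustration of this score (under assumptions of my own choosing, not Sprenger's example), take $x_1,\ldots,x_n$ i.i.d. $\mathcal{N}(\theta,1)$, a flat prior on θ, and let $X$ be a single future observation. The inner expectation is the Kullback-Leibler divergence between both normal distributions,

$\mathbb{E}_\theta\{\log f(X|\theta)/ f(X|\theta_0)\} = (\theta-\theta_0)^2/2,$

and, since the posterior is $\theta|x\sim\mathcal{N}(\bar{x},1/n)$, the outer expectation is

$\mathbb{E}^\pi\left[(\theta-\theta_0)^2/2\,|\,x\right] = \left\{(\bar{x}-\theta_0)^2 + 1/n\right\}\big/2.$

In this toy case the score is thus the squared distance between the sample mean and the null value, plus a term vanishing with n, and it remains well-defined even though the prior is improper.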

While I agree about the notion that it is impossible to distinguish a small enough departure from the null from the null (no typo!), and I also support the argument that “all models are wrong”, hence that a point null should eventually (meaning with enough data) be rejected, I find the Bayesian solution through the Bayes factor rather appealing because it uses the prior distribution to weight the alternative values of θ in order to oppose their averaged likelihood to the likelihood in θ0. (Note I did not mention Occam!) Further, while the notion of opposing a point null to the rest of the Universe may sound silly, what truly matters is the decisional setting, namely that we want to select a simpler model and use it for later purposes. It is therefore this issue that should be tested, rather than whether or not θ is exactly equal to θ0. I incidentally find it amusing that Sprenger picks the ESP experiment as his illustration in that this is a (the?) clearcut case where the point null: “there is no such thing as ESP” makes sense. Now, it can be argued that what the statistical experiment is assessing is the ESP experiment, for which many objective causes (beyond ESP!) may induce a departure from the null (and from the binomial model). But then this prevents any rational analysis of the test (as is indeed the case!).
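For readers who want to see the paradox in numbers, here is a small sketch under assumptions of mine (not taken from Sprenger's paper): a normal sample with known unit variance, so that the sample mean is N(θ,1/n), and a N(θ0,τ²) prior on θ under the alternative. The function name `bayes_factor_01` and the default τ²=1 are hypothetical choices. In this setting the Bayes factor in favour of the null only depends on the data through the z-statistic z=√n(x̄−θ0).

```python
from math import exp, sqrt


def bayes_factor_01(z, n, tau2=1.0):
    """Bayes factor of H0: theta = theta0 against H1: theta ~ N(theta0, tau2),
    when xbar ~ N(theta, 1/n) and z = sqrt(n) * (xbar - theta0).

    Ratio of the N(theta0, 1/n) density to the N(theta0, tau2 + 1/n) density at xbar.
    """
    shrink = n * tau2 / (1.0 + n * tau2)
    return sqrt(1.0 + n * tau2) * exp(-0.5 * z * z * shrink)


if __name__ == "__main__":
    z = 1.96  # data sitting exactly at the 5% two-sided significance boundary
    for n in (10, 100, 10_000, 1_000_000):
        print(f"n={n:>9,d}  B01={bayes_factor_01(z, n):.3f}")
```

Holding z fixed at 1.96, the Bayes factor moves from favouring the alternative for small n to favouring the null more and more strongly as n grows, which is the Jeffreys-Lindley phenomenon at the centre of the paper.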

The paper thus objects to the use of Bayes factors (and of p-values) and instead proposes to compare scores in the Bernardo-Dawid spirit. As discussed earlier, this has several appealing features, from recovering the Kullback-Leibler divergence between models as a measure of fit, to allowing for the incorporation of improper priors (a point Andrew may disagree with), to avoiding the double use of the data. It is however incomplete in that it creates a discrepancy or an imbalance between both models, thus making the comparison of more than two models difficult to fathom, and it does not readily incorporate the notion of nuisance parameters in the embedded model, seemingly forcing the inclusion of pseudo-priors as in the Bayesian analysis of Aitkin’s integrated likelihood.

## optimal estimation of parameters (book review)

Posted in Books, Statistics on September 12, 2013 by xi'an

As I had read some of Jorma Rissanen’s papers in the early 1990’s when writing The Bayesian Choice, I was quite excited to learn that Rissanen had written a book on the optimal estimation of parameters, where he presents and develops his own approach to statistical inference (estimation and testing). As explained in the Preface this was induced by having to deliver the 2009 Shannon Lecture at the Information Theory Society conference.

“Very few statisticians have been studying information theory, the result of which, I think, is the disarray of the present discipline of statistics.” J. Rissanen (p.2)

Now that I have read the book (between Venezia in the peaceful and shaded Fondamenta Sacca San Girolamo and Hong Kong, so maybe in too leisurely and off-handed a manner), I am not so excited… It is not that the theory presented in optimal estimation of parameters is incomplete or ill-presented: the book is very well-written and well-designed, if in a highly personal (and borderline lone ranger) style. But the approach Rissanen advocates, namely maximum capacity as a generalisation of maximum likelihood, does not seem to relate to my statistical perspective and practice. Even though he takes great care to distance himself from Bayesian theory by repeating that the prior distribution is not necessary for his theory of optimal estimation (“priors are not needed in the general MDL principle”, p.4), my major source of incomprehension lies with the choice of incorporating the estimator within the data density to produce a new density, as in

$\hat{f}(x) = f(x|\hat{\theta}(x)) / \int f(x|\hat{\theta}(x))\,\text{d}x\,.$

Indeed, this leads to (a) replacing a statistical model with a structure that mixes the model and the estimation procedure and (b) peaking the new distribution by always choosing the most appropriate (local) value of the parameter. For a normal sample with unknown mean θ, this leads for instance to a joint normal distribution that is degenerate since

$\hat{f}(x)\propto f(x|\bar{x}).$

(For a single observation it is not even defined.) In a similar spirit, Rissanen defines this estimated model for dynamic data in a sequential manner, which means in the end that x1 is used n times, x2 n-1 times, and so on. This asymmetry does not sound logical, especially when considering sufficiency.
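To spell out the degeneracy in the normal example, assuming (my choice for the illustration) unit variance and $x=(x_1,\ldots,x_n)$ i.i.d. $\mathcal{N}(\theta,1)$ with $\hat\theta(x)=\bar{x}$, the plug-in likelihood is

$f(x|\bar{x}) = (2\pi)^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^n (x_i-\bar{x})^2\right\},$

which only depends on the residuals $x_i-\bar{x}$ and is therefore unchanged when the same constant c is added to every observation. The integrand being constant along the direction $(1,\ldots,1)$, one gets

$\int_{\mathbb{R}^n} f(x|\bar{x})\,\text{d}x = \infty,$

so the normalisation only makes sense on the (n−1)-dimensional space of residuals, or after restricting the range of $\bar{x}$. When n=1, $f(x|x)=(2\pi)^{-1/2}$ is constant in x, hence cannot be normalised over the real line at all.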

“…the misunderstanding that the more parameters there are in the model the better it is because it is closer to the `truth’ and the `truth’ obviously is not simple” J. Rissanen (p.38)

Another point of contention with the approach advocated in optimal estimation of parameters is the inherent discretisation of the parameter space, which seems to exclude large dimensional spaces and complex models. I somehow subscribe to the idea that a given sample (hence a given sample size) induces a maximum precision in the estimation that can be translated into using a finite number of parameter values, but the implementation suggested in the book is essentially unidimensional. I also find the notion of optimality inherent to the statistical part of optimal estimation of parameters quite tautological as it ends up being a target that leads to the maximum likelihood estimator (or its pseudo-Bayesian counterpart).

“The BIC criterion has neither information nor a probability theoretic interpretation, and it does not matter which measure for consistency is selected.” J. Rissanen (p.64)

The first part of the book is about coding and information theory; it amounts in my understanding to a justification of the Kullback-Leibler divergence, with an early occurrence (p.27) of the above estimation distribution. (The channel capacity is the normalising constant of this weird density.)
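As a rough numerical illustration of this normalising constant (a sketch under my own assumptions, not an example from the book), one can compute it exactly for the binomial B(n,θ) model and compare its logarithm with the usual asymptotic approximation ½ log(nπ/2), obtained from the Fisher information of the Bernoulli model. The function name `log_normaliser` is hypothetical.

```python
from math import exp, lgamma, log, pi


def log_normaliser(n):
    """log of sum_y max_theta p(y | theta) for the binomial B(n, theta) model."""
    total = 0.0
    for y in range(n + 1):
        # log C(n, y); for y = 0 or y = n the maximised likelihood is exactly one
        log_term = lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
        if 0 < y < n:
            log_term += y * log(y / n) + (n - y) * log(1 - y / n)
        total += exp(log_term)
    return log(total)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        approx = 0.5 * log(n * pi / 2)
        print(f"n={n:5d}  exact={log_normaliser(n):.4f}  approx={approx:.4f}")
```

The logarithm of the constant grows like ½ log n, namely the same rate as the BIC penalty for a single parameter.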

“…in hypothesis testing [where] the assumptions that the hypotheses are  `true’ has misguided the entire field by generating problems which do not exist and distorting rational solutions to problems that do exist.” J. Rissanen (p.41)

I have issues with the definition of confidence intervals as they rely on an implicit choice of a measure and have a constant coverage that decreases with the parameter dimension. This notion also seems to clash with the subsequent discretisation of the parameter space. Hypothesis testing à la Rissanen reduces to an assessment of goodness of fit, again with fixed coverage properties. Interestingly, the acceptance and rejection regions are based on two quantities, the likelihood ratio and the KL distance (p. 96), which leads to a delayed decision if they do not agree wrt fixed bounds.

“A drawback of the prediction formulas is that they require the knowledge of the ARMA parameters.” J. Rissanen (p.141)

A final chapter on sequential (or dynamic) models reminded me that Rissanen was at the core of inventing variable order Markov chains. The remainder of this chapter provides some properties of the sequential normalised maximum likelihood estimator advocated by the author in the same spirit as the earlier versions. The whole chapter feels (to me) somewhat disconnected from

In conclusion, Rissanen’s book is a definitely interesting entry on a perplexing vision of statistics. While I do not think it will radically alter our understanding and practice of statistics, it is worth perusing, if only to appreciate that there are still people (far?) out there attempting to bring a new vision of the field.