**T**his may sound like an absurd question [and in some sense it is!], but this came out of a recent mathematical riddle on The Riddler, asking for the largest number one could write with ten symbols. The difficulty with this riddle is the definition of a symbol, as the collection of available symbols is a very relative concept. For instance, if one takes the symbols available on a basic pocket calculator, besides the 10 digits and the decimal point, there should be the four basic operations plus square root and square,which means that presumably 999999999² is the largest one can on a cell phone, there are already many more operations, for instance my phone includes the factorial operator and hence 9!!!!!!!!! is a good guess. While moving to a computer the problem becomes somewhat meaningless, both because there are very few software that handle infinite precision computing and hence very large numbers are not achievable without additional coding, and because it very much depends on the environment and on which numbers and symbols are already coded in the local language. As illustrated by this X validated answer, this link from The Riddler, and the xkcd entry below. (The solution provided by The Riddler itself is not particularly relevant as it relies on a particular collection of symbols, which mean Rado’s number BB(9999!) is also a solution within the right referential.)

## Archive for coding

## how large is 9!!!!!!!!!?

Posted in Statistics with tags big numbers, cellphone, coding, cross validated, factorial, mathematical puzzle, pocket calculator, Rado's numbers, The Riddler, Turing machine on March 17, 2017 by xi'an## optimal estimation of parameters (book review)

Posted in Books, Statistics with tags Bayesian theory, coding, inference, maximum capacity estimator, maximum likelihood estimation, Rissanen, Shannonś information, statistical theory on September 12, 2013 by xi'an**As** I had read some of Jorma Rissanen’s papers in the early 1990’s when writing The Bayesian Choice, I was quite excited to learn that Rissanen had written a book on the *optimal estimation of parameters*, where he presents and develops his own approach to statistical inference (estimation *and* testing). As explained in the Preface this was induced by having to deliver the 2009 Shannon Lecture at the Information Theory Society conference.

“

Very few statisticians have been studying information theory, the result of which, I think, is the disarray of the present discipline of statistics.” J. Rissanen (p.2)

**N**ow that I have read the book (between Venezia in the peaceful and shaded Fundamenta Sacca San Girolamo and Hong Kong, so maybe in too a leisurely and off-handed manner), I am not so excited… It is not that the theory presented in *optimal estimation of parameters* is incomplete or ill-presented: the book is very well-written and well-designed, if in a highly personal (and borderline lone ranger) style. But the approach Rissanen advocates, namely maximum capacity as a generalisation of maximum likelihood, does not seem to relate to my statistical perspective and practice. Even though he takes great care to distance himself from Bayesian theory by repeating that the prior distribution is not necessary for his theory of *optimal estimation* (“priors are not needed in the general MDL principle”, p.4). my major source of incomprehension lies with the choice of incorporating the estimator within the data density to produce a new density, as in

Indeed, this leads to (a) replace a statistical model with a structure that mixes the model and the estimation procedure and (b) peak the new distribution by always choosing the most appropriate (local) value of the parameter. For a normal sample with unknown mean *θ*, this produces for instance to a joint normal distribution that is degenerate since

*(For a single observation it is not even defined.)* In a similar spirit, Rissanen defines this estimated model for dynamic data in a sequential manner, which means in the end that x_{1} is used n times, x_{2} n-1 times, and so on.., This asymmetry does not sound logical, especially when considering sufficiency.

“

…the misunderstanding that the more parameters there are in the model the better it is because it is closer to the `truth’ and the `truth’ obviously is not simple” J. Rissanen (p.38)

**A**nother point of contention with the approach advocated in *optimal estimation of parameters* is the inherent discretisation of the parameter space, which seems to exclude large dimensional spaces and complex models. I somehow subscribe to the idea that a given sample (hence a given sample size) induces a maximum precision in the estimation that can be translated into using a finite number of parameter values, but the implementation suggested in the book is essentially unidimensional. I also find the notion of optimality inherent to the statistical part of *optimal estimation of parameters *quite tautological as it ends up being a target that leads to the maximum likelihood estimator (or its pseudo-Bayesian counterpart).

“

The BIC criterion has neither information nor a probability theoretic interpretation, and it does not matter which measure for consistency is selected.” J. Rissanen (p.64)

**T**he first part of the book is about coding and information theory; it amounts in my understanding to a justification of the Kullback-Leibler divergence, with an early occurrence (p.27) of the above estimation distribution. (The *channel capacity* is the normalising constant of this weird density.)

“…

in hypothesis testing [where] the assumptions that the hypotheses are `true’ has misguided the entire field by generating problems which do not exist and distorting rational solutions to problems that do exist.” J. Rissanen (p.41)

I have issues with the definition of confidence intervals as they rely on an implicit choice of a measure and have a constant coverage that decreases with the parameter dimension. This notion also seem to clash with the subsequent discretisation of the parameter space. Hypothesis testing *à la* Rissanen reduces to an assessment of a goodness of fit, again with fixed coverage properties. Interestingly, the acceptance and rejection regions are based on two quantities, the likelihood ratio and the KL distance (p. 96), which leads to a delayed decision if they do not agree wrt fixed bounds.

“A drawback of the prediction formulas is that they require the knowledge of the ARMA parameters.”J. Rissanen (p.141)

**A** final chapter on sequential (or dynamic) models reminded me that Rissanen was at the core of inventing variable order Markov chains. The remainder of this chapter provides some properties of the sequential normalised maximum likelihood estimator advocated by the author in the same spirit as the earlier versions. The whole chapter feels (to me) somewhat disconnected from

**I**n conclusion, Rissanen’s book is a definitely interesting entry on a perplexing vision of statistics. While I do not think it will radically alter our understanding and practice of statistics, it is worth perusing, if only to appreciate there are still people (far?) out there attempting to bring a new vision of the field.