## maximum likelihood: an introduction

Posted in Books, Statistics with tags , , , , on December 20, 2014 by xi'an

“Basic Principle 0. Do not trust any principle.” L. Le Cam (1990)

Here is the abstract of a International Statistical Rewiew 1990 paper by Lucien Le Cam on maximum likelihood. ISR keeping a tradition of including an abstract in French for every paper, Le Cam (most presumably) wrote his own translation [or maybe wrote the French version first], which sounds much funnier to me and so I cannot resist posting both, pardon my/his French! [I just find “Ce fait” rather unusual, as I would have rather written “Ceci fait”…]:

Maximum likelihood estimates are reported to be best under all circumstances. Yet there are numerous simple examples where they plainly misbehave. One gives some examples for problems that had not been invented for the purpose of annoying maximum likelihood fans. Another example, imitated from Bahadur, has been specially created with just such a purpose in mind. Next, we present a list of principles leading to the construction of good estimates. The main principle says that one should not believe in principles but study each problem for its own sake.

L’auteur a ouï dire que la méthode du maximum de vraisemblance est la meilleure méthode d’estimation. C’est bien vrai, et pourtant la méthode se casse le nez sur des exemples bien simples qui n’avaient pas été inventés pour le plaisir de montrer que la méthode peut être très désagréable. On en donne quelques-uns, plus un autre, imité de Bahadur et fabriqué exprès pour ennuyer les admirateurs du maximum de vraisemblance. Ce fait, on donne une savante liste de principes de construction de bons estimateurs, le principe principal étant qu’il ne faut pas croire aux principes.

The entire paper is just as witty, as in describing the mixture model as “contaminated and not fit to drink”! Or in “Everybody knows that taking logarithms is unfair”. Or, again, in “biostatisticians, being complicated people, prefer to work out not with the dose y but with its logarithm”… And a last line: “One possibility is that there are too many horse hairs in e”.

## a neat (theoretical) Monte Carlo result

Posted in Books, Statistics, University life with tags , , , , on December 19, 2014 by xi'an

Mark Huber just arXived a short paper where he develops a Monte Carlo approach that bounds the probability of large errors

$\mathbb{P}(|\hat\mu_t-\mu|>\epsilon\mu) < 1/\delta$

by computing a lower bound on the sample size r and I wondered at the presence of μ in the bound as it indicates the approach is not translation invariant. One reason is that the standard deviation of the simulated random variables is bounded by cμ. Another reason is that Mark uses as its estimator the median

$\text{med}(S_1R_1,\ldots,S_tR_t)$

where the S’s are partial averages of sufficient length and the R’s are independent uniforms over (1-ε,1+ε): using those uniforms may improve the coverage of given intervals but it also means that the absolute scale of the error is multiplied by the scale of S, namely μ. I first thought that some a posteriori recentering could improve the bound but since this does not impact the variance of the simulated random variables, I doubt it is possible.

## full Bayesian significance test

Posted in Books, Statistics with tags , , , , , , , , , , on December 18, 2014 by xi'an

Among the many comments (thanks!) I received when posting our Testing via mixture estimation paper came the suggestion to relate this approach to the notion of full Bayesian significance test (FBST) developed by (Julio, not Hal) Stern and Pereira, from São Paulo, Brazil. I thus had a look at this alternative and read the Bayesian Analysis paper they published in 2008, as well as a paper recently published in Logic Journal of IGPL. (I could not find what the IGPL stands for.) The central notion in these papers is the e-value, which provides the posterior probability that the posterior density is larger than the largest posterior density over the null set. This definition bothers me, first because the null set has a measure equal to zero under an absolutely continuous prior (BA, p.82). Hence the posterior density is defined in an arbitrary manner over the null set and the maximum is itself arbitrary. (An issue that invalidates my 1993 version of the Lindley-Jeffreys paradox!) And second because it considers the posterior probability of an event that does not exist a priori, being conditional on the data. This sounds in fact quite similar to Statistical Inference, Murray Aitkin’s (2009) book using a posterior distribution of the likelihood function. With the same drawback of using the data twice. And the other issues discussed in our commentary of the book. (As a side-much-on-the-side remark, the authors incidentally  forgot me when citing our 1992 Annals of Statistics paper about decision theory on accuracy estimators..!)

## Topological sensitivity analysis for systems biology

Posted in Books, Statistics, Travel, University life with tags , , , , , , on December 17, 2014 by xi'an

Michael Stumpf sent me Topological sensitivity analysis for systems biology, written by Ann Babtie and Paul Kirk,  en avant-première before it came out in PNAS and I read it during the trip to NIPS in Montréal. (The paper is published in open access, so everyone can read it now!) The topic is quite central to a lot of debates about climate change, economics, ecology, finance, &tc., namely to assess the impact of using the wrong model to draw conclusions and make decisions about a real phenomenon. (Which reminded me of the distinction between mechanical and phenomenological models stressed by Michael Blum in his NIPS talk.) And it is of much interest from a Bayesian point of view since assessing the worth of a model requires modelling the “outside” of a model, using for instance Gaussian processes as in the talk Tony O’Hagan gave in Warwick earlier this term. I would even go as far as saying that the issue of assessing [and compensating for] how wrong a model is, given available data, may be the (single) most under-assessed issue in statistics. We (statisticians) have yet to reach our Boxian era.

In Babtie et al., the space or universe of models is represented by network topologies, each defining the set of “parents” in a semi-Markov representation of the (dynamic) model. At which stage Gaussian processes are also called for help. Alternative models are ranked in terms of fit according to a distance between simulated data from the original model (sounds like a form of ABC?!). Obviously, there is a limitation in the number and variety of models considered this way, I mean there are still assumptions made on the possible models, while this number of models is increasing quickly with the number of nodes. As pointed out in the paper (see, e.g., Fig.4), the method has a parametric bootstrap flavour, to some extent.

What is unclear is how one can conduct Bayesian inference with such a collection of models. Unless all models share the same “real” parameters, which sounds unlikely. The paper mentions using uniform prior on all parameters, but this is difficult to advocate in a general setting. Another point concerns the quantification of how much one can trust a given model, since it does not seem models are penalised by a prior probability. Hence they all are treated identically. This is a limitation of the approach (or an indication that it is only a preliminary step in the evaluation of models) in that some models within a large enough collection will eventually provide an estimate that differs from those produced by the other models. So the assessment may become altogether highly pessimistic for this very reason.

“If our parameters have a real, biophysical interpretation, we therefore need to be very careful not to assert that we know the true values of these quantities in the underlying system, just because–for a given model–we can pin them down with relative certainty.”

In addition to its relevance for moving towards approximate models and approximate inference, and in continuation of yesterday’s theme, the paper calls for nested sampling to generate samples from the posterior(s) and to compute the evidence associated with each model. (I realised I had missed this earlier paper by Michael and co-authors on nested sampling for system biology.) There is no discussion in the paper on why nested sampling was selected, compared with, say, a random walk Metropolis-Hastings algorithm. Unless it is used in a fully automated way,  but the paper is rather terse on that issue… And running either approach on 10⁷ models in comparison sounds like an awful lot of work!!! Using importance [sampling] nested sampling as we proposed with Nicolas Chopin could be a way to speed up this exploration if all parameters are identical between all or most models.

## an extension of nested sampling

Posted in Books, Statistics, University life with tags , , , , , , , on December 16, 2014 by xi'an

I was reading [in the Paris métro] Hastings-Metropolis algorithm on Markov chains for small-probability estimation, arXived a few weeks ago by François Bachoc, Lionel Lenôtre, and Achref Bachouch, when I came upon their first algorithm that reminded me much of nested sampling: the following was proposed by Guyader et al. in 2011,

To approximate a tail probability P(H(X)>h),

• start from an iid sample of size N from the reference distribution;
• at each iteration m, select the point x with the smallest H(x)=ξ and replace it with a new point y simulated under the constraint H(y)≥ξ;
• stop when all points in the sample are such that H(X)>h;
• take

$\left(1-\dfrac{1}{N}\right)^{m-1}$

as the unbiased estimator of P(H(X)>h).

Hence, except for the stopping rule, this is the same implementation as nested sampling. Furthermore, Guyader et al. (2011) also take advantage of the bested sampling fact that, if direct simulation under the constraint H(y)≥ξ is infeasible, simulating via one single step of a Metropolis-Hastings algorithm is as valid as direct simulation. (I could not access the paper, but the reference list of Guyader et al. (2011) includes both original papers by John Skilling, so the connection must be made in the paper.) What I find most interesting in this algorithm is that it even achieves unbiasedness (even in the MCMC case!).

## NIPS 2014

Posted in Kids, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , on December 15, 2014 by xi'an

Second and last day of the NIPS workshops! The collection of topics was quite broad and would have made my choosing an ordeal, except that I was invited to give a talk at the probabilistic programming workshop, solving my dilemma… The first talk by Kathleen Fisher was quite enjoyable in that it gave a conceptual discussion of the motivations for probabilistic languages, drawing an analogy with the early days of computer programming that saw a separation between higher level computer languages and machine programming, with a compiler interface. And calling for a similar separation between the models faced by statistical inference and machine-learning and the corresponding code, if I understood her correctly. This was connected with Frank Wood’s talk of the previous day where he illustrated the concept through a generation of computer codes to approximately generate from standard distributions like Normal or Poisson. Approximately as in ABC, which is why the organisers invited me to talk in this session. However, I was a wee bit lost in the following talks and presumably lost part of my audience during my talk, as I realised later to my dismay when someone told me he had not perceived the distinction between the trees in the random forest procedure and the phylogenetic trees in the population genetic application. Still, while it had for me a sort of Twilight Zone feeling of having stepped in another dimension, attending this workshop was an worthwhile experiment as an eye-opener into a highly different albeit connected field, where code and simulator may take the place of a likelihood function… To the point of defining Hamiltonian Monte Carlo directly on the former, as Vikash Mansinghka showed me at the break.

I completed the day with the final talks in the variational inference workshop, if only to get back on firmer ground! Apart from attending my third talk by Vikash in the conference (but on a completely different topic on variational approximations for discrete particle-ar distributions), a talk by Tim Salimans linked MCMC and variational approximations, using MCMC and HMC to derive variational bounds. (He did not expand on the opposite use of variational approximations to build better proposals.) Overall, I found these two days and my first NIPS conference quite exciting, if somewhat overpowering, with a different atmosphere and a different pace compared with (small or large) statistical meetings. (And a staggering gender imbalance!)

## ABC à Montréal

Posted in Kids, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , on December 13, 2014 by xi'an

So today was the NIPS 2014 workshop, “ABC in Montréal“, which started with a fantastic talk by Juliane Liepe on some exciting applications of ABC to the migration of immune cells, with the analysis of movies involving those cells acting to heal a damaged fly wing and a cut fish tail. Quite amazing videos, really. (With the great entry line of ‘We have all cut  a finger at some point in our lives’!) The statistical model behind those movies was a random walk on a grid, with different drift and bias features that served as model characteristics. Frank Wood managed to deliver his talk despite a severe case of food poisoning, with a great illustration of probabilistic programming that made me understand (at last!) the very idea of probabilistic programming. And  Vikash Mansinghka presented some applications in image analysis. Those two talks led me to realise why probabilistic programming was so close to ABC, with a programming touch! Hence why I was invited to talk today! Then Dennis Prangle exposed his latest version of lazy ABC, that I have already commented on the ‘Og, somewhat connected with our delayed acceptance algorithm, to the point that maybe something common can stem out of the two notions. Michael Blum ended the day with provocative answers to the provocative question of Ted Meeds as to whether or not machine learning needed ABC (Ans. No!) and whether or not machine learning could help ABC (Ans. ???). With an happily mix-up between mechanistic and phenomenological models that helped generating discussion from the floor.

The posters were also of much interest, with calibration as a distance measure by Michael Guttman, in continuation of the poster he gave at MCMski, Aaron Smith presenting his work with Luke Bornn, Natesh Pillai and Dawn Woodard, on why a single pseudo-sample is enough for ABC efficiency. This gave me the opportunity to discuss with him the apparent contradiction with the result of Kryz Łatunsziński and Anthony Lee about the geometric convergence of ABC-MCMC only attained with a random number of pseudo-samples… And to wonder if there is a geometric versus binomial dilemma in this setting, Namely, whether or not simulating pseudo-samples until one is accepted would be more efficient than just running one and discarding it in case it is too far. So, although the audience was not that large (when compared with the other “ABC in…” and when considering the 2500+ attendees at NIPS over the week!), it was a great day where I learned a lot, did not have a doze during talks (!), [and even had an epiphany of sorts at the treadmill when I realised I just had to take longer steps to reach 16km/h without hyperventilating!] So thanks to my fellow organisers, Neil D Lawrence, Ted Meeds, Max Welling, and Richard Wilkinson for setting the program of that day! And, by the way, where’s the next “ABC in…”?! (Finland, maybe?)