## machine learning [book review, part 2]

**T**he chapter (Chap. 3) on Bayesian updating or learning (a most appropriate term) for discrete data is well-done in *Machine Learning, a probabilistic perspective* if a bit stretched (which is easy with 1000 pages left!). I like the remark (Section 3.5.3) about the log-sum-exp trick. While lengthy, the chapter (Chap. 4) on Gaussian models has the appeal of introducing LDA. The true chapter about Bayesian statistics (Chap. 5) only comes later, which seems a wee bit late to me, but it mentions the paper by Druilhet and Marin (2007) about the dependence of the MAP estimator on the dominating measure. The Bayesian chapter covers the Bayesian interpretation of false discovery rates, And decision-theory (shared with the following chapter on frequentist statistics). This later chapter also covers the pathologies of p-values. The chapter on regression has a paragraph on the g-prior and its extensions (p.238). There are chapters on DAGs, mixture models, EM (which mentions the early MCEM of Celeux and Diebolt!), factor and principal component analyses, Gaussian processes, CART models, HMMs and state-space models, MFRs, variational Bayes, belief and expectation propagations, and more… Most of the methods are implemented within a MATLAB package called PMTK (probabilistic modelling toolkit) that I did not check (because it is MATLAB!).

**T**here are two (late!) chapters dedicated to simulation methods, Monte Carlo Inference (Chap. 23) and MCMC Inference (Chap.24). (I am somehow unhappy with the label *Inference* in those titles as those are simulation methods.) They cover the basics and more, including particle filters to some extent (but missing some of the most recent stuff, like Del Moral, Doucet & Jasra, 2006, or Andrieu, Doucet & Hollenstein, 2010). (When introducing the Metropolis-Hastings algorithm, the author states the condition that the union of the supports of the proposal should include the support of the target but this is a rather formal condition as the Markov chain may still fail to be irreducible in that case.) My overall feeling is that too much is introduced in too little space, potentially confusing the student. See, e.g., the half-page Section 24.3.7 (p.855) on reversible jump MCMC. Or the other half-page on Hamiltonian MCMC (p.868). An interesting entry is the study of the performances of the original Gibbs sampler of Geman & Geman (1984), which started the field (to some extent). It states that, unless the hyperparameters are extremely well-calibrated, the Gibbs sampler suggested therein fails to produce a useful segmentation algorithm! The section on convergence diagnoses is rather limited and referring to rather oldish methods, rather than suggesting a multiple-chain empirical exploratory approach. Similarly, there is only one page (p.872) of introduction to marginal likelihood approximation techniques, half of which is wasted on the harmonic mean “worst Monte Carlo method ever”. And the other half is spent on criticising Besag‘s candidate method exploited by Chib (1995).

**N**ow, a wee bit more into detailed nitpicking (if only to feed the ‘Og!): first, the mathematical rigour is not always “au rendez-vous” and the handling of Dirac masses and conditionals and big-Oh (Exercise 3.20)( is too hand-waving for my taste (see p.39 for an example). I also dislike the notion of the multinoulli distribution (p.35), first because it is a poor pun on Bernoulli‘s name, second because sufficiency makes this distribution somewhat irrelevant when compared with the multinomial distribution. Although the book rather fairly covers the dangers and shortcomings of MAP estimators in Section 5.2.1.3 (p.150), this remains the default solution. Monte Carlo is not “a city in Europe known for its plush gambling casinos” but the *district* of Monaco where *the* casino stands. And it writes Monte-Carlo in the original. The approximation of π by Monte Carlo is the one I used in my Aussie public lecture, but it would have been nice to know the number of iterations (p.54). The book unnecessarily and most vaguely refers to Taleb about the black swan paradox (p.77). The first introduction of Bayesian tests is to use the HPD interval and check whether the null value is inside, with a prosecutor’s fallacy in conclusion (p.137). BIC then AIC are introduced (p.162) and the reader remains uncertain about which one to use. If any. Not! The fact that the MLE and the posterior mean differ (p.165) is not a sign of informativeness in the prior. The processing of the label switching problem for mixtures (p.841) is confusing in that the inference problem (invariance by permutation that prohibits using posterior means) is compounded by the simulation problem (failing to observe this behaviour in simulations). The Rao-Blackwellisation Theorem (p.841) does not apply to other cases than two-stage Gibbs sampling, but this is not clear from the text. The adaptive MCMC amcmc package of Jeff Rosenthal is not mentioned (because it is in R?). The proof of detailed balance (p.854-855) should take a line. Having so many references (35 pages) is both a bonus and a nuisance in a textbook, where students dislike the repeated occurrence of *“see (so-&-so….”*. I also dislike references being given with a parenthesis at all time, as in *“See (Doucet et al. 2001) for details”*. And, definitely the least important remark!, the quotes at the beginning are not particularly novel or relevant: the book could do without them. (Same thing for the “no free lunch theorem” which is not particularly helpful as presented…)

**I**n conclusion, *Machine Learning, a probabilistic perspective* offers a fairly wide, unifying, and comprehensive perspective on the field of statistics, aka machine learning, that can certainly be used as *the* textbook in a Master program where this is the only course of statistics, aka machine learning. (Having not read other machine learning books thoroughly, I cannot judge how innovative it is. The beginning is trying to build the intuition of what the book is about before introducing the models. Just not my way of proceeding but mostly a matter of taste and maybe of audience…) The computational aspects are not treated in enough depth for my taste and my courses, but there are excellent books on those aspects. The Bayesian thread sometimes run a wee bit thin, but remains a thread nonetheless throughout the book. Thus, a nice textbook for the appropriate course and a reference for many.

October 22, 2013 at 5:33 pm

Dan:

This is like saying business is business accounting because accounting has to be done.

I did have a group of graduate Epidemiology students once ask me to explain MCMC inference (also known as Bayesian inference in less technical literature).

Today we have to use MCMC for most applications, someday maybe just MC and maybe someday systematic approximation (super smart numerical integration) or at least a mix.

October 22, 2013 at 10:14 pm

The point that I was (clumsily) trying to make is that the inference that we make fundamentally depends on the approximation scheme we use for inference. (And you may still disagree with me on this and you may be right – I’ve been wrong before!)

It is typical to treat MCMC as a “ground truth” for Bayesian stats (especially when comparing with other methods) and the reality is that we are computing things and making statistical decisions based on samples from some other distribution.

In practice, the chains we use cannot even target the correct distribution if we run them to infinity. Think of the rejection step. The day 1 lesson of scientific computing is that that type of decision cannot be implemented exactly in floating point.

So basically, all Bayesian computing is Approximate and it’s probably instructive to treat it that way. Or at least find situations where it doesn’t matter.

For example, making incorrect decisions in the Hastings step doesn’t matter too much *as long as you don’t run the chain for too long*! (In the sense that in stationarity you are coupled to an ergodic Markov chain that targets the correct distribution for, on average O(epsilon^{-1}) steps where epsilon is the error)

So you will make the correct decisions, but it’s a challenge to call it Bayesian in any pure sense because you’re doing it all based on possibly non-convergent approximations.

October 23, 2013 at 11:27 am

I (conditionally) agree with you that if we can improve our MCMC without limit then there is no issue here. I just don’t think that we can.

There are fundamental problems with “attainable accuracy” in numerical algorithms that, for hard problems, may induce a significant “false accept/reject rate” and therefore lead to an “uncertainty principle” type thing.

This is particularly true for spatial stats (my great and enduring love), where the condition number of the covariance matrix grows polynomially (sometimes exponentially) with the number of locations. This means that it’s not uncommon to get condition number >> 10^15 which is a problem as, pretty much, every digit in the condition number corresponds to one fewer digit of attainable accuracy.

The question is “does this ever dominate the Cramer-Rao-type bounds?” And I’m not sure what the answer is (in non-pathological/limiting cases)

October 23, 2013 at 7:50 pm

Dan: You might be interested in Larry’s post – http://normaldeviate.wordpress.com/2013/08/19/statistics-and-dr-strangelove/

Where I commented –

Though a more direct connection would be Peirce’s theory of realism where as you can’t ever get at what’s “real” or know it you can instead define what an ongoing community of (infinite) investigators would eventually settle on as being what to take as the truth.

But I am not convinced approximation has such a important limitation.

October 23, 2013 at 10:16 pm

This sounds much more like some kind of idealism or empiricism, than as a special branch of realism… Not that I claim any expertise in philosophy!

October 24, 2013 at 2:17 pm

Peirce also defined “ideal-realism” as “a metaphysical doctrine which combines the principles of idealism and realism.” from http://suo.ieee.org/email/msg13155.html

Not that I have more than just dabbled in this to get some helpful ideas for applying statistics (hopefully) more thoughtfully…

October 22, 2013 at 10:35 pm

I consider there is a distinction to be made between approximation and inference. While the methods we study are producing samples from approximately the posterior distribution rather than the exact posterior itself, we do not draw inference about the unavailable posterior the same way we try to infer about the distribution of the data or about some small number of parameters of interest. The distinction is that, while we can theoretically improve our MC or MCMC approximation with no limit, hence reduce the MC(MC) variability almost arbitrarily, the error resulting from the data cannot get under the Cramer-Rao bound or its equivalent. Monte Carlo methods add to the noise and error, but I find it hard to call them inference…

October 22, 2013 at 1:13 am

I am (given my apparent contrarian nature) actually a fan of the “Monte Carlo inference” tag. Given that (MC)MC forms the basis of pretty much all Bayesian inference and given that it is certainly an approximation, I think the term is very very very appropriate. (Certainly more appropriate than “exact”, “exact-approximate” or even “correct” MCMC!)