## machine learning [book review, part 2]

Posted in Books, R, Statistics, University life on October 22, 2013 by xi'an

The chapter (Chap. 3) on Bayesian updating or learning (a most appropriate term) for discrete data in Machine Learning: a Probabilistic Perspective is well done, if a bit stretched (which is easy with 1000 pages left!). I like the remark (Section 3.5.3) about the log-sum-exp trick. While lengthy, the chapter (Chap. 4) on Gaussian models has the appeal of introducing LDA. The true chapter about Bayesian statistics (Chap. 5) only comes later, which seems a wee bit late to me, but it mentions the paper by Druilhet and Marin (2007) about the dependence of the MAP estimator on the dominating measure. The Bayesian chapter covers the Bayesian interpretation of false discovery rates, and decision theory (shared with the following chapter on frequentist statistics). This latter chapter also covers the pathologies of p-values. The chapter on regression has a paragraph on the g-prior and its extensions (p.238). There are chapters on DAGs, mixture models, EM (which mentions the early MCEM of Celeux and Diebolt!), factor and principal component analyses, Gaussian processes, CART models, HMMs and state-space models, MRFs, variational Bayes, belief and expectation propagation, and more… Most of the methods are implemented within a MATLAB package called PMTK (probabilistic modelling toolkit) that I did not check (because it is MATLAB!).
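For readers who have not met it, the log-sum-exp trick amounts to factoring the largest term out of the sum before exponentiating, which avoids overflow and underflow. A minimal sketch (in Python for illustration, not the book's PMTK code):

```python
import math

def log_sum_exp(logs):
    """Compute log(sum(exp(l) for l in logs)) without overflow."""
    m = max(logs)  # factor the largest term out of the sum
    return m + math.log(sum(math.exp(l - m) for l in logs))

# Naive evaluation fails: math.exp(1000) raises OverflowError, yet
print(log_sum_exp([1000.0, 1000.0]))  # 1000 + log(2), about 1000.693
```

The trick matters whenever log-likelihoods or log-posteriors of several hundred units must be combined, as in mixture responsibilities or particle weights.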

There are two (late!) chapters dedicated to simulation methods, Monte Carlo Inference (Chap. 23) and MCMC Inference (Chap. 24). (I am somehow unhappy with the label Inference in those titles, as those are simulation methods.) They cover the basics and more, including particle filters to some extent (but missing some of the most recent material, like Del Moral, Doucet & Jasra, 2006, or Andrieu, Doucet & Holenstein, 2010). (When introducing the Metropolis-Hastings algorithm, the author states the condition that the union of the supports of the proposals should include the support of the target, but this is a rather formal condition, as the Markov chain may still fail to be irreducible in that case.) My overall feeling is that too much is introduced in too little space, potentially confusing the student. See, e.g., the half-page Section 24.3.7 (p.855) on reversible jump MCMC. Or the other half-page on Hamiltonian MCMC (p.868). An interesting entry is the study of the performances of the original Gibbs sampler of Geman & Geman (1984), which started the field (to some extent). It states that, unless the hyperparameters are extremely well-calibrated, the Gibbs sampler suggested therein fails to produce a useful segmentation algorithm! The section on convergence diagnostics is rather limited and refers to oldish methods, rather than suggesting a multiple-chain empirical exploratory approach. Similarly, there is only one page (p.872) of introduction to marginal likelihood approximation techniques, half of which is wasted on the harmonic mean, “worst Monte Carlo method ever”. And the other half is spent criticising Besag's candidate method as exploited by Chib (1995).
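To make the discussion concrete for students meeting the algorithm here: a bare-bones random-walk Metropolis-Hastings sampler fits in a dozen lines. This is a generic sketch in Python, not the book's implementation; the symmetric Gaussian proposal makes the Hastings ratio reduce to the ratio of target densities:

```python
import math, random

def metropolis_hastings(log_target, x0, n_iter, scale=1.0):
    """Random-walk Metropolis: symmetric Gaussian proposal, so the
    acceptance probability is min(1, target(y)/target(x))."""
    x, chain = x0, []
    for _ in range(n_iter):
        y = x + random.gauss(0.0, scale)       # propose a local move
        if math.log(random.random()) < log_target(y) - log_target(x):
            x = y                              # accept, else stay put
        chain.append(x)
    return chain

# Example: sample a standard normal target from its log-density
random.seed(1)
chain = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 20000)
mean = sum(chain) / len(chain)  # close to 0 for a run this long
```

Working only with log-densities, as above, is what makes tricks like log-sum-exp relevant in the first place.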

Now, a wee bit more into detailed nitpicking (if only to feed the 'Og!): first, the mathematical rigour is not always “au rendez-vous” (up to par), and the handling of Dirac masses, conditionals, and big-Oh (Exercise 3.20) is too hand-waving for my taste (see p.39 for an example). I also dislike the notion of the multinoulli distribution (p.35), first because it is a poor pun on Bernoulli's name, second because sufficiency makes this distribution somewhat irrelevant when compared with the multinomial distribution. Although the book rather fairly covers the dangers and shortcomings of MAP estimators in Section 5.2.1.3 (p.150), this remains the default solution. Monte Carlo is not “a city in Europe known for its plush gambling casinos” but the district of Monaco where the casino stands. And it is written Monte-Carlo in the original. The approximation of π by Monte Carlo is the one I used in my Aussie public lecture, but it would have been nice to know the number of iterations (p.54). The book unnecessarily and most vaguely refers to Taleb about the black swan paradox (p.77). The first introduction of Bayesian tests is to use the HPD interval and check whether the null value is inside, with a prosecutor's fallacy in conclusion (p.137). BIC then AIC are introduced (p.162) and the reader remains uncertain about which one to use, if any. (Neither!) The fact that the MLE and the posterior mean differ (p.165) is not a sign of informativeness in the prior. The processing of the label switching problem for mixtures (p.841) is confusing in that the inference problem (invariance under permutation, which prohibits using posterior means) is compounded by the simulation problem (failing to observe this invariance in simulations). The Rao-Blackwellisation Theorem (p.841) does not apply to cases other than two-stage Gibbs sampling, but this is not clear from the text. The adaptive MCMC amcmc package of Jeff Rosenthal is not mentioned (because it is in R?).
The proof of detailed balance (p.854-855) should take a line. Having so many references (35 pages) is both a bonus and a nuisance in a textbook, where students dislike the repeated occurrence of “see (so-&-so…)”. I also dislike references being given within parentheses at all times, as in “See (Doucet et al. 2001) for details”. And, definitely the least important remark!, the quotes at the beginning are not particularly novel or relevant: the book could do without them. (Same thing for the “no free lunch theorem”, which is not particularly helpful as presented…)

In conclusion, Machine Learning: a Probabilistic Perspective offers a fairly wide, unifying, and comprehensive perspective on the field of statistics, aka machine learning, that can certainly be used as the textbook in a Master program where this is the only course of statistics, aka machine learning. (Having not read other machine learning books thoroughly, I cannot judge how innovative it is. The beginning tries to build intuition about what the book covers before introducing the models. Just not my way of proceeding, but mostly a matter of taste and maybe of audience…) The computational aspects are not treated in enough depth for my taste and my courses, but there are excellent books on those aspects. The Bayesian thread sometimes runs a wee bit thin, but remains a thread nonetheless throughout the book. Thus, a nice textbook for the appropriate course and a reference for many.

## Carnon [and Core, end]

Posted in Books, Kids, pictures, R, Running, Statistics, Travel, University life on June 16, 2012 by xi'an

Yet another full day working on Bayesian Core with Jean-Michel in Carnon… This morning, I ran along the canal for about an hour and at last saw some pink flamingos close enough to take pictures (if only to convince my daughter that there were flamingos in the area!). Then I worked full-time on the spatial statistics chapter, using a small dataset on sedges that we found in Gaetan and Guyon’s Spatial Statistics and Modelling. I am almost done tonight, with both path sampling and ABC R codes documented and working for this dataset. But I’d like to re-run both codes for longer to achieve smoother outcomes.

## yet more questions about Monte Carlo Statistical Methods

Posted in Books, Statistics, University life on December 8, 2011 by xi'an

As a coincidence, here is the third email this week about typos in Monte Carlo Statistical Methods, this time from Peng Yu. (Which suits me well in terms of posts, as I am currently travelling to Provo, Utah!)

I’m reading the section on importance sampling. But there are a few cases in your book MCSM2 that are not clear to me.

On page 96: “Theorem 3.12 suggests looking for distributions g for which |h|f/g is almost constant with finite variance.”

What is the precise meaning of “almost constant”? If |h|f/g is almost constant, how come its variance is not finite?

“Almost constant” is not a well-defined property, I am afraid. By this sentence on page 96 we meant using densities g that make |h|f/g vary as little as possible while remaining manageable. Hence the insistence on the finite variance. Of course, the closer |h|f/g is to a constant function, the more likely the variance is to be finite.

“It is important to note that although the finite variance constraint is not necessary for the convergence of (3.8) and of (3.11), importance sampling performs quite poorly when (3.12) ….”

It is not obvious to me why importance sampling performs poorly when (3.12) holds. I might have overlooked some very simple facts. Would you please remind me why this is the case? From the previous discussion in the same section, it seems that h(x) is missing in (3.12). I think that (3.12) should be (please compare with the first equation in Section 3.3.2)

$\int h^2(x) f^2(x)/g(x)\,\text{d}x = +\infty$

The reason for preferring a finite variance of f/g, and hence for excluding (3.12), is that we would like the importance function g to work well for most integrable functions h. Hence the requirement that the importance weight f/g itself behaves well. It guarantees some robustness across the h's and also avoids checking the finite variance condition (as in your displayed equation) for every function h that is square-integrable against g, by virtue of the Cauchy-Schwarz inequality.

## Time series

Posted in Books, R, Statistics on March 29, 2011 by xi'an

(This post got published on The Statistics Forum yesterday.)

The short book review section of the International Statistical Review sent me Raquel Prado’s and Mike West’s book, Time Series (Modeling, Computation, and Inference) to review. The current post is not about this specific book, but rather on why I am unsatisfied with the textbooks in this area (and correlatively why I am always reluctant to teach a graduate course on the topic). Again, I stress that the following is not specifically about the book by Raquel Prado and Mike West!

With the noticeable exception of Brockwell and Davis' Time Series: Theory and Methods, most time-series books seem to suffer (in my opinion) from the same difficulty, which sums up as being unable to provide the reader with a coherent and logical description of/introduction to the field. (This echoes a complaint made by Håvard Rue a few weeks ago in Zurich.) Instead, time-series books appear to haphazardly pile up notions and techniques, theory and methods, without paying much attention to the coherency of the presentation. That's how I was introduced to the field (even though it was by a fantastic teacher!) and the feeling has not left me since then. It may be due to the fact that the field stemmed partly from signal processing in engineering and partly from econometrics, but such presentations never achieve a Unitarian front on how to handle time series. In particular, the opposition between the time domain and the frequency domain always escapes me. This is presumably due to my inability to see the relevance of the spectral approach, as harmonic regression simply appears (to me) as a special type of non-linear regression with sinusoidal regressors and with a well-defined likelihood that requires neither Fourier frequencies nor periodograms (nor spectral density estimation). Even within the time domain, I find the handling of stationarity by time-series books to be mostly cavalier. Why stationarity is important is never addressed, which leaves the reader with the hard choice between imposing stationarity and not imposing it. (My original feeling was to let the issue be decided by the data, but this is not possible!) Similarly, causality is often invoked as a reason to set constraints on MA coefficients, even though this rests on a non-mathematical justification, namely preventing dependence on the future. I thus wonder if being a Unitarian (i.e. following a single logical process for analysing time-series data) is at all possible in the time-series world! E.g., in Bayesian Core, we processed AR, MA, and ARMA models within a single perspective, conditioning on the initial values of the series and imposing all the usual constraints on the roots of the lag polynomials, but this choice was far from perfectly justified…
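To illustrate the point about harmonic regression (my own sketch, on made-up data, not from any of the books discussed): for a fixed frequency ω, fitting a·cos(ωt) + b·sin(ωt) is ordinary least squares with two sinusoidal regressors, with no need for Fourier frequencies or a periodogram:

```python
import math

def fit_harmonic(y, t, omega):
    """Least-squares fit of y_i ≈ a*cos(omega*t_i) + b*sin(omega*t_i),
    solving the 2x2 normal equations directly."""
    c = [math.cos(omega * ti) for ti in t]
    s = [math.sin(omega * ti) for ti in t]
    scc = sum(ci * ci for ci in c)
    sss = sum(si * si for si in s)
    scs = sum(ci * si for ci, si in zip(c, s))
    syc = sum(yi * ci for yi, ci in zip(y, c))
    sys_ = sum(yi * si for yi, si in zip(y, s))
    det = scc * sss - scs * scs
    a = (syc * sss - sys_ * scs) / det
    b = (sys_ * scc - syc * scs) / det
    return a, b

# Noise-free check: recover the coefficients of a known sinusoid
t = [0.1 * i for i in range(200)]
y = [2.0 * math.cos(1.5 * ti) - 1.0 * math.sin(1.5 * ti) for ti in t]
a, b = fit_harmonic(y, t, 1.5)  # recovers a = 2, b = -1
```

Only when ω itself is unknown does the problem become non-linear, which is where the spectral machinery usually enters.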

## Andrew’s criticisms

Posted in Books, Statistics on January 23, 2010 by xi'an

Andrew Gelman has just written a most entertaining review of “Introducing Monte Carlo Methods with R” on his blog. The first sentence is ominous, as the book seemingly reminded him of communists and fascists…! The explanation for this frightening debut is that the connection between the components of statistics

… ↔ Probability theory ↔ Theoretical statistics ↔ Statistical methodology ↔ Applications ↔ Computation ↔ Probability theory ↔ …

may be seen as a torus, just like the range of political ideologies, the argument being that both George and I switched from proving mathematical minimaxity theorems about James-Stein estimators to proving convergence theorems about Metropolis-Hastings algorithms. After pondering Andrew's lines for a while, I am far from sure this is a positive assessment of Introducing Monte Carlo Methods with R! Indeed, at first glance, it may give the blog reader the feeling that this is yet another theoretical book about Monte Carlo methods, written by theorists and mainly for theorists (Andrew wrote “applied researchers such as myself will get much more use out of theory as applied to computation”)… Yet we strove to distance ourselves from producing a baby version of Monte Carlo Statistical Methods, choosing the format of a Use R! book to clarify even further the purpose of the book: to lead (our students and) our readers to understand Monte Carlo methods through worked-out examples, to the point of developing their own methods, while keeping the theory at bay.

A second read shows that Andrew's point is much more subtle, namely that, as (formerly?) mathematical statisticians, we have adopted a terse style that (maybe unconsciously) shies away from giving too many details and explanations: once a definition is provided, it should suffice by itself! This leads to what Andrew calls little puzzles, where the reader needs to stop and reason out why things are as they are. (“I noticed a bunch of other examples of this sort, where the narrative just flows by and, as a reader, you have to stop and grab it. Lots of fun.”) I noticed the same reactions from my students, so I quite agree with this point. When learning with a book, you need to sit with a piece of paper on one side (if the margins are too narrow) and your computer on the other side, and test everything for yourself. This is actually an intended feature, if one not spelled out clearly enough, and I thus appreciate very much Andrew's conclusion that “it would also be an excellent book for a course on statistical computing”!

There is also Andrew's comment that the book is ugly, which stings, but again can be seen in a different light. I obviously do not find Introducing Monte Carlo Methods with R ugly, but the printing could indeed have been nicer, and the fact that the printers used the jpeg versions of the figures instead of the postscript or pdf versions did not help. The raw R output presented verbatim on most pages is not particularly beautiful either, but this is truly intended, for readers who cannot test the code immediately (as when reading in the metro or listening to the course at the same time). The R programs are far from perfect R programs, but rather examples of what a “standard” beginner would do. I also agree with the suggestion of an epilogue: we wrote several times during the course of the book that we were not providing the big picture and that many aspects of the Monte Carlo methodology were not covered, but this would be worth repeating at the end, along with the few general recommendations we can make about better R programming. Another thing to add in the next edition!

A final interesting remark is that the very first comment on Andrew's post was about solutions! This is a strong request from readers nowadays, and thus seems like a compulsory element of publishing books with exercises. (As we discovered a wee bit too late for Bayesian Core!)

## “Introducing Monte Carlo Methods with R” is out!

Posted in Books, R, Statistics on December 10, 2009 by xi'an

That's it! “Introducing Monte Carlo Methods with R” is out, truly out: I have received a copy from Springer by express mail today! (If you need any further proof, it is also advertised as In stock by Amazon.) Given that the printer exactly reproduces the pdf file sent to Springer, there is no element of surprise as with my earliest book (where I found a particularly horrendous typo made by the French publisher on the back cover!), but it is nonetheless a very pleasant feeling to (finally!) take hold of one's new book! Since there must be remaining typos and even more obscure points, feel free to contact George Casella or myself for corrections and clarifications.