## chain event graphs [RSS Midlands seminar]

Posted in pictures, Statistics, University life with tags , , , , , , , , , , on October 16, 2013 by xi'an

Last evening, I attended the RSS Midlands seminar here in Warwick. The theme was chain event graphs (CEG), As I knew nothing about them, it was worth my time listening to both speakers and discussing with Jim Smith afterwards. CEGs are extensions of Bayes nets with originally many more nodes since they start with the probability tree involving all modalities of all variables. Intensive Bayesian model comparison is then used to reduce the number of nodes by merging modalities having the same children or removing variables with no impact on the variable of interest. So this is not exactly a new Bayes net based on modality dummies as nodes (my original question). This is quite interesting, esp. in the first talk illustration of using missing value indicators as a supplementary variable (to determine whether or not data is missing at random). I also wonder how much of a connection there is with variable length Markov chains (either as a model or as a way to prune the tree). A last vague idea is a potential connection with lumpable Markov chains, a concept I learned from Kemeny & Snell (1960): a finite Markov chain is lumpable if by merging two or more of its states it remains a Markov chain. I do not know if this has ever been studied from a statistical point of view, i.e. testing for lumpability, but this sounds related to the idea of merging modalities of some variables in the probability tree…

## A repulsive random walk

Posted in R, Statistics with tags , , , , on May 28, 2010 by xi'an

Matt Asher posted an R experiment on R-bloggers yesterday simulating the random walk

$x_{t+1} = x_t + \varepsilon_t / x_t$

which has the property of avoiding zero by quickly switching to a large value as soon as $x_t$ is small. He was then wondering about the “convergence” of the random walk given that it moves very little once $x_t$ is large enough. The values he found for various horizons t seemed to indicate a stable regime.

I reran the same experiment as Matt in a Monte Carlo perspective, using the R program

resu=matrix(0,ncol=100,nrow=25)
sampl=rnorm(100)
for (i in 1:25){
for (t in 2^(i-1):2^i) sampl=sampl+rnorm(100)/sampl
resu[i,]=sampl
}
boxplot(as.data.frame(t(abs(resu))),name=as.character(1:25),col="wheat3")

The outcome of this R code plotted above shows that the range and the average of the 100 replications is increasing with t. This behaviour indicates a transient behaviour of the Markov chain, which almost surely goes to infinity and never comes back (because at infinity the variance is zero). Another indication for transience is shown by the fact that $x_t$ comes back to the interval (-1,1) with probability $\Phi(-|x_t|)$, a probability which goes to zero with $x_t$. As suggested to me by Randal Douc, this transience can be established rigorously by considering

$x_{t+1}^2 = x_t^2 + 2\epsilon_t + \epsilon_t^2/x_t^2 > x_t^2 + 2\epsilon_t>2\sum_{i=1}^t \epsilon_t$

which is thus bounded from below by a null recurrent process, which almost surely goes to infinity. Therefore the above Markov chain cannot have a stationary distribution or even a stationary measure: it almost surely goes to (plus or minus) infinity.

## Computational Statistics

Posted in Books, R, Statistics with tags , , , , , , , , , , , , , , , , , , on May 10, 2010 by xi'an

Do not resort to Monte Carlo methods unnecessarily.

When I received this 2009 Springer-Verlag book, Computational Statistics, by James Gentle a while ago, I briefly took a look at the table of contents and decided to have a better look later… Now that I have gone through the whole book, I can write a short review on its scope and contents (to be submitted). Despite its title, the book aims at covering both computational statistics and statistical computing. (With 752 pages at his disposal, Gentle can afford to do both indeed!)

The book Computational Statistics is separated into four parts:

• Part I: Mathematical and statistical preliminaries.
• Part II: Statistical Computing (Computer storage and arithmetic.- Algorithms and programming.- Approximation of functions and numerical quadrature.- Numerical linear algebra.- Solution of nonlinear equations and optimization.- Generation of random numbers.)
• Part III: Methods of Computational Statistics (Graphical methods in computational statistics.- Tools for identification of structure in data.- Estimation of functions.- Monte Carlo methods for statistical inference.- Data randomization, partitioning, and augmentation.- Bootstrap methods.)
• Part IV: Exploring Data Density and Relationship (Estimation of probability density functions using parametric models.- Nonparametric estimation of probability density functions.- Statistical learning and data mining.- Statistical models of dependencies.)

Computational inference, together with exact inference and asymptotic inference, is an important component of statistical methods.

The first part of Computational Statistics is indeed a preliminary containing essentials of math and probability and statistics. A reader unfamiliar with too many topics within this chapter should first consider improving his or her background in the corresponding area! This is a rather large chapter, with 82 pages, and it should not be extremely useful to readers, except to signal deficiencies in their background, as noted above. Given this purpose, I am not certain the selected exercises of this chapter are necessary (especially when considering that some involve tools introduced much later in the book).

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different .

The second part of Computational Statistics is truly about computing, meaning the theory of computation, i.e. of using computers for numerical approximation, with discussions about the representation of numbers in computers, approximation errors, and of course random number generators. While I judge George Fishman’s Monte Carlo to provide a deeper and more complete coverage of those topics, I appreciate the need for reminding our students of those hardware subtleties as they often seem unaware of them, despite their advanced computer skills. This second part is thus a crash course of 250 pages on numerical methods (like function approximations by basis functions and …) and on random generators, i.e. cover the same ground as Gentle’s earlier books, Random Number Generation and Monte Carlo Methods and Numerical Linear Algebra for Applications in Statistics, while the more recent Elements of Computational Statistics looks very much like a shorter entry on the same topics as those of Parts III IV of Computational Statistics. This part could certainly sustain a whole semester undergraduate course while only advanced graduate students could be expected to gain from a self-study of those topics. It is nonetheless the most coherent and attractive part of the book. It constitutes a must-read for all students and researchers engaging into any kind of serious programming. Obviously, some notions are introduced a bit too superficially, given the scope of this section (as for instance Monte Carlo methods, in particular MCMC techniques that are introduced in less than six pages), but I came to realise this is the point of the book, which provides an entry into “all” necessary topics, along with links to the relevant literature (if missing Monte Carlo Statistical Methods!). I however deplore that the important issue of Monte Carlo experiments, whose construction is often a hardship for students, is postponed till the 100 page long appendix. (I suspect that students do not read appendices is another of those folk theorems!)

Monte Carlo methods differ from other methods of numerical analysis in yielding an estimate rather than an approximation.

The third and fourth parts of the book cover methods of computational statistics, including Monte Carlo methods, randomization and cross validation, the bootstrap, probability density estimation, and statistical learning. Unfortunately, I find the level of Part III to be quite uneven, where all chapters are short and rather superficial because they try to be all-encompassing. (For instance, Chapter 8 includes two pages on the RGB colour coding.) Part IV does a better job of presenting machine learning techniques, if not with the thoroughness of Hastie et al.’s The Elements of Statistical Learning: Data Mining, Inference, and Prediction… It seems to me that the relevant sections of Part III would have fitted better where they belong in Part IV. For instance, Chapter 10 on estimation of functions only covers the evaluation of estimators of functions, postponing the construction of those estimators till Chapter 15. Jackknife is introduced on its own in Chapter 12 (not that I find this introduction ultimately necessary) without the bootstrap covered in eight pages in Chapter 13 (bootstrap for non-iid data is dismissed rather quickly, given the current research in the area). The first chapter of Part IV covers some (non-Bayesian) estimation approaches for parametric families, but I find this chapter somehow superfluous as it does not belong to the computational statistic domain (except as an approximation method, as stressed in Section 14.4). While Chapter 16 is a valuable entry on clustering and data-analysis tools like PCA, the final section on high dimensions feels out of context (and mentioning the curse of dimensionality only that close to the end of the book does not seem appropriate). Chapter 17 on dependent data is missing the rich literature on graphical models and their use in the determination of dependence structures.

Programming is the best way to learn programming (read that again) .

In conclusion, Computational Statistics is a very diverse book that can be used at several levels as textbook, as well as a reference for researchers (even if as an entry towards further and deeper references). The book is well-written, in a lively and personal style. (I however object to the reduction of the notion of Markov chains to discrete state-spaces!) There is no requirement for a specific programming language, although R is introduced in a somewhat dismissive way (R most serious flaw is usually lack of robustness since some [packages] are not of high-quality) and some exercises start with Design and write either a C or a Fortran subroutine. BUGS is not mentioned at all. The appendices of Computational Statistics also contain the solutions to some exercises, even though the level of detail is highly variable, from one word (“1″) to one page (see, e.g., Exercise 11.4). The 20 page list of references is preceded by a few pages on available journals and webpages, which could get obsolete rather quickly. Despite the reservations raised above about some parts of Computational Statistics that would benefit from a deeper coverage, I think this book is a reference book that should appear in the shortlist of any computational statistics/statistical computing graduate course as well as on the shelves of any researcher supporting his or her statistical practice with a significant dose of computing backup.