## Goats do room

Posted in Books, Kids, R, Statistics, Wines on July 16, 2022 by xi'an

The riddle of the week is about 10 goats sequentially moving to their room, which they have chosen at random and independently (among ten rooms), unless another goat already occupies the room, in which case they move to the first free room with a higher number or fail. What is the probability that all goats end up in a room?

Coding the experiment is straightforward:

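```r
N=10; M=1e5; f=0                  # N goats and rooms; M replications (M is an assumed value); f counts failures
for(m in 1:M){
  g=sample(1:N,N,replace=TRUE)    # room picked by each goat
  o=0*g                           # occupancy indicators, all rooms free at the start
  for(i in 1:N){
    if(min(o[g[i]:N])){           # rooms g[i],...,N all taken: this goat fails
      f=f+1;break
    }else{                        # otherwise take the first free room from g[i] upwards
      o[min(which(!o[g[i]:N]))+g[i]-1]=1
    }}}
f/M                               # frequency of failed allocations
```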

returning an estimated probability of failure of approximately 0.764, hence a probability of about 0.236 that all goats end up in a room.

As I had some free time during the early mornings at ISBA 2022, I tried to reformulate the question as a continuous event on uniform order statistics, which turns out to be having at most one uniform larger than (N-1)/N, at most two larger than (N-2)/N, and so on… Asking the question on math.stackexchange quickly produced an answer that reverse-engineered my formulation back to the goats (or parking lot), with a generic probability of

$\dfrac{(N+1)^{N-1}}{N^N}$

which of course coincides with the Monte Carlo approximation!
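As a quick sanity check (a sketch only, with the function names `exact` and `unif_event` made up for the illustration), the closed form can be evaluated at N=10 and compared with a simulation of the order-statistics formulation. For N=10 the formula gives 11^9/10^10, about 0.236, whose complement, about 0.764, is the frequency of failures:

```r
# closed-form probability that all N goats find a room
exact=function(N) (N+1)^(N-1)/N^N
# order-statistics event: for each k, at most k uniforms exceed (N-k)/N
unif_event=function(N){
  u=runif(N)
  all(vapply(seq_len(N-1),function(k) sum(u>(N-k)/N)<=k,logical(1)))
}
set.seed(42)                      # arbitrary seed, for reproducibility only
N=10
p_hat=mean(replicate(1e5,unif_event(N)))
c(exact=exact(N),simulated=p_hat,failure=1-exact(N))
```

The uniform version is the same event as the goat experiment, since a goat choosing room ⌈NU⌉ prefers a room above N-k exactly when U exceeds (N-k)/N.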

As an aside, I once drank South-African wines named Goats-do-Roam and Goat-Roti at my friends Jim and Maria’s place, and they were quite enjoyable!

## a pseudo-marginal perspective on the ABC algorithm

Posted in Mountains, pictures, Statistics, University life on May 5, 2014 by xi'an

My friends Luke Bornn, Natesh Pillai and Dawn Woodard just arXived along with Aaron Smith a short note on the convergence properties of ABC, when compared with acceptance-rejection or regular MCMC. Unsurprisingly, ABC does worse in both cases. What is central to this note is that ABC can be (re)interpreted as a pseudo-marginal method where the data comparison step acts like an unbiased estimator of the true ABC target (not of the original ABC target, mind!). From there, it is mostly an application of Christophe Andrieu’s and Matti Vihola’s results in this setup. The authors also argue that using a single pseudo-data simulation per parameter value is the optimal strategy (as compared with using several), when considering asymptotic variance. This makes sense in terms of simulating in a larger dimensional space, but what of the cost of producing those pseudo-datasets against the cost of producing a new parameter? There are a few (rare) cases where the datasets are much cheaper to produce.
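To make the pseudo-marginal reading concrete, here is a minimal toy sketch, not taken from the note: everything in it (the Normal mean model, the N(0,10²) prior, the tolerance `eps`, the function name `abc_pm`) is an assumption chosen for illustration. The 0/1 outcome of the data comparison stands in for the ABC likelihood inside a Metropolis-Hastings ratio, with a single pseudo-dataset per proposed parameter:

```r
# Toy ABC-MCMC written as a pseudo-marginal sampler (illustrative assumptions throughout):
# data = mean of n N(theta,1) draws, prior theta ~ N(0,10^2), tolerance eps
abc_pm=function(yobs,n=20,eps=0.1,niter=5000,scale=0.5){
  theta=numeric(niter)
  theta[1]=yobs                        # start where the current estimate is taken to be 1
  for(t in 2:niter){
    prop=theta[t-1]+scale*rnorm(1)     # symmetric random-walk proposal
    z=mean(rnorm(n,mean=prop))         # a single pseudo-dataset per parameter value
    lhat=as.numeric(abs(z-yobs)<eps)   # 0/1 estimate, unbiased for the ABC likelihood
    # the estimate attached to an accepted state equals 1, so it drops from the ratio
    ratio=lhat*dnorm(prop,0,10)/dnorm(theta[t-1],0,10)
    theta[t]=if(runif(1)<ratio) prop else theta[t-1]
  }
  theta
}
set.seed(1)                            # arbitrary seed
chain=abc_pm(yobs=1.5)
mean(chain)
```

Replacing the single indicator with an average over several pseudo-datasets would give another unbiased estimator of the same quantity, which is the comparison the note addresses.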

## author rank

Posted in Statistics on October 11, 2012 by xi'an

Got the following email from Amazon:

Today we have added a new feature, Amazon Author Rank, the definitive list of best-selling authors on Amazon.com. This list makes it easy for readers to discover the best-selling authors on Amazon.com overall and within a selection of major genres. Your Amazon Author Rank is 44,881 in Print Books.

It is a new feature so, with a very limited past horizon, this rank seems to be moving wildly! (For instance, it is now 36,776, just a few hours later.) But so are the individual book sales. Hence a clear lack of smoothing in the indicator.

Another interesting feature of this Author Central facility is the display of US sales by district, not only because it shows that New York and San Francisco are the cities where I sell the most books (great!) but also because it uses the notion of “combined areas”, aggregating “the copies sold in these sparsely populated areas in order to obscure any single retailer’s sales”. A good display of data protection (even though the level of aggregation sounds too high to me, resulting in “combined areas” being the 3rd highest sale area, and including Gainesville, Florida and Ithaca, New York, the two latest locations of George Casella, in this combination!).

## Who’s #1?

Posted in Books, Kids, Statistics, University life on May 2, 2012 by xi'an

First, apologies for this teaser of a title! This post is not about who is #1 in whatever category you can think of, from statisticians to climbs [the Eiger Nordwand, to be sure!], to runners (Gebrselassie?), to books… (My daughter simply said “c’est moi!” when she saw the cover of this book on my desk.) So this is in fact a book review of… a book with this catchy title that I received a month or so ago!

“We decided to forgo purely statistical methodology, which is probably a disappointment to the hardcore statisticians.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 225)

This book may be one of the most boring ones I have had to review so far! The reason for this disgruntled introduction to “Who’s #1? The Science of Rating and Ranking” by Langville and Meyer is that it has very little, if anything, to do with statistics and modelling. (And also that it is mostly about American football, a sport I am not even remotely interested in.) The purpose of the book is to present ways of building ratings and rankings within a population, based on pairwise numerical connections between some members of this population. The methods abound, at least eight are covered by the book, but they all suffer from the same drawback that they are connected to no grand truth, to no parameter from an underlying probabilistic model, to no loss function that would measure the impact of a “wrong” rating. (The closest it comes to this is when discussing spread betting in Chapter 9.) It is thus a collection of transformation rules, from matrices to ratings. I find this all the more disappointing in that there exists a branch of statistics called ranking and selection that specializes in this kind of problem and that statistics in sports is quite an active branch of our profession, witness the numerous books by Jim Albert. (Not to mention Efron’s analysis of baseball data in the 70’s.)

“First suppose that in some absolutely perfect universe there is a perfect rating vector.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 117)

The style of the book is disconcerting at first, and then some, as it sounds written partly from Internet excerpts (at least for most of the pictures) and partly from local student dissertations… The mathematical level is highly varying, in that the authors take pains to define what a matrix is (page 33), only to jump to the Perron-Frobenius theorem a few pages later (page 36). It also mentions Laplace’s succession rule (only justified as a shrinkage towards the center, i.e. away from 0 and 1), the Sinkhorn-Knopp theorem, the traveling salesman problem, Arrow and Condorcet, relaxation and evolutionary optimization, and even Kendall’s and Spearman’s rank tests (Chapter 16), even though no statistical model is involved. (Nothing as terrible as the completely inappropriate use of Spearman’s rho coefficient in one of Belfiglio’s studies…)

“Since it is hard to say which ranking is better, our point here is simply that different methods can produce vastly different rankings.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 78)

I also find irritating the association of “science” with “rating”, because the techniques presented in this book are simply tricks to turn pairwise comparisons into a general ordering of a population, with nothing to do with uncovering ruling principles explaining the difference between the individuals. Since there is no validation for one ordering against another, we can see no rationality in proposing any of those, except to set a convention. The fascination of the authors for the Markov chain approach to the ranking problem is difficult to fathom as the underlying structure is not dynamical (there is no evolving ranking across games in this book) and the Markov transition matrix is just constructed to derive a stationary distribution, inducing a particular “Markov” ranking.
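For readers unfamiliar with the construction, a minimal sketch of such a “Markov” ranking could go as follows. The three teams, their results, and the voting rule (losers vote for the teams that beat them, an undefeated team votes uniformly) are assumptions made for the illustration and not necessarily the exact variant used in the book:

```r
# hypothetical results: W[i,j] = number of times team i beat team j
teams=c("A","B","C")
W=matrix(c(0,2,1,
           0,0,2,
           1,0,0),3,3,byrow=TRUE,dimnames=list(teams,teams))
V=t(W)                            # V[i,j] = times i lost to j: losers "vote" for winners
V[rowSums(V)==0,]=1               # an undefeated team votes uniformly (one convention among many)
P=V/rowSums(V)                    # row-stochastic transition matrix
e=eigen(t(P))
k=which.min(abs(e$values-1))      # the eigenvalue 1 gives the stationary distribution
p=Re(e$vectors[,k]); p=p/sum(p)
names(p)=teams
sort(p,decreasing=TRUE)
```

In this made-up example, teams A and C end up with the same stationary weight even though A won three games and C only one, which shows how much the resulting order depends on the convention chosen to build the matrix.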

“The Elo rating system is the epitome of simple elegance.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 64)

An interesting input of the book is its description of the Elo ranking system used in chess, of which I did not know anything apart from its existence. Once again, there is a high degree of arbitrariness in the construction of the ranking, whose sole goal is to provide a convention upon which most people agree. A convention, mind, not a representation of truth! (This chapter contains a section on the Social Network movie, where a character writes a logistic transform on a window, missing the exponent. This should remind Andrew of someone he often refers to in his blog!)
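For reference, the usual form of the Elo update fits in a few lines. The sketch below uses the standard chess version, with the 400-point logistic scale and a factor K=32 as conventional choices; the function name `elo_update` is made up for the illustration:

```r
# standard Elo update; K=32 and the 400-point scale are conventions, not derived from a model
elo_update=function(ra,rb,score_a,K=32){
  ea=1/(1+10^(-(ra-rb)/400))      # logistic expected score of player a
  c(a=ra+K*(score_a-ea),          # score_a is 1 for a win, 0.5 for a draw, 0 for a loss
    b=rb-K*(score_a-ea))          # player b receives the opposite adjustment
}
elo_update(1500,1500,1)           # equal ratings: the winner gains K/2 points
```

The total number of points is conserved by the update, and both K and the scale are tuning constants rather than estimated parameters.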

“Perhaps the largest lesson is not to put an undue amount of faith in anyone’s rating.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 125)

In conclusion, I see little point in suggesting reading this book, unless one is interested in matrix optimization problems and/or illustrations in American football… Or unless one wishes to write a statistics book on the topic!

## Posts of the year

Posted in Books, R, Statistics, University life on August 31, 2011 by xi'an

Like last year, here are the most popular posts since last August:

No major surprise in this ranking: R-related posts keep the upper part, partly thanks to being syndicated on R-bloggers, partly thanks to the tribunes contributed by Ross Ihaka and Julien Cornebise, even though I am surprised a rather low-key Le Monde puzzle made it to the list (maybe because it became part of my latest R exam?). Controversial book reviews are great traffic generators, even though the review of The foundations of Statistics: a simulation-based approach was posted less than a month ago. Finally, it is comforting to see two of our major research papers for the 2010-2011 period on the list: the Parallel processing of independent Metropolis-Hastings algorithms with Pierre and Murray, and the more controversial Lack of confidence in ABC model choice with Jean-Michel and Natesh (twice). The outlier in the list is undoubtedly Bayes on the Beach 2010 [2], which got undeserved traffic for pointing to Surfers Paradise, a highly popular entry! On the side of unscientific entries: Sanderson’s Mistborn and Larsson’s Millennium, with McCarthy’s Border trilogy missing the top list by three entries…