## a football post?!

Posted in Statistics on June 22, 2022 by xi'an

I am not interested in football, neither as a player (a primary-school trauma, when I was the last one picked!) nor as a fan, contrary to my dad (who was a football referee in his youth) and my kids. But Gareth Roberts (University of Warwick) and Jeff Rosenthal wrote a paper on football draws for the (FIFA) World Cup, infamously being played in Qatar at the end of the year, which Gareth presented in a Warwick seminar.

For this tournament, there are 32 teams, assigned to 8 groups of 4 teams by a draw that is supposedly uniform over all admissible assignments, subject to constraints like 1–2 European teams per group and 0–1 from each of the other regions. As currently done, and televised, the draw fills the groups one team at a time, drawing from Pot 1, then Pot 2, then Pot 3, then Pot 4. Applying the constraints one draw at a time, conditional on the past draws, rather obviously creates non-uniformity! Uniformity would be achievable by rejection sampling (with a success probability of 1/540!), but this is not televisesque enough…
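To make the rejection idea concrete, here is a toy sketch with made-up teams and a single simplified constraint (not the actual FIFA rules): repeatedly draw an unconstrained uniform assignment and keep it only if it satisfies the constraints, which guarantees uniformity over the valid assignments.

```python
import random

# Toy analogue of the draw (hypothetical data, not the FIFA rules):
# 8 teams split into 2 groups of 4, with at most 2 "UEFA" teams per group.
TEAMS = {
    "A": "UEFA", "B": "UEFA", "C": "UEFA", "D": "UEFA",
    "E": "CONMEBOL", "F": "CONMEBOL", "G": "AFC", "H": "CAF",
}

def valid(groups):
    # the single toy constraint: no more than 2 UEFA teams in a group
    return all(sum(TEAMS[t] == "UEFA" for t in g) <= 2 for g in groups)

def rejection_draw(rng):
    # draw uniformly over ALL assignments, accept only the valid ones:
    # accepted draws are then exactly uniform over the valid assignments
    while True:
        teams = list(TEAMS)
        rng.shuffle(teams)
        groups = [teams[:4], teams[4:]]
        if valid(groups):
            return groups

groups = rejection_draw(random.Random(2022))
```

In this toy case rejections are cheap; the point of the post is that with the real 32-team constraints the acceptance probability drops to 1/540, too wasteful for a live broadcast.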

A debiasing solution is found by using several balls for each team, in the right proportions, correcting for the sequential draws. Still impractical when it requires 10¹⁴ balls…!

The fun in their paper is that the problem can be formulated as a particle filter, estimating the right probabilities by randomising the number of balls [hidden randomness] and estimating the probability for team j to be included from a few thousand draws, with some stratified sampling on the side to minimise the randomness. Removing the need for the (intractable?) distribution is thus achieved by retrospective sampling, as in pseudo-marginal MCMC. Alternatively, one could swap pairs of teams via a simplistic MCMC algorithm, with no worry about stationarity and with the possibility of on-screen draws. (Jeff devised a Java applet to simulate an actual draw.) Obviously, it remains a long stretch that this proposal will be implemented for the next World Cup. If so, I will watch it!
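The pair-swapping alternative can be sketched on the same toy setup as above (again hypothetical teams and constraint, not the authors' actual algorithm): since the target is uniform over valid assignments and the swap proposal is symmetric, a Metropolis step reduces to accepting any swap that keeps the assignment valid.

```python
import random

# Same toy setup (hypothetical data, not the FIFA rules)
TEAMS = {
    "A": "UEFA", "B": "UEFA", "C": "UEFA", "D": "UEFA",
    "E": "CONMEBOL", "F": "CONMEBOL", "G": "AFC", "H": "CAF",
}

def valid(groups):
    return all(sum(TEAMS[t] == "UEFA" for t in g) <= 2 for g in groups)

def mcmc_swap_draw(n_steps, rng):
    # start from any valid assignment
    groups = [["A", "B", "E", "F"], ["C", "D", "G", "H"]]
    for _ in range(n_steps):
        i, j = rng.randrange(4), rng.randrange(4)
        proposal = [g[:] for g in groups]
        # propose swapping one team between the two groups
        proposal[0][i], proposal[1][j] = proposal[1][j], proposal[0][i]
        # uniform target + symmetric proposal: accept iff still valid
        if valid(proposal):
            groups = proposal
    return groups

groups = mcmc_swap_draw(1000, random.Random(0))
```

Every state of the chain is a valid assignment, so the swaps could in principle be shown on screen one at a time, which is the televisual appeal mentioned in the post.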

## the paper where you are a node

Posted in Books, Statistics, University life on February 5, 2019 by xi'an

Sophie Donnet pointed out to me this arXived paper by Tianxi Li, Elizaveta Levina, and Ji Zhu, on a network resampling strategy for X validation, where I appear as a datapoint rather than as a [direct] citation! Which reminded me of the “where you are the hero” gamebooks with which my kids briefly played, before computer games took over. The model selection method is illustrated on a dataset made of X citations [reduced to 706 authors] in all papers published between 2003 and 2012 in the Annals of Statistics, Biometrika, JASA, and JRSS Series B. The outcome is the determination of a number of communities, 20, which the authors labelled as they wished, based on the 10 authors with the largest number of citations in each category. As it happens, I appear in the list, within the “mixed (causality + theory + Bayesian)” category (!), along with Jamie Robins, Paul Fearnhead, Gilles Blanchard, Zhiqiang Tan, Stijn Vansteelandt, Nancy Reid, Jae Kwang Kim, Tyler VanderWeele, and Scott Sisson, which is somewhat mind-boggling in that I am pretty sure I never quoted six of these authors [although I find it hilarious that Jamie appears in the category, given that we almost got into a car crash together, at one of the Valencià meetings!].

## resampling methods

Posted in Books, pictures, Running, Statistics, Travel, University life on December 6, 2017 by xi'an

A paper that was arXived [and that I missed!] last summer is a work on resampling by Mathieu Gerber, Nicolas Chopin (CREST), and Nick Whiteley. Resampling is used to sample from a weighted empirical distribution and to correct for very small weights in a weighted sample that otherwise lead to degeneracy in sequential Monte Carlo (SMC). Since this step is based on random draws, it induces noise (while improving the estimation of the target), and reducing this noise is preferable; hence the appeal of replacing plain multinomial sampling with more advanced schemes. The initial motivation is sequential Monte Carlo, where resampling is rife and seemingly compulsory, but this also applies to importance sampling when considering several schemes at once. I remember discussing alternative schemes with Nicolas, who was then completing his PhD, as well as with Olivier Cappé, Randal Douc, and Eric Moulines, at the time (circa 2004) we were working on the Hidden Markov book. And getting then a somewhat vague idea as to why systematic resampling failed to converge.

In this paper, Mathieu, Nicolas and Nick show that stratified sampling (where a uniform is generated on every interval of length 1/n) enjoys some form of consistency, while systematic sampling (where the “same” uniform is generated on every interval of length 1/n) does not necessarily: there actually exist cases where convergence does not occur. However, a residual version of systematic sampling (where systematic sampling is applied to the residuals of the decimal parts of the n-enlarged weights) is itself consistent.
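The two schemes differ only in how the uniforms are tied together across the strata; a minimal sketch of both (my own illustrative code, not the paper's):

```python
import random
from itertools import accumulate

def _resample(weights, uniforms):
    # invert the weight cdf at the points (i + u_i)/n, i = 0..n-1
    n = len(weights)
    total = sum(weights)
    cdf = list(accumulate(w / total for w in weights))
    cdf[-1] = 1.0  # guard against floating-point shortfall
    out, j = [], 0
    for i, u in enumerate(uniforms):
        point = (i + u) / n
        while cdf[j] < point:
            j += 1
        out.append(j)
    return out

def stratified_resample(weights, rng):
    # one independent uniform per stratum [i/n, (i+1)/n)
    return _resample(weights, [rng.random() for _ in weights])

def systematic_resample(weights, rng):
    # a single uniform shared across all strata
    u = rng.random()
    return _resample(weights, [u] * len(weights))
```

Both return ancestor indices, with high-weight particles duplicated; the paper's point is that the single shared uniform of the systematic version, while reducing variance in practice, is what breaks consistency in general.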

The paper also studies the surprising feature uncovered by Kitagawa (1996) that stratified sampling applied to an ordered sample brings an error of O(1/n²) between the cdfs rather than the usual O(1/n). It took me a while to even understand the distinction between the original and the ordered versions (maybe because Nicolas used the empirical cdf during his SAD (Stochastic Algorithm Day!) talk, an ecdf that is the same for the ordered and initial samples). And both systematic and deterministic sampling become consistent in this case. The result was shown in dimension one by Kitagawa (1996) but extends to larger dimensions via the magical trick of the Hilbert curve.

## Monte Carlo simulation and resampling methods for social science [book review]

Posted in Books, Kids, R, Statistics, University life on October 6, 2014 by xi'an

Monte Carlo simulation and resampling methods for social science is a short paperback written by Thomas Carsey and Jeffrey Harden on the use of Monte Carlo simulation to evaluate the adequacy of a model and the impact of the assumptions behind this model. I picked it up in the library the other day and browsed through the chapters during one of my métro rides. Definitely not an in-depth reading, so be warned before reading the [telegraphic] review!

Overall, I think the book does a good job of advocating the use of simulation to evaluate the pros and cons of a given model (rephrased as a data generating process) when faced with data. And of doing it in R. After some rudiments of probability theory and of R programming, it briefly explains the use of resident random generators, if not how to handle new distributions, and then spends a large part of the book on simulation around generalised and regular linear models. For instance, in the linear model, the authors test the impact of heteroscedasticity, multicollinearity, measurement error, omitted variable(s), serial correlation, clustered data, and heavy-tailed errors. While this is a perfect way of exploring those semi-hidden hypotheses behind the linear model, I wonder at the impact of this exploration on students. On the one hand, they will perceive the importance of those assumptions and hopefully remember them. On the other hand, and this is a very recurrent criticism of mine, this requires a lot of maturity from the students, i.e., they have to distinguish the data, the model [maybe] behind the data, the finite if large number of hypotheses one can test, and the interpretation of the outcome of a simulation test… Given that they were introduced to basic probability just a few chapters before, this expectation [of the students] may prove unrealistic. (And a similar criticism applies to the following chapters, from GLMs to the jackknife and the bootstrap.)
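As an illustration of the kind of experiment the book runs (a minimal sketch with made-up numbers, and in Python where the book works in R): simulate a simple linear model whose error variance grows with the covariate, and look at the Monte Carlo distribution of the OLS slope.

```python
import random
import statistics

def simulate_slopes(n=100, reps=500, seed=2014):
    # Monte Carlo on the OLS slope under heteroscedastic errors:
    # y = 1 + 2x + eps, with sd(eps) growing linearly in x
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        x = [rng.uniform(0.0, 10.0) for _ in range(n)]
        y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5 * xi) for xi in x]
        xbar, ybar = statistics.mean(x), statistics.mean(y)
        sxx = sum((xi - xbar) ** 2 for xi in x)
        sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
        slopes.append(sxy / sxx)  # OLS slope estimate
    return slopes

slopes = simulate_slopes()
```

The slope estimates remain centred near the true value of 2 (OLS stays unbiased here); what the simulation exposes is their spread, which the usual homoscedastic standard errors misreport, and this is exactly the kind of semi-hidden assumption the students are meant to discover.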

At the end of the book, the authors ask how a reader could use the information in this book in one's own work, drafting a generic protocol for this reader, who is supposed to consider “alterations to the data generating process” (p.272) and to “identify a possible problem or assumption violation” (p.271). This requires a readership “who has some training in quantitative methods” (p.1). And then some more. But I definitely sympathise with the goal of confronting models and theory with the harsh reality of simulation output!

## Banff workshop [BIRS 12w5105 meeting [#2]]

Posted in Mountains, pictures, Statistics, Travel, University life on March 21, 2012 by xi'an

Today the program of 12w5105 was more on the theoretical side, with adaptive MCMC in the morning and ABC in the afternoon. Éric Moulines and Gersende Fort shared a talk on two papers, one on adaptive tempering and the other on equi-energy sampling; then Nando de Freitas spoke first about Gaussian process approximation for Bayesian optimisation, then about an adaptive Hamiltonian technique called Sardonics. And Jeff Rosenthal concluded the morning with a review of the results ensuring convergence for adaptive MCMC (with a delightful counter-example called Stairways to Heaven that reminded me of an ice climb in Utah!).

After my talk, where Scott Sisson made an interesting comment on the difficulty of extending our framework to a large collection of models (since the summary statistics then have to differ), François Perron discussed in highly interesting detail several approximation techniques for the Bayesian estimation of copulas, and Scott Sisson presented his recent arXiv paper where a rough estimate of the joint posterior is obtained by regression-adjustment ABC, and then estimates of each marginal posterior distribution are separately obtained in a lower-dimensional analysis, all of this being connected with Bayes linear analysis. (I do not completely get the way summary statistics are selected for each marginal there, which seems to be done by hand. While I understand why using a lower-dimensional statistic helps in improving the approximation of the marginal posteriors and fights the curse of dimensionality, the fact that the joint posterior sample is based on different summary statistics for the different components makes for an interesting statistical puzzle. Maybe the copula approach by François in the previous talk could be used at this final stage.) The final talk, by Zhiqiang Tan, on the comparative performances of resampling and subsampling strategies, generated a very animated discussion.
(All talks were recorded; mine is available as an mp4 video, but watch at your own peril!)