## a football post?!

Posted in Statistics with tags , , , , , , , , , , , , , , on June 22, 2022 by xi'an

I am not interested in football, neither as a player (a primary school trauma when I was the last being picked!) or as a fan, contrary to my dad (who was a football referee in his youth) and my kids, but Gareth Roberts (University of Warwick) and Jeff Rosenthal wrote a paper on football draws for the (FIFA) World Cup, infamously playing in Qatar by the end of the year, which Gareth presented in a Warwick seminar.

For this tournament, there are 32 teams, first playing against opponent teams supposedly drawn from a uniform distribution over all draw assignments, within 8 groups of 4 teams, with constraints like 1-2 EU teams per group, 0-1 from the other regions. As done at the moment and on TV, the tournament is filled one team at time by drawing from Pot 1, then Pot 2, then Pot 3, & Pot 4. &tc.. Applying the constraints one draw at a time, conditional on the past draws and the constraints, rather obviously creates non-uniformity! Uniformity would be achievable by rejection sampling (with a success probability of 1/540!) But this is not televisesque enough…

A debiasing solution is found by using several balls for each team in the right proportion, correcting for the sequential draws. Still impractical when requiring 10¹⁴ balls…!

The fun in their paper is that the problem can be formulated as a particle filter, estimating the right probabilities by randomising the number of balls [hidden randomness] and estimating the probability for team j to be included by a few thousands draws. With some stratified sampling on the side to minimise randomness. Removing the need for the (intractable?) distribution is thus achieved by retrospective sampling, as in pseudo-marginal MCMC. Alternatively, one could swap pairs of teams by a simplistic MCMC algorithm, with no worry about stationarity and the possibility of on-screen draws. (Jeff devised a Java applet to simulate an actual draw.) Obviously, it is still a far stretch that this proposal will be implemented for the next World Cup. If so, I will watch it!

## multilevel linear models, Gibbs samplers, and multigrid decompositions

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on October 22, 2021 by xi'an

A paper by Giacommo Zanella (formerly Warwick) and Gareth Roberts (Warwick) is about to appear in Bayesian Analysis and (still) open for discussion. It examines in great details the convergence properties of several Gibbs versions of the same hierarchical posterior for an ANOVA type linear model. Although this may sound like an old-timer opinion, I find it good to have Gibbs sampling back on track! And to have further attention to diagnose convergence! Also, even after all these years (!), it is always a surprise  for me to (re-)realise that different versions of Gibbs samplings may hugely differ in convergence properties.

At first, intuitively, I thought the options (1,0) (c) and (0,1) (d) should be similarly performing. But one is “more” hierarchical than the other. While the results exhibiting a theoretical ordering of these choices are impressive, I would suggest pursuing an random exploration of the various parameterisations in order to handle cases where an analytical ordering proves impossible. It would most likely produce a superior performance, as hinted at by Figure 4. (This alternative happens to be briefly mentioned in the Conclusion section.) The notion of choosing the optimal parameterisation at each step is indeed somewhat unrealistic in that the optimality zones exhibited in Figure 4 are unknown in a more general model than the Gaussian ANOVA model. Especially with a high number of parameters, parameterisations, and recombinations in the model (Section 7).

An idle question is about the extension to a more general hierarchical model where recentring is not feasible because of the non-linear nature of the parameters. Even though Gaussianity may not be such a restriction in that other exponential (if artificial) families keeping the ANOVA structure should work as well.

Theorem 1 is quite impressive and wide ranging. It also reminded (old) me of the interleaving properties and data augmentation versions of the early-day Gibbs. More to the point and to the current era, it offers more possibilities for coupling, parallelism, and increasing convergence. And for fighting dimension curses.

“in this context, imposing identifiability always improves the convergence properties of the Gibbs Sampler”

Another idle thought of mine is to wonder whether or not there is a limited number of reparameterisations. I think that by creating unidentifiable decompositions of (some) parameters, eg, μ=μ¹+μ²+.., one can unrestrictedly multiply the number of parameterisations. Instead of imposing hard identifiability constraints as in Section 4.2, my intuition was that this de-identification would increase the mixing behaviour but this somewhat clashes with the above (rigorous) statement from the authors. So I am proven wrong there!

Unless I missed something, I also wonder at different possible implementations of HMC depending on different parameterisations and whether or not the impact of parameterisation has been studied for HMC. (Which may be linked with Remark 2?)

## back to the Bernoulli factory

Posted in Books, Statistics, University life with tags , , , on April 7, 2020 by xi'an

“The results show that the proposed algorithm is asymptotically optimal for the mentioned subclass of functions, in the sense that for any other fast algorithm E[N] grows at least as fast with p.”

Murray Pollock (Warwick U. for a wee more days!) pointed out to me this paper of Luis Mendo on a Bernoulli factory algorithm that estimates functions [of p] that can be expressed as power series [of p]. Essentially functions f(p) such that f(0)=0 and f(1)=1. The big difference with earlier algorithms, as far as I can tell, is that the approach involves a randomised stopping rule that involves, on top of the unlimited sequence of Bernoulli B(p) variates a second sequence of Uniform variates, which sounds to me like a change of paradigm, given the much higher degree of freedom brought by Uniform variates (as opposed to Bernoulli variates with an unknown value of p). Although there exists a non-randomised version in the paper. The proposed algorithm is as follows, using a sequence of d’s issued from the power series coefficients:

1. Set i=1.
2. Take one input X[i].
3. Produce U[i] uniform on (0,1). Let V[i]=1 if U[i]<d[i] and V[i]=0 otherwise.
If V[i] or X[i] are equal to 1, output X[i] and finish.
Else increase i and go back to step2.

As the author mentions, this happens to be a particular case of the reverse-time martingale approach of Łatuszynski, Kosmidis, Papaspiliopoulos and Roberts (Warwick connection as well!). With an average number of steps equal to f(p)/p, surprisingly simple, and somewhat of an optimal rate. While the functions f(p) are somewhat restricted, this is nice work

## double descent

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on November 7, 2019 by xi'an

Last Friday, I [and a few hundred others!] went to the SMILE (Statistical Machine Learning in Paris) seminar where Francis Bach was giving a talk. (With a pleasant ride from Dauphine along the Seine river.) Fancis was talking about the double descent phenomenon observed in recent papers by Belkin & al. (2018, 2019), and Mei & Montanari (2019). (As the seminar room at INRIA was quite crowded and as I was sitting X-legged on the floor close to the screen, I took a few slides from below!) The phenomenon is that the usual U curve warning about over-fitting and reproduced in most statistics and machine-learning courses can under the right circumstances be followed by a second decrease in the testing error when the number of features goes beyond the number of observations. This is rather puzzling and counter-intuitive, so I briefkly checked the 2019 [8 pages] article by Belkin & al., who are studying two examples, including a standard “large p small n” Gaussian regression. where the authors state that

“However, as p grows beyond n, the test risk again decreases, provided that the model is fit using a suitable inductive bias (e.g., least norm solution). “

One explanation [I found after checking the paper] is that the variates (features) in the regression are selected at random rather than in an optimal sequential order. Double descent is missing with interpolating and deterministic estimators. Hence requiring on principle all candidate variates to be included to achieve minimal averaged error. The infinite spike is when the number p of variate is near the number n of observations. (The expectation accounts as well for the randomisation in T. Randomisation that remains an unclear feature in this framework…)

## O’Bayes in action

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , , on November 7, 2017 by xi'an

My next-door colleague [at Dauphine] François Simenhaus shared a paradox [to be developed in an incoming test!] with Julien Stoehr and I last week, namely that, when selecting the largest number between a [observed] and b [unobserved], drawing a random boundary on a [meaning that a is chosen iff a is larger than this boundary] increases the probability to pick the largest number above ½2…

When thinking about it in the wretched RER train [train that got immobilised for at least two hours just a few minutes after I went through!, good luck to the passengers travelling to the airport…] to De Gaulle airport, I lost the argument: if a<b, the probability [for this random bound] to be larger than a and hence for selecting b is 1-Φ(a), while, if a>b, the probability [of winning] is Φ(a). Hence the only case when the probability is ½ is when a is the median of this random variable. But, when discussing the issue further with Julien, I exposed an interesting non-informative prior characterisation. Namely, if I assume a,b to be iid U(0,M) and set an improper prior 1/M on M, the conditional probability that b>a given a is ½. Furthermore, the posterior probability to pick the right [largest] number with François’s randomised rule is also ½, no matter what the distribution of the random boundary is. Now, the most surprising feature of this coffee room derivation is that these properties only hold for the prior 1/M. Any other power of M will induce an asymmetry between a and b. (The same properties hold when a,b are iid Exp(M).)  Of course, this is not absolutely unexpected since 1/M is the invariant prior and since the “intuitive” symmetry only holds under this prior. Power to O’Bayes!

When discussing again the matter with François yesterday, I realised I had changed his wording of the puzzle. The original setting is one with two cards hiding the unknown numbers a and b and of a player picking one of the cards. If the player picks a card at random, there is indeed a probability of ½ of picking the largest number. If the decision to switch or not depends on an independent random draw being larger or smaller than the number on the observed card, the probability to get max(a,b) in the end hits 1 when this random draw falls into (a,b) and remains ½ outside (a,b). Randomisation pays.