## a football post?!

Posted in Statistics with tags , , , , , , , , , , , , , , on June 22, 2022 by xi'an

I am not interested in football, neither as a player (a primary school trauma when I was the last being picked!) or as a fan, contrary to my dad (who was a football referee in his youth) and my kids, but Gareth Roberts (University of Warwick) and Jeff Rosenthal wrote a paper on football draws for the (FIFA) World Cup, infamously playing in Qatar by the end of the year, which Gareth presented in a Warwick seminar.

For this tournament, there are 32 teams, first playing against opponent teams supposedly drawn from a uniform distribution over all draw assignments, within 8 groups of 4 teams, with constraints like 1-2 EU teams per group, 0-1 from the other regions. As done at the moment and on TV, the tournament is filled one team at time by drawing from Pot 1, then Pot 2, then Pot 3, & Pot 4. &tc.. Applying the constraints one draw at a time, conditional on the past draws and the constraints, rather obviously creates non-uniformity! Uniformity would be achievable by rejection sampling (with a success probability of 1/540!) But this is not televisesque enough…

A debiasing solution is found by using several balls for each team in the right proportion, correcting for the sequential draws. Still impractical when requiring 10¹⁴ balls…!

The fun in their paper is that the problem can be formulated as a particle filter, estimating the right probabilities by randomising the number of balls [hidden randomness] and estimating the probability for team j to be included by a few thousands draws. With some stratified sampling on the side to minimise randomness. Removing the need for the (intractable?) distribution is thus achieved by retrospective sampling, as in pseudo-marginal MCMC. Alternatively, one could swap pairs of teams by a simplistic MCMC algorithm, with no worry about stationarity and the possibility of on-screen draws. (Jeff devised a Java applet to simulate an actual draw.) Obviously, it is still a far stretch that this proposal will be implemented for the next World Cup. If so, I will watch it!

## averaged acceptance ratios

Posted in Statistics with tags , , , , , , , , , , , , , on January 15, 2021 by xi'an

In another recent arXival, Christophe Andrieu, Sinan Yıldırım, Arnaud Doucet, and Nicolas Chopin study the impact of averaging estimators of acceptance ratios in Metropolis-Hastings algorithms. (It is connected with the earlier arXival rephrasing Metropolis-Hastings in terms of involutions discussed here.)

“… it is possible to improve performance of this algorithm by using a modification where the acceptance ratio r(ξ) is integrated with respect to a subset of the proposed variables.”

This interpretation of the current proposal makes it a form of Rao-Blackwellisation, explicitly mentioned on p.18, where, using a mixture proposal, with an adapted acceptance probability, it depends on the integrated acceptance ratio only. Somewhat magically using this ratio and its inverse with probability ½. And it increases the average Metropolis-Hastings acceptance probability (albeit with a larger number of simulations). Since the ideal averaging is rarely available, the authors implement a Monte Carlo averaging version. With applications to the exchange algorithm and to reversible jump MCMC. The major application is to pseudo-marginal settings with a high complexity (in the number T of terms) and where the authors’ approach does scale efficiently with T. There is even an ABC side to the story as one illustration is made of the ABC approximation to the posterior of an α-stable sample. As an encompassing proposal for handling Metropolis-Hastings environments with latent variables and several versions of the acceptance ratios, this is quite an interesting paper that I think we will study in further detail with our students.

## general perspective on the Metropolis–Hastings kernel

Posted in Books, Statistics with tags , , , , , , , , , , , , , on January 14, 2021 by xi'an

[My Bristol friends and co-authors] Christophe Andrieu, and Anthony Lee, along with Sam Livingstone arXived a massive paper on 01 January on the Metropolis-Hastings kernel.

“Our aim is to develop a framework making establishing correctness of complex Markov chain Monte Carlo kernels a purely mechanical or algebraic exercise, while making communication of ideas simpler and unambiguous by allowing a stronger focus on essential features (…) This framework can also be used to validate kernels that do not satisfy detailed balance, i.e. which are not reversible, but a modified version thereof.”

A central notion in this highly general framework is, extending Tierney (1998), to see an MCMC kernel as a triplet involving a probability measure μ (on an extended space), an involution transform φ generalising the proposal step (i.e. þ²=id), and an associated acceptance probability ð. Then μ-reversibility occurs for

$\eth(\xi)\mu(\text{d}\xi)= \eth(\phi(\xi))\mu^{\phi}(\text{d}\xi)$

with the rhs involving the push-forward measure induced by μ and φ. And furthermore there is always a choice of an acceptance probability ð ensuring for this equality to happen. Interestingly, the new framework allows for mostly seamless handling of more complex versions of MCMC such as reversible jump and parallel tempering. But also non-reversible kernels, incl. for instance delayed rejection. And HMC, incl. NUTS. And pseudo-marginal, multiple-try, PDMPs, &c., &c. it is remarkable to see such a general theory emerging a this (late?) stage of the evolution of the field (and I will need more time and attention to understand its consequences).

## unbiased estimator of determinant

Posted in Books, Statistics with tags , , , , on July 3, 2020 by xi'an

Just saw this new [one page] posting on arXiv, meaning an unbiased estimate of the determinant can be derived much faster. If less reliably. This trick can be helpful for (pseudo-marginal) MCMC steps when the determinant itself is of limited interest… (The importance version is not truly needed!)

## is there such a thing as optimal subsampling?

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on June 12, 2020 by xi'an

This idea of optimal thinnin and burnin has been around since the early days of the MCMC revolution and did not come up with a definite answer. For instance, from a pure estimation perspective, subsampling always increases the variance of the resulting estimator. My personal approach is to ignore both burnin and thinnin and rather waste time on running several copies of the code to check for potential discrepancies and get a crude notion of the variability. And to refuse to answer to questions like is 5000 iterations long enough for burnin?

A recent arXival by Riabiz et al. readdresses the issue. In particular concerning this notion that the variance of the subsampled version is higher: this only applies to a deterministic subsampling, as opposed to an MCMC-based subsampling (although this intricacy only makes the problem harder!). I however fail to understand the argument in favour of subsampling based on storage issues (p.4), as a dynamic storage of the running mean for all quantities of interest does not cost anything if the integrand is not particularly demanding. I also disagree at the pessimistic view that the asymptotic variance of the MCMC estimate is hard to estimate: papers by Flegal, Hobert, Jones, Vat and others have rather clearly shown how batch means can produce converging estimates of this asymptotic variance.

“We do not to attempt to solve a continuous optimisation problem for selection of the next point [in the sample]. Such optimisation problems are fundamentally difficult and can at best be approximately solved. Instead, we exactly solve the discrete optimisation problem of selecting a suitable element from a supplied MCMC output.”

One definitely positive aspect of the paper is that the (thinning) method is called Stein thinning, in connection with Stein’s discrepancy, and this honours Charles Stein. The method looks at the optimal subsample, with optimality defined in terms of minimising Stein’s discrepancy from the true target over a reproducible kernel Hilbert space. And then over a subsample to minimise the distance from the empirical distribution to the theoretical distribution. The kernel (11) is based on the gradient of the target log density and the solution is determined by greedy algorithms that determine which next entry to add to the empirical distribution. Which is of complexity O(nm2) if the subsample is of size m. Some entries may appear more than once and the burnin step could be automatically included as (relatively) unlikely values are never selected (at least this was my heuristic understanding). While the theoretical backup for the construct is present and backed by earlier papers of some of the authors, I do wonder at the use of the most rudimentary representation of an approximation to the target when smoother versions could have been chosen and optimised on the same ground. And I am also surprised at the dependence of both estimators and discrepancies on the choice of the (sort-of) covariance matrix in the inner kernel, as the ODE examples provided in the paper (see, e.g., Figure 7). (As an aside and at a shallow level, the approach also reminded me of the principal points of my late friend Bernhard Flury…) Storage of all MCMC simulations for a later post-processing is of course costly in terms of storage, at O(nm). Unless a “secretary problem” approach can be proposed to get sequential. Another possible alternate would be to consider directly the chain of the accepted values (à la vanilla Rao-Blackwellisation). Overall, since the stopping criterion is based on a fixed sample size, and hence depends on the sub-efficiency of evaluating the mass of different modes, I am unsure the method is anything but what-you-get-is-what-you-see, i.e. prone to get misled by a poor exploration of the complete support of the target.

“This paper focuses on nonuniform subsampling and shows that it is more efficiency than uniform subsampling.”

Two weeks later, Guanyu Hu and Hai Ying Wang arXived their Most Likely Optimal Subsampled Markov Chain Monte Carlo, in what I first thought as an answer to the above! But both actually have little in common as this second paper considers subsampling on the data, rather than the MCMC output, towards producing scalable algorithms. Building upon Bardenet et al. (2014) and Korattikara et al. (2014).  Replacing thus the log-likelihood with a random sub-sampled version and deriving the sample size based on a large deviation inequality. By a Cauchy-Schwartz inequality, the authors find sampling probabilities proportional to the individual log-likelihooods. Which depend on the running value of the MCMC’ed parameters. And thus replaced with the values at a fixed parameter, with cost O(n) but only once, but no so much optimal. (The large deviation inequality therein is only concerned with an approximation to the log-likelihood, without examining the long term impact on the convergence of the approximate Markov chain as this is no longer pseudo-marginal MCMC. For instance, both current and prospective log-likelihoods are re-estimated at each iteration. The paper compares with uniform sampling on toy examples,  to demonstrate a smaller estimation error for the statistical problem, rather than convergence to the true posterior.)