## assessing MCMC convergence

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on June 6, 2019 by xi'an

When MCMC became mainstream in the 1990’s, there was a flurry of proposals to check, assess, and even guarantee convergence to the stationary distribution, as discussed in our MCMC book. Along with Chantal Guihenneuc and Kerrie Mengersen, we also maintained for a while a reviewww webpage categorising theses. Niloy Biswas and Pierre Jacob have recently posted a paper where they propose the use of couplings (and unbiased MCMC) towards deriving bounds on different metrics between the target and the current distribution of the Markov chain. Two chains are created from a given kernel and coupled with a lag of L, meaning that after a while, the two chains become one with a time difference of L. (The supplementary material contains many details on how to induce coupling.) The distance to the target can then be bounded by a sum of distances between the two chains until they merge. The above picture from the paper is a comparison a Polya-Urn sampler with several HMC samplers for a logistic target (not involving the Pima Indian dataset!). The larger the lag L the more accurate the bound. But the larger the lag the more expensive the assessment of how many steps are needed to convergence. Especially when considering that the evaluation requires restarting the chains from scratch and rerunning until they couple again, rather than continuing one run which can only brings the chain closer to stationarity and to being distributed from the target. I thus wonder at the possibility of some Rao-Blackwellisation of the simulations used in this assessment (while realising once more than assessing convergence almost inevitably requires another order of magnitude than convergence itself!). Without a clear idea of how to do it… For instance, keeping the values of the chain(s) at the time of coupling is not directly helpful to create a sample from the target since they are not distributed from that target.

[Pierre also wrote a blog post about the paper on Statisfaction that is definitely much clearer and pedagogical than the above.]

## Le Monde puzzle [#1088]

Posted in Books, Kids, R with tags , , , , , , , , on March 29, 2019 by xi'an

A board (Ising!) Le Monde mathematical puzzle in the optimisation mode, again:

On a 7×7 board, what is the maximal number of locations that one can occupy when imposing at least two empty neighbours ?

Which I tried to solve by brute force and simulated annealing (what else?!), first defining a target

targ=function(tabz){
sum(tabz[-c(1,9),-c(1,9)]-1.2*(tabz[-c(1,9),-c(1,9)]*tabz[-c(8,9),-c(1,9)]
+tabz[-c(1,9),-c(1,9)]*tabz[-c(1,2),-c(1,9)]
+tabz[-c(1,9),-c(1,9)]*tabz[-c(1,9),-c(8,9)]
+tabz[-c(1,9),-c(1,9)]*tabz[-c(1,9),-c(1,2)]>2))}


on a 9×9 board where I penalise prohibited configuration by a factor 1.2 (a wee bit more than empty nodes). The perimeter of the 9×9 board is filled with ones and never actualised. (In the above convoluted products, the goal is to count how many neighbours of the entries equal to one are also equal to one. More than 2 is penalised.) The simulated annealing move is then updating the 9×9 grid gridz:

temp=1
maxarg=curarg=targ(gridz)
for (t in 1:1e3){
for (v in 1:1e4){
i=sample(2:8,1);j=sample(2:8,1)
newgrid=gridz;newgrid[i,j]=1-gridz[i,j]
newarg=targ(newgrid)
if (log(runif(1))<temp*(newarg-curarg)){
gridz=newgrid;curarg=newarg}}
temp=temp+.01}


and calls to the procedure always return 28 entries as the optimum, as in

     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    0    1    0    1    0    1
[2,]    0    1    1    0    1    1    0
[3,]    1    1    0    1    0    1    1
[4,]    0    0    1    0    1    0    0
[5,]    1    1    0    1    0    1    1
[6,]    0    1    1    0    1    1    0
[7,]    1    0    1    0    1    0    1


As it happens, I had misread the wording of the original puzzle, which considered a dynamic placement of the units on the board, one at a time with two free neighbours imposed.

## Le Monde puzzle [#1087]

Posted in Books, Kids, R, Statistics with tags , , , , , on February 25, 2019 by xi'an

A board-like Le Monde mathematical puzzle in the digit category:

Given a (k,m) binary matrix, what is the maximum number S of entries with only one neighbour equal to one? Solve for k=m=2,…,13, and k=6,m=8.

For instance, for k=m=2, the matrix

$\begin{matrix} 0 &0\\ 1 &1\\ \end{matrix}$

is producing the maximal number 4. I first attempted a brute force random filling of these matrices with only a few steps of explorations and got the numbers 4,8,16,34,44,57,… for the first cases. Since I was convinced that the square k² of a number k previously exhibited to reach its maximum as S=k² was again perfect in this regard, I then tried another approach based on Gibbs sampling and annealing (what else?):

gibzit=function(k,m,A=1e2,N=1e2){
temp=1 #temperature
board=sole=matrix(sample(c(0,1),(k+2)*(m+2),rep=TRUE),k+2,m+2)
board[1,]=board[k+2,]=board[,1]=board[,m+2]=0 #boundaries
maxol=counter(board,k,m) #how many one-neighbours?
for (t in 1:A){#annealing
for (r in 1:N){#basic gibbs steps
for (i in 2:(k+1))
for (j in 2:(m+1)){
prop=board
prop[i,j]=1-board[i,j]
u=runif(1)
if (log(u/(1-u))<temp*(counter(prop,k,m)-val)){
board=prop;val=counter(prop,k,m)
if (val>maxol){
maxol=val;sole=board}}
}}
temp=temp*2}
return(sole[-c(1,k+2),-c(1,m+2)])}


which leads systematically to the optimal solution, namely a perfect square k² when k is even and a perfect but one k²-1 when k is odd. When k=6, m=8, all entries can afford one neighbour exactly,

> gibzbbgiz(6,8)
[1] 48
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    1    0    0    1    1    0    0    1
[2,]    1    0    0    0    0    0    0    1
[3,]    0    0    1    0    0    1    0    0
[4,]    0    0    1    0    0    1    0    0
[5,]    1    0    0    0    0    0    0    1
[6,]    1    0    0    1    1    0    0    1


but this does not seem feasible when k=6, m=7, which only achieves 40 entries with one single neighbour.

## Bayesian inference with intractable normalizing functions

Posted in Books, Statistics with tags , , , , , , , , , , , on December 13, 2018 by xi'an

In the latest September issue of JASA I received a few days ago, I spotted a review paper by Jaewoo Park & Murali Haran on intractable normalising constants Z(θ). There have been many proposals for solving this problem as well as several surveys, some conferences and even a book. The current survey focus on MCMC solutions, from auxiliary variable approaches to likelihood approximation algorithms (albeit without ABC entries, even though the 2006 auxiliary variable solutions of Møller et al. et of Murray et al. do simulate pseudo-observations and hence…). This includes the MCMC approximations to auxiliary sampling proposed by Faming Liang and co-authors across several papers. And the paper Yves Atchadé, Nicolas Lartillot and I wrote ten years ago on an adaptive MCMC targeting Z(θ) and using stochastic approximation à la Wang-Landau. Park & Haran stress the relevance of using sufficient statistics in this approach towards fighting computational costs, which makes me wonder if an ABC version could be envisioned.  The paper also includes pseudo-marginal techniques like Russian Roulette (once spelled Roullette) and noisy MCMC as proposed in Alquier et al.  (2016). These methods are compared on three examples: (1) the Ising model, (2) a social network model, the Florentine business dataset used in our original paper, and a larger one where most methods prove too costly, and (3) an attraction-repulsion point process model. In conclusion, an interesting survey, taking care to spell out the calibration requirements and the theoretical validation, if of course depending on the chosen benchmarks.

## Bayesian goodness of fit

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , on April 10, 2018 by xi'an

Persi Diaconis and Guanyang Wang have just arXived an interesting reflection on the notion of Bayesian goodness of fit tests. Which is a notion that has always bothered me, in a rather positive sense (!), as

“I also have to confess at the outset to the zeal of a convert, a born again believer in stochastic methods. Last week, Dave Wright reminded me of the advice I had given a graduate student during my algebraic geometry days in the 70’s :`Good Grief, don’t waste your time studying statistics. It’s all cookbook nonsense.’ I take it back! …” David Mumford

The paper starts with a reference to David Mumford, whose paper with Wu and Zhou on exponential “maximum entropy” synthetic distributions is at the source (?) of this paper, and whose name appears in its very title: “A conversation for David Mumford”…, about his conversion from pure (algebraic) maths to applied maths. The issue of (Bayesian) goodness of fit is addressed, with card shuffling examples, the null hypothesis being that the permutation resulting from the shuffling is uniformly distributed if shuffling takes enough time. Interestingly, while the parameter space is compact as a distribution on a finite set, Lindley’s paradox still occurs, namely that the null (the permutation comes from a Uniform) is always accepted provided there is no repetition under a “flat prior”, which is the Dirichlet D(1,…,1) over all permutations. (In this finite setting an improper prior is definitely improper as it does not get proper after accounting for observations. Although I do not understand why the Jeffreys prior is not the Dirichlet(½,…,½) in this case…) When resorting to the exponential family of distributions entertained by Zhou, Wu and Mumford, including the uniform distribution as one of its members, Diaconis and Wang advocate the use of a conjugate prior (exponential family, right?!) to compute a Bayes factor that simplifies into a ratio of two intractable normalising constants. For which the authors suggest using importance sampling, thermodynamic integration, or the exchange algorithm. Except that they rely on the (dreaded) harmonic mean estimator for computing the Bayes factor in the following illustrative section! Due to the finite nature of the space, I presume this estimator still has a finite variance. (Remark 1 calls for convergence results on exchange algorithms, which can be found I think in the just as recent arXival by Christophe Andrieu and co-authors.) An interesting if rare feature of the example processed in the paper is that the sufficient statistic used for the permutation model can be directly simulated from a Multinomial distribution. This is rare as seen when considering the benchmark of Ising models, for which the summary and sufficient statistic cannot be directly simulated. (If only…!) In fine, while I enjoyed the paper a lot, I remain uncertain as to its bearings, since defining an objective alternative for the goodness-of-fit test becomes quickly challenging outside simple enough models.

## probabilities larger than one…

Posted in Statistics with tags , , , , , , on November 9, 2017 by xi'an

## oxwasp@amazon.de

Posted in Books, Kids, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , on April 12, 2017 by xi'an

The reason for my short visit to Berlin last week was an OxWaSP (Oxford and Warwick Statistics Program) workshop hosted by Amazon Berlin with talks between statistics and machine learning, plus posters from our second year students. While the workshop was quite intense, I enjoyed very much the atmosphere and the variety of talks there. (Just sorry that I left too early to enjoy the social programme at a local brewery, Brauhaus Lemke, and the natural history museum. But still managed nice runs east and west!) One thing I found most interesting (if obvious in retrospect) was the different focus of academic and production talks, where the later do not aim at a full generality or at a guaranteed improvement over the existing, provided the new methodology provides a gain in efficiency over the existing.

This connected nicely with me reading several Nature articles on quantum computing during that trip,  where researchers from Google predict commercial products appearing in the coming five years, even though the technology is far from perfect and the outcome qubit error prone. Among the examples they provided, quantum simulation (not meaning what I consider to be simulation!), quantum optimisation (as a way to overcome multimodality), and quantum sampling (targeting given probability distributions). I find the inclusion of the latest puzzling in that simulation (in that sense) shows very little tolerance for errors, especially systematic bias. It may be that specific quantum architectures can be designed for specific probability distributions, just like some are already conceived for optimisation. (It may even be the case that quantum solutions are (just next to) available for intractable constants as in Ising or Potts models!)