## winning entry at MCqMC’16

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , on August 29, 2016 by xi'an

The nice logo of MCqMC 2016 was a collection of eight series of QMC dots on the unit (?) cube. The organisers set a competition to identify the principles behind those quasi-random sets and as I had no idea for most of them I entered very random sets unconnected with algorithmia, for which I got an honourable mention and a CD prize (if not the conference staff tee-shirt I was coveting!) Art Owen sent me back my entry, posted below and hopefully (or not!) readable.

## Posterior predictive p-values and the convex order

Posted in Books, Statistics, University life with tags , , , , , , , , , on December 22, 2014 by xi'an

Patrick Rubin-Delanchy and Daniel Lawson [of Warhammer fame!] recently arXived a paper we had discussed with Patrick when he visited Andrew and I last summer in Paris. The topic is the evaluation of the posterior predictive probability of a larger discrepancy between data and model

$\mathbb{P}\left( f(X|\theta)\ge f(x^\text{obs}|\theta) \,|\,x^\text{obs} \right)$

which acts like a Bayesian p-value of sorts. I discussed several times the reservations I have about this notion on this blog… Including running one experiment on the uniformity of the ppp while in Duke last year. One item of those reservations being that it evaluates the posterior probability of an event that does not exist a priori. Which is somewhat connected to the issue of using the data “twice”.

“A posterior predictive p-value has a transparent Bayesian interpretation.”

Another item that was suggested [to me] in the current paper is the difficulty in defining the posterior predictive (pp), for instance by including latent variables

$\mathbb{P}\left( f(X,Z|\theta)\ge f(x^\text{obs},Z^\text{obs}|\theta) \,|\,x^\text{obs} \right)\,,$

which reminds me of the multiple possible avatars of the BIC criterion. The question addressed by Rubin-Delanchy and Lawson is how far from the uniform distribution stands this pp when the model is correct. The main result of their paper is that any sub-uniform distribution can be expressed as a particular posterior predictive. The authors also exhibit the distribution that achieves the bound produced by Xiao-Li Meng, Namely that

$\mathbb{P}(P\le \alpha) \le 2\alpha$

where P is the above (top) probability. (Hence it is uniform up to a factor 2!) Obviously, the proximity with the upper bound only occurs in a limited number of cases that do not validate the overall use of the ppp. But this is certainly a nice piece of theoretical work.

## Random sudokus [p-values]

Posted in R, Statistics with tags , , , , , , , on May 21, 2010 by xi'an

I reran the program checking the distribution of the digits over 9 “diagonals” (obtained by acceptable permutations of rows and column) and this test again results in mostly small p-values. Over a million iterations, and the nine (dependent) diagonals, four p-values were below 0.01, three were below 0.1, and two were above (0.21 and 0.42). So I conclude in a discrepancy between my (full) sudoku generator and the hypothesised distribution of the (number of different) digits over the diagonal. Assuming my generator is a faithful reproduction of the one used in the paper by Newton and DeSalvo, this discrepancy suggests that their distribution over the sudoku grids do not agree with this diagonal distribution, either because it is actually different from uniform or, more likely, because the uniform distribution I use over the (groups of three over the) diagonal is not compatible with a uniform distribution over all sudokus…

## Random [uniform?] sudokus [corrected]

Posted in R, Statistics with tags , , , , , on May 19, 2010 by xi'an

As the discrepancy [from 1] in the sum of the nine probabilities seemed too blatant to be attributed to numerical error given the problem scale, I went and checked my R code for the probabilities and found a choose(9,3) instead of a choose(6,3) in the last line… The fit between the true distribution and the observed frequencies is now much better

but the chi-square test remains suspicious of the uniform assumption (or again of my programming abilities):

> chisq.test(obs,p=pdiag)
Chi-squared test for given probabilities
data:  obs
X-squared = 16.378, df = 6, p-value = 0.01186

since a p-value of 1% is a bit in the far tail of the distribution.

## Random [uniform?] sudokus

Posted in R, Statistics with tags , , , , , , on May 19, 2010 by xi'an

A longer run of the R code of yesterday with a million sudokus produced the following qqplot.

It does look ok but no perfect. Actually, it looks very much like the graph of yesterday, although based on a 100-fold increase in the number of simulations. Now, if I test the adequation with a basic chi-square test (!), the result is highly negative:

> chisq.test(obs,p=pdiag/sum(pdiag)) #numerical error in pdiag
Chi-squared test for given probabilities
data:  obs
X-squared = 6978.503, df = 6, p-value < 2.2e-16

(there are seven entries for both obs and pdiag, hence the six degrees of freedom). So this casts a doubt upon the uniformity of the random generator suggested in the paper by Newton and DeSalvo or rather on my programming abilities, see next post!

## Random sudokus [test]

Posted in R, Statistics with tags , , , , , , on May 18, 2010 by xi'an

Robin Ryder pointed out to me that 3 is indeed the absolute minimum one could observe because of the block constraint (bon sang, mais c’est bien sûr !). The distribution of the series of 3 digits being independent over blocks, the theoretical distribution under uniformity can easily be simulated:

#uniform distribution on the block diagonal
sheik=rep(0,9)
for (t in 1:10^6){
group=length(unique(c(sample(1:9,3),sample(1:9,3),sample(1:9,3))))
sheik[group]=sheik[group]+1
}

and it produces a result that is close enough to the one observed with the random sudoku generator. Actually, the exact distribution is available as (corrected on May 19!)

pdiag=c(1, #k=3
(3*6+3*6*4), #k=4
(3*choose(6,2)+3*6*5*choose(4,2)+3*choose(5,3)*choose(6,2)), #k=5
(choose(6,3)+3*6*choose(5,2)*4+3*choose(6,2)*choose(5,2)*4+
choose(6,3)*choose(6,3)),#k=6
(choose(3,2)*6*choose(5,3)+3*choose(6,2)*choose(4,2)*5+
choose(6,3)*choose(6,2)*3), #k=7
(3*choose(6,2)*4+choose(6,3)*6*choose(3,2)), #k=8
choose(6,3))/choose(9,3)^2 #k=9
choose(9,6))/choose(9,3)^2 #k=9

hence a better qq-plot:

## Random sudokus

Posted in R, Statistics with tags , , , , , on May 17, 2010 by xi'an

After thinking about random sudokus for a few more weeks, I eventually came to read the paper by Newton and DeSalvo about the entropy of sudoku matrices. As written earlier, if we consider (as Newton and DeSakvo) a uniform distribution where the sudokus are drawn uniformly over the set of all sudokus, the entropy of this distribution is $\log(N)$, where N is the size of this set of all sudokus. The (Shannon) entropy considered in the paper is completely different: it is

$H = -\sum_{i=1}^9 \hat \sigma_i \log(\hat \sigma_i)$

where the $\hat \sigma_i$‘s are the normalised transforms of the singular values of the sudoku matrix. The result that the average H is about 1.733, being larger than the same average over a collection of purely random matrices, 1.651, is then much less spectacular…

Another entry in the same paper is worth pondering about: the authors consider a random selection mechanism, working one entry at a time from the first row downwards, with a backward step in case of impossible entries. An R rendering of this generator is

s=matrix(0,ncol=9,nrow=9)
pool=rep(TRUE,9)
s[1,]=sample(1:9,9)
i=2;
while (i<10){
boxa=3*trunc((i-1)/3)+1;boxa=boxa:(boxa+2)
for (j in 1:9){
boxb=3*trunc((j-1)/3)+1;boxb=boxb:(boxb+2)
for (u in (1:9))
pool[u]=(sum(u==s[i,])+sum(u==s[,j]) +sum(u==s[boxa,boxb]))==0
if (sum(pool)>1){
s[i,j]=sample((1:9)[pool],1)
}else{
if (sum(pool)==1)
s[i,j]=(1:9)[pool]
if (sum(pool)==0){
rmrk=sample(0:(i-1),1,prob=1/(1:i)^9)
s[(i-rmrk):i, ]=0
i=i-rmrk-1
break()

}}}
i=i+1
}

and it produces valid sudokus! The question is rather to test how uniform this generator is (the authors replace uniformity with unbiasedness, resulting in a simplistic necessary condition on the generator!) and the paper is only considering an answer based on the average of the s[i,j]’s, which are naturally equidistributed over 1,2,…,9! While this answer is unsatisfactory, I have no clear idea on how to attack the problem. For instance, I ran an Rperiment looking for the sequence 1,2,…,9 to appear in either a row or a column: I think the probability of occurrence of this sequence is 1/8! (since there is always one row and one column starting with 1), i.e. 2.5 10-5, while 36000 simulations from the above showed a frequency of 5.5 10-5 both for rows and columns (i.e., 2 occurences over the 36000 replicas!). This only shows the need for a much larger experiment (and presumably a move from R to C  if I want the answer quickly!). Another test that is permutation invariant would be to check for the distribution of the number of different digits on the diagonal of the sudoku matrix, but I am not certain about the theoretical distribution. For instance, running 10,000 simulations, the average number of different digits is 6.3, with no occurrence of 1, 2 or 3, and the binomial fit is poor.