## Archive for Monte Carlo

Posted in R, Statistics, University life with tags , , , , , , , , , on January 5, 2011 by xi'an

Yves Atchadé presented a very recent work on the fundamental issue of estimating the asymptotic variance estimation for adaptive MCMC algorithms, with an intriguing experimental observation that a non-converging bandwidth with rate 1/n was providing better coverage than the converging rate. (I always found the issue of estimating the asymptotic variance both a tough problem and an important item in convergence assessment.) Galin Jones showed new regeneration results for componentwise MCMC samplers, with applications to quantile estimation. The iid structure produced by the regeneration mechanism allows rather naturally to introduce an adaptive improvement in those algorithms, if regeneration occurs often enough. (From the days of my Stat’Sci’ paper on convergence assessment, I  love regeneration techniques for both theoretical and methodological reasons, even though they are often difficult to efficiently implement in practice.) Matti Vihola summarised several of his recent papers on the stability and convergence of adaptive MCMC algorithms, pursuing the Finnish tradition of leadership in adaptive algorithms! One point I found particularly interesting was the possibility of separating ergodicity from the Law of Large Numbers, thus reducing the constraints imposed by the containment condition. In the afternoon, Dawn Woodard discussed the convergence rate of the Gibbs sampler used for genomic motif discovery by Liu, Lawrence and Neuwald (1995). Scott Schmidler concluded the workshop by a far-ranging talk distinguishing between exploration and exploitation in adaptive MCMC algorithms, ie mixing vs burning, with illustrations using the Wang-Landau algorithm.

Thus, as in the previous editions of Adap’ski, we have had a uniformly high quality of talks about the current research in the area of adaptive algorithms (and a wee further). This shows the field is very well active and expanding, aiming at reaching a wider audience by providing verifiable convergence conditions and semi-automated softwares (like Jeff Rosenthal’s amcmc R code we used in Introducing Monte Carlo Methods with R). Looking forward Adap’ski 4 (Adap’skiV?!), hopefully in Europe and why not in Chamonix?! Which could then lead us to call the next meeting Adap’skiX…

## Random sudokus [p-values]

Posted in R, Statistics with tags , , , , , , , on May 21, 2010 by xi'an

I reran the program checking the distribution of the digits over 9 “diagonals” (obtained by acceptable permutations of rows and column) and this test again results in mostly small p-values. Over a million iterations, and the nine (dependent) diagonals, four p-values were below 0.01, three were below 0.1, and two were above (0.21 and 0.42). So I conclude in a discrepancy between my (full) sudoku generator and the hypothesised distribution of the (number of different) digits over the diagonal. Assuming my generator is a faithful reproduction of the one used in the paper by Newton and DeSalvo, this discrepancy suggests that their distribution over the sudoku grids do not agree with this diagonal distribution, either because it is actually different from uniform or, more likely, because the uniform distribution I use over the (groups of three over the) diagonal is not compatible with a uniform distribution over all sudokus…

## ACM Transactions on Modeling and Computer Simulation

Posted in Books, R, Statistics, University life with tags , , , on May 21, 2010 by xi'an

Pierre Lecuyer is the new editor of the ACM Transactions on Modeling and Computer Simulation (TOMACS) and he has asked me to become an Area Editor for the new area of simulation in Statistics. I am quite excited by this new Æditor’s hat, since this is a cross-disciplinary journal:

The ACM Transactions on Modeling and Computer Simulation (TOMACS) provides a single archival source for the publication of high-quality research and developmental results in computer simulation. The subjects of emphasis are discrete event simulation, combined discrete and continuous simulation, as well as Monte Carlo methods. Papers in continuous simulation will also receive serious consideration if their contributions to modeling and simulation in general are substantial.

The use of simulation techniques is pervasive, extending to virtually all the sciences. TOMACS serves to enhance the understanding, improve the practice, and increase the utilization of computer simulation. Submissions should contribute to the realization of these objectives, and papers treating applications should stress their contributions vis-a-vis these objectives.

As an indication of this cross-disciplinarity, I note that most Area Editors and Associate Editors are unknown to me (except for Luc Devroye, of course!). In addition, I savour the irony of being associated with a journal of the Association for Computer Machinery (ACM), given my complete lack of practical skills! So, if you have relevant papers to submit in the field, please consider the ACM Transactions on Modeling and Computer Simulation (TOMACS) as a possible outlet.

## Random [uniform?] sudokus [corrected]

Posted in R, Statistics with tags , , , , , on May 19, 2010 by xi'an

As the discrepancy [from 1] in the sum of the nine probabilities seemed too blatant to be attributed to numerical error given the problem scale, I went and checked my R code for the probabilities and found a choose(9,3) instead of a choose(6,3) in the last line… The fit between the true distribution and the observed frequencies is now much better

but the chi-square test remains suspicious of the uniform assumption (or again of my programming abilities):

> chisq.test(obs,p=pdiag)
Chi-squared test for given probabilities
data:  obs
X-squared = 16.378, df = 6, p-value = 0.01186

since a p-value of 1% is a bit in the far tail of the distribution.

## Random [uniform?] sudokus

Posted in R, Statistics with tags , , , , , , on May 19, 2010 by xi'an

A longer run of the R code of yesterday with a million sudokus produced the following qqplot.

It does look ok but no perfect. Actually, it looks very much like the graph of yesterday, although based on a 100-fold increase in the number of simulations. Now, if I test the adequation with a basic chi-square test (!), the result is highly negative:

> chisq.test(obs,p=pdiag/sum(pdiag)) #numerical error in pdiag
Chi-squared test for given probabilities
data:  obs
X-squared = 6978.503, df = 6, p-value < 2.2e-16

(there are seven entries for both obs and pdiag, hence the six degrees of freedom). So this casts a doubt upon the uniformity of the random generator suggested in the paper by Newton and DeSalvo or rather on my programming abilities, see next post!

## Random sudokus [test]

Posted in R, Statistics with tags , , , , , , on May 18, 2010 by xi'an

Robin Ryder pointed out to me that 3 is indeed the absolute minimum one could observe because of the block constraint (bon sang, mais c’est bien sûr !). The distribution of the series of 3 digits being independent over blocks, the theoretical distribution under uniformity can easily be simulated:

#uniform distribution on the block diagonal
sheik=rep(0,9)
for (t in 1:10^6){
group=length(unique(c(sample(1:9,3),sample(1:9,3),sample(1:9,3))))
sheik[group]=sheik[group]+1
}

and it produces a result that is close enough to the one observed with the random sudoku generator. Actually, the exact distribution is available as (corrected on May 19!)

pdiag=c(1, #k=3
(3*6+3*6*4), #k=4
(3*choose(6,2)+3*6*5*choose(4,2)+3*choose(5,3)*choose(6,2)), #k=5
(choose(6,3)+3*6*choose(5,2)*4+3*choose(6,2)*choose(5,2)*4+
choose(6,3)*choose(6,3)),#k=6
(choose(3,2)*6*choose(5,3)+3*choose(6,2)*choose(4,2)*5+
choose(6,3)*choose(6,2)*3), #k=7
(3*choose(6,2)*4+choose(6,3)*6*choose(3,2)), #k=8
choose(6,3))/choose(9,3)^2 #k=9
choose(9,6))/choose(9,3)^2 #k=9

hence a better qq-plot:

## Random sudokus

Posted in R, Statistics with tags , , , , , on May 17, 2010 by xi'an

After thinking about random sudokus for a few more weeks, I eventually came to read the paper by Newton and DeSalvo about the entropy of sudoku matrices. As written earlier, if we consider (as Newton and DeSakvo) a uniform distribution where the sudokus are drawn uniformly over the set of all sudokus, the entropy of this distribution is $\log(N)$, where N is the size of this set of all sudokus. The (Shannon) entropy considered in the paper is completely different: it is

$H = -\sum_{i=1}^9 \hat \sigma_i \log(\hat \sigma_i)$

where the $\hat \sigma_i$‘s are the normalised transforms of the singular values of the sudoku matrix. The result that the average H is about 1.733, being larger than the same average over a collection of purely random matrices, 1.651, is then much less spectacular…

Another entry in the same paper is worth pondering about: the authors consider a random selection mechanism, working one entry at a time from the first row downwards, with a backward step in case of impossible entries. An R rendering of this generator is

s=matrix(0,ncol=9,nrow=9)
pool=rep(TRUE,9)
s[1,]=sample(1:9,9)
i=2;
while (i<10){
boxa=3*trunc((i-1)/3)+1;boxa=boxa:(boxa+2)
for (j in 1:9){
boxb=3*trunc((j-1)/3)+1;boxb=boxb:(boxb+2)
for (u in (1:9))
pool[u]=(sum(u==s[i,])+sum(u==s[,j]) +sum(u==s[boxa,boxb]))==0
if (sum(pool)>1){
s[i,j]=sample((1:9)[pool],1)
}else{
if (sum(pool)==1)
s[i,j]=(1:9)[pool]
if (sum(pool)==0){
rmrk=sample(0:(i-1),1,prob=1/(1:i)^9)
s[(i-rmrk):i, ]=0
i=i-rmrk-1
break()

}}}
i=i+1
}

and it produces valid sudokus! The question is rather to test how uniform this generator is (the authors replace uniformity with unbiasedness, resulting in a simplistic necessary condition on the generator!) and the paper is only considering an answer based on the average of the s[i,j]‘s, which are naturally equidistributed over 1,2,…,9! While this answer is unsatisfactory, I have no clear idea on how to attack the problem. For instance, I ran an Rperiment looking for the sequence 1,2,…,9 to appear in either a row or a column: I think the probability of occurrence of this sequence is 1/8! (since there is always one row and one column starting with 1), i.e. 2.5 10-5, while 36000 simulations from the above showed a frequency of 5.5 10-5 both for rows and columns (i.e., 2 occurences over the 36000 replicas!). This only shows the need for a much larger experiment (and presumably a move from R to C  if I want the answer quickly!). Another test that is permutation invariant would be to check for the distribution of the number of different digits on the diagonal of the sudoku matrix, but I am not certain about the theoretical distribution. For instance, running 10,000 simulations, the average number of different digits is 6.3, with no occurrence of 1, 2 or 3, and the binomial fit is poor.