## MCMC for conditional Bernoullis

Posted in Books, Statistics, University life with tags , , , , , on February 22, 2021 by xi'an

Jeremy Heng, Pierre Jacob [currently in Paris!] and Nianqiao Ju are in a recent arXival considering the simulation of a conditional Bernoulli, namely generating a vector of N Bernoullis with different probabilities under the constraint that their sum is fixed as I. Rather than going for a perfect simulator, with cost O(NI), they opt for the simplest of MCMC samplers, where a 0 and a 1 entries are exchanged at random. In connection with a recent spate of MCMC works using coupling, they establish convergence in O(N log N) steps, even when the probabilities are arbitrarily close to zero and one. Including the case when they are Uniformly generated. From a mundane perspective, I wonder at the appeal of using the probabilities to select the exchange pair. I realise sorting the probabilities is already of order O(N log N) avoiding selecting highly probable 1’s and highly probable 0’s should speed up converge, unless the gain is negligible. And to link MCMC and exact simulation in this setting, what would the cost of perfect sampling by sampling from the past be? Presumably much higher since there is little chance a total ordering can be found on the starting states.

## Le Monde puzzle [#1110]

Posted in Books, Kids, R, Travel with tags , , , , , , , , , on September 16, 2019 by xi'an

A low-key sorting problem as Le Monde current mathematical puzzle:

If the numbers from 1 to 67 are randomly permuted and if the sorting algorithm consists in picking a number i with a position higher than its rank i and moving it at the correct i-th position, what is the maximal number of steps to sort this set of integers when the selected integer is chosen optimaly?

As the question does not clearly state what happens to the number j that stood in the i-th position, I made the assumption that the entire sequence from position i to position n is moved one position upwards (rather than having i and j exchanged). In which case my intuition was that moving the smallest moveable number was optimal, resulting in the following R code

```sorite<-function(permu){ n=length(permu) p=0 while(max(abs(permu-(1:n)))>0){
j=min(permu[permu<(1:n)])
p=p+1
permu=unique(c(permu[(1:n)<j],j,permu[j:n]))}
return(p)}
```

which takes at most n-1 steps to reorder the sequence. I however checked this insertion sort was indeed the case through a recursive function

```resorite<-function(permu){ n=length(permu);p=0 while(max(abs(permu-(1:n)))>0){
j=cand=permu[permu<(1:n)]
if (length(cand)==1){
p=p+1
permu=unique(c(permu[(1:n)<j],j,permu[j:n]))
}else{
sol=n^3
for (i in cand){
qermu=unique(c(permu[(1:n)<i],i,permu[i:n]))
rol=resorite(qermu)
if (rol<sol)sol=rol}
p=p+1+sol;break()}}
return(p)}
```

which did confirm my intuition.

## Le Monde puzzle [#855]

Posted in Books, Kids, Statistics with tags , , , , , , on March 7, 2014 by xi'an

A Le Monde mathematical puzzle that reminds me of an earlier one:

Given ten tokens with different unknown weights, and a scale that can rank three tokens at a time, starting with ranking three tokens, what is the minimum number of actions necessary to rank the ten of them if (a) one token at a time is added, (b) one or two tokens are added? If no restriction is imposed on the token introduction, is there a more efficient strategy?

It indeed relates to earlier posts on sorting and ternary sorting. Checking further on StackOverflow I found this reply to a question about ternary sorting:

Average number of comparisons:

``````in ternary search = ((1/3)*1+(2/3)*2)*ln(n)/ln(3)~1.517*ln(n)
in binary search  =                 1*ln(n)/ln(2)~1.443*ln(n)
``````

Worst number of comparisons:

``````in ternary search = 2*ln(n)/ln(3)~1.820*ln(n)
in binary search  = 1*ln(n)/ln(2)~1.443*ln(n)``````

albeit with no reference. So this somewhat answers part (c) of the question. (If asymptotically since this does not work for n=10! Even for n=100, it is way too large.) Looking for a solution to part (a), I looked at the performances of a dyadic sorting algorithm, partitioning recursively the already sorted part of the sample into three parts to locate each time the proper position of the new token. This led me to the following code

```rang=function(x,ranj,taken){

mag=max(ranj)-min(ranj)+1
i1=ranj[1]-1+trunc(mag/3)
i2=ranj[1]-1+trunc(2*mag/3)
locrk=rank(c(taken[c(i1,i2)],x))

if (locrk[3]==1) return(ifelse(i1>3,
1+rang(x,ranj=1:(i1-1),taken),
1+(i1>1)))
if (locrk[3]==2) return(ifelse(i2-i1>4,
1+rang(x,ranj=(i1+1):(i2-1),taken),ranj=1:(i1-1),taken),
1+(i1>1)))
if (locrk[3]==3) return(ifelse(mag-i2>2,
1+rang(x,ranj=(i2+1):mag,taken),ranj=1:(i1-1),taken),
1+(i1>1)))
}

ternone=function(toke){

toke[1:3]=sort(toke[1:3])
counter=1
for (ul in (4:length(toke))){

counter=counter+rang(toke[ul],1:(ul-1),toke)
toke[1:ul]=sort(toke[1:ul])
}

return(counter)
}

ternone(sample(1:10))
```

which produces a minimum of eight (of course!) and a maximum of 20 uses of the scale (based on 10⁶ random permutations of (1,…,10)). Unsurprisingly, the solution proposed in Le Monde the week after does better as it obtains 16 as the worst case figure. Repeating the experiment with n=100 values, the maximum was 303 (with a mean of 270 and a minimum of 240 ternary comparisons).

Moving to an insertion two tokens at a time, I tested a scheme where two new tokens were tested against the current median, then the median of one half, and so on until they split on both sides of this median. Here is the corresponding R code

```ring=function(x,y,ranj,taken){

mag=max(ranj)-min(ranj)+1
i1=ranj[1]-1+trunc(mag/2)
locrk=rank(c(taken[i1],x,y))

if (locrk[1]==3) return(ifelse(i1>2,
1+ring(x,y,ranj=1:(i1-1),taken),
1+(i1==2)))
if (locrk[1]==2) return(ifelse(mag>4,
1+rang(min(x,y),ranj=min(ranj):(i1-1),taken)
+rang(max(x,y),(i1+1):max(ranj),taken),1))
if (locrk[1]==1) return(ifelse(mag-i1>2,
1+ring(x,y,ranj=(i1+1):mag,taken),
1+(mag-i1>1)))
}

terntwo=function(toke){

toke[1:3]=sort(toke[1:3])
counter=1+rang(toke[4],1:3,toke)
toke[1:4]=sort(toke[1:4])

for (ul in -1+2*(3:(length(toke)/2))){

counter=counter+ring(toke[ul],toke[ul+1],1:(ul-1),toke)
toke[1:ul]=sort(toke[1:ul])
ul=ul+1
}

return(counter)
}
```

leading to a value of 13.

This Feb. 19 issue of Le Monde Science&Médecine leaflet also contains a two page report on growing psychological work-related issues like stress, burn-out and depression, in research labs, mostly connected with the increase in administrative duties and grant writing, duties most of us have not been trained for. There is also a short column by Etienne Gys telling about the short lived claim by Moukhtarbaï Otelbaev to have solved the Navier-Stokes problem. Until Terry Tao wrote a paper stating why the proof could not work.

## Le Monde puzzle [#825]

Posted in Books, Kids, pictures, R with tags , , , , , on June 19, 2013 by xi'an

Yet another puzzle which first part does not require R programming, even though it is a programming question in essence:

Given five real numbers x1,…,x5, what is the minimal number of pairwise comparisons needed to rank them? Given 33 real numbers, what is the minimal number of pairwise comparisons required to find the three largest ones?

I do not see a way out of considering the first question as the quickest possible sorting of a real sample. Using either quicksort or heapsort, I achieve sorting the 5 numbers in exactly 6 comparisons for any order of the initial sample. (Now, there may be an even faster way based on comparing partial sums first… I just do not see how!) Update: Oops! I realised I made my reasoning based on a reasonable case, the correct answer is indeed 7!!!

For the second part, let us start from the remark that 32 comparisons are needed to find the largest number, then at most 31 for the second largest, and at most 30 for the third largest (since we can take advantage of the partial ordering resulting from the determination of the largest number). This is poor. If I instead use a heap algorithm, I need O(n log{n}) comparisons to build this binary tree whose parents are always larger than their siblings, as in the above example. (I can produce a sort of heap structure, although non-binary, in an average 16×2.5=40 steps. And a maximum 16×3=48 steps.) The resulting tree provides the largest number (100 in the above example) and at least the second largest number (36 in the above). To get the third largest number, I first need a comparison between the one-before-last terms of the heap (19 vs. 36 in the above), and one or two extra comparisons (25 vs. 19 and maybe 25 vs. 1 in the above). (This would induce an average 1.5 extra comparison and a maximum 2 extra comparisons, resulting in a total of 41.5 average and 49.5 maximum comparisons with my sub-optimal heap construction.)  Once again, using comparisons of sums may help in speeding up the process, for instance comparing numbers by groups of 3, but I did not pursue this solution…

If instead I try to adapt quicksort to this problem, I can have a dynamic pivot that always keep at most two terms above it, providing the three numbers as a finale result. Here is an R code to check its performances:

```quick3=function(x){

comp=0
i=1
lower=upper=NULL
pivot=x[1]

for (i in 2:length(x)){

if (x[i]<pivot){ lower=c(lower,x[i])
}else{

upper=c(upper,x[i])
if (length(upper)>1) comp=comp+1}

comp=comp+1

if (length(upper)==3){

pivot=min(upper)
upper=sort(upper)[-1]
}}

if (length(upper)<3) upper=c(pivot,upper)

list(top=upper,cost=comp)
}
```

When running this R code on 10⁴ random sequences of 33 terms, I obtained the following statistics, I obtained the following statistics on the computing costs

```</p>
<p style="text-align: justify;">> summary(costs)
Min.   1st Qu.  Median  Mean    3rd Qu.  Max.
32.00   36.00   38.00   37.89   40.00   49.00
```

and the associated histogram represented above. Interestingly, the minimum is the number of comparisons needed to produce the maximum!

Reading the solution in Le Monde in the train to London and Bayes 250, I discovered that the author suggests a 7 comparison solution in the first case [compare A and B, C and D = 2 comparisons; if A>B and C>D, compare A and C and get, say A>C>D = 1 comparison; insert E in this ordering by comparing with C and then A or D = 2 comparisons, obtaining e.g. A>E>C>D; conclude by inserting B by first comparing with C then with D or E = 2 comparisons] and a 41 comparison solution in the second case. He is (or I am!) I was clearly mistaken in the first case while the quick3 algorithm does 41 or less most of the time (90%)  but not always.

Posted in R, Statistics, University life with tags , , , , , on April 30, 2012 by xi'an

In the motivating toy example to our ABC model choice paper, we compare summary statistics, mean, median, variance, and… median absolute deviation (mad). The latest is the only one able to discriminate between our normal and Laplace models (as now discussed on Cross Validated!). When rerunning simulations to produce nicer graphical outcomes (for the revision), I noticed a much longer run time associated with the computation of the mad statistic. Here is a comparison for the computation of the mean, median, and mad on identical simulations:

```> system.time(mmean(10^5))
user  system elapsed
4.040   0.056   4.350
> system.time(mmedian(10^5))
user  system elapsed
12.509   0.012  15.353