Archive for quicksort

Le Monde puzzle [#855]

Posted in Books, Kids, Statistics with tags , , , , , , on March 7, 2014 by xi'an

A Le Monde mathematical puzzle that reminds me of an earlier one:

Given ten tokens with different unknown weights, and a scale that can rank three tokens at a time, starting with ranking three tokens, what is the minimum number of actions necessary to rank the ten of them if (a) one token at a time is added, (b) one or two tokens are added? If no restriction is imposed on the token introduction, is there a more efficient strategy?

It indeed relates to earlier posts on sorting and ternary sorting. Checking further on StackOverflow I found this reply to a question about ternary sorting:

Average number of comparisons:

in ternary search = ((1/3)*1+(2/3)*2)*ln(n)/ln(3)~1.517*ln(n)
in binary search  =                 1*ln(n)/ln(2)~1.443*ln(n)

Worst number of comparisons:

in ternary search = 2*ln(n)/ln(3)~1.820*ln(n)
in binary search  = 1*ln(n)/ln(2)~1.443*ln(n)

albeit with no reference. So this somewhat answers part (c) of the question. (If asymptotically since this does not work for n=10! Even for n=100, it is way too large.) Looking for a solution to part (a), I looked at the performances of a dyadic sorting algorithm, partitioning recursively the already sorted part of the sample into three parts to locate each time the proper position of the new token. This led me to the following code

rang=function(x,ranj,taken){

  mag=max(ranj)-min(ranj)+1
  i1=ranj[1]-1+trunc(mag/3)
  i2=ranj[1]-1+trunc(2*mag/3)
  locrk=rank(c(taken[c(i1,i2)],x))

  if (locrk[3]==1) return(ifelse(i1>3,
     1+rang(x,ranj=1:(i1-1),taken),
     1+(i1>1)))
  if (locrk[3]==2) return(ifelse(i2-i1>4,
     1+rang(x,ranj=(i1+1):(i2-1),taken),ranj=1:(i1-1),taken),
     1+(i1>1)))
  if (locrk[3]==3) return(ifelse(mag-i2>2,
     1+rang(x,ranj=(i2+1):mag,taken),ranj=1:(i1-1),taken),
     1+(i1>1)))
}

ternone=function(toke){

toke[1:3]=sort(toke[1:3])
counter=1
for (ul in (4:length(toke))){

  counter=counter+rang(toke[ul],1:(ul-1),toke)
  toke[1:ul]=sort(toke[1:ul])
  }

return(counter)
}

ternone(sample(1:10))

which produces a minimum of eight (of course!) and a maximum of 20 uses of the scale (based on 10⁶ random permutations of (1,…,10)). Unsurprisingly, the solution proposed in Le Monde the week after does better as it obtains 16 as the worst case figure. Repeating the experiment with n=100 values, the maximum was 303 (with a mean of 270 and a minimum of 240 ternary comparisons).

Moving to an insertion two tokens at a time, I tested a scheme where two new tokens were tested against the current median, then the median of one half, and so on until they split on both sides of this median. Here is the corresponding R code

ring=function(x,y,ranj,taken){

mag=max(ranj)-min(ranj)+1
i1=ranj[1]-1+trunc(mag/2)
locrk=rank(c(taken[i1],x,y))

if (locrk[1]==3) return(ifelse(i1>2,
    1+ring(x,y,ranj=1:(i1-1),taken),
    1+(i1==2)))
if (locrk[1]==2) return(ifelse(mag>4,
    1+rang(min(x,y),ranj=min(ranj):(i1-1),taken)
    +rang(max(x,y),(i1+1):max(ranj),taken),1))
if (locrk[1]==1) return(ifelse(mag-i1>2,
    1+ring(x,y,ranj=(i1+1):mag,taken),
    1+(mag-i1>1)))
}

terntwo=function(toke){

toke[1:3]=sort(toke[1:3])
counter=1+rang(toke[4],1:3,toke)
toke[1:4]=sort(toke[1:4])

for (ul in -1+2*(3:(length(toke)/2))){

  counter=counter+ring(toke[ul],toke[ul+1],1:(ul-1),toke)
  toke[1:ul]=sort(toke[1:ul])
  ul=ul+1
  }

return(counter)
}

leading to a value of 13.

This Feb. 19 issue of Le Monde Science&Médecine leaflet also contains a two page report on growing psychological work-related issues like stress, burn-out and depression, in research labs, mostly connected with the increase in administrative duties and grant writing, duties most of us have not been trained for. There is also a short column by Etienne Gys telling about the short lived claim by Moukhtarbaï Otelbaev to have solved the Navier-Stokes problem. Until Terry Tao wrote a paper stating why the proof could not work.

Le Monde puzzle [#825]

Posted in Books, Kids, pictures, R with tags , , , , , on June 19, 2013 by xi'an

Yet another puzzle which first part does not require R programming, even though it is a programming question in essence:

Given five real numbers x1,…,x5, what is the minimal number of pairwise comparisons needed to rank them? Given 33 real numbers, what is the minimal number of pairwise comparisons required to find the three largest ones?

I do not see a way out of considering the first question as the quickest possible sorting of a real sample. Using either quicksort or heapsort, I achieve sorting the 5 numbers in exactly 6 comparisons for any order of the initial sample. (Now, there may be an even faster way based on comparing partial sums first… I just do not see how!) Update: Oops! I realised I made my reasoning based on a reasonable case, the correct answer is indeed 7!!!

For the second part, let us start from the remark that 32 comparisons are needed to find the largest number, then at most 31 for the second largest, and at most 30 for the third largest (since we can take advantage of the partial ordering resulting from the determination of the largest number). This is poor. If I instead use a heap algorithm, I need O(n log{n}) comparisons to build this binary tree whose parents are always larger than their siblings, as in the above example. (I can produce a sort of heap structure, although non-binary, in an average 16×2.5=40 steps. And a maximum 16×3=48 steps.) The resulting tree provides the largest number (100 in the above example) and at least the second largest number (36 in the above). To get the third largest number, I first need a comparison between the one-before-last terms of the heap (19 vs. 36 in the above), and one or two extra comparisons (25 vs. 19 and maybe 25 vs. 1 in the above). (This would induce an average 1.5 extra comparison and a maximum 2 extra comparisons, resulting in a total of 41.5 average and 49.5 maximum comparisons with my sub-optimal heap construction.)  Once again, using comparisons of sums may help in speeding up the process, for instance comparing numbers by groups of 3, but I did not pursue this solution…

quick3If instead I try to adapt quicksort to this problem, I can have a dynamic pivot that always keep at most two terms above it, providing the three numbers as a finale result. Here is an R code to check its performances:

quick3=function(x){

comp=0
i=1
lower=upper=NULL
pivot=x[1]

for (i in 2:length(x)){

if (x[i]<pivot){ lower=c(lower,x[i])
}else{

upper=c(upper,x[i])
if (length(upper)>1) comp=comp+1}

comp=comp+1

if (length(upper)==3){

pivot=min(upper)
upper=sort(upper)[-1]
}}

if (length(upper)<3) upper=c(pivot,upper)

list(top=upper,cost=comp)
}

When running this R code on 10⁴ random sequences of 33 terms, I obtained the following statistics, I obtained the following statistics on the computing costs

</p>
<p style="text-align: justify;">> summary(costs)
Min.   1st Qu.  Median  Mean    3rd Qu.  Max.
32.00   36.00   38.00   37.89   40.00   49.00

and the associated histogram represented above. Interestingly, the minimum is the number of comparisons needed to produce the maximum!

Reading the solution in Le Monde in the train to London and Bayes 250, I discovered that the author suggests a 7 comparison solution in the first case [compare A and B, C and D = 2 comparisons; if A>B and C>D, compare A and C and get, say A>C>D = 1 comparison; insert E in this ordering by comparing with C and then A or D = 2 comparisons, obtaining e.g. A>E>C>D; conclude by inserting B by first comparing with C then with D or E = 2 comparisons] and a 41 comparison solution in the second case. He is (or I am!) I was clearly mistaken in the first case while the quick3 algorithm does 41 or less most of the time (90%)  but not always.

mad statistic

Posted in R, Statistics, University life with tags , , , , , on April 30, 2012 by xi'an

In the motivating toy example to our ABC model choice paper, we compare summary statistics, mean, median, variance, and… median absolute deviation (mad). The latest is the only one able to discriminate between our normal and Laplace models (as now discussed on Cross Validated!). When rerunning simulations to produce nicer graphical outcomes (for the revision), I noticed a much longer run time associated with the computation of the mad statistic. Here is a comparison for the computation of the mean, median, and mad on identical simulations:

> system.time(mmean(10^5))
   user  system elapsed
  4.040   0.056   4.350
> system.time(mmedian(10^5))
user  system elapsed
12.509   0.012  15.353
> system.time(mmad(10^5))
   user  system elapsed
 23.345   0.036  23.458

Now, this is not particularly surprising: computing a median takes longer than computing a mean, even using quicksort!, hence computing two medians… Still, having to wait about six times longer for the delivery of a mad statistics is somehow…mad!

Ternary sorting

Posted in R with tags , , , , , on July 25, 2011 by xi'an

The last Le Monde puzzle made me wonder about the ternary version of the sorting algorithms, which all seem to be binary (compare x and y, then…). The problem is, given (only) a blackbox procedure that returns the relative order of three arbitrary numbers, how many steps are necessary to sort a series of n nnumbers? The heapsort entry in Wikipedia mentions a ternary sorting version, but does not get into details. Robert Sedgewick (author of a fabulous book on algorithmic I enjoyed very much when I started programming) has a talk about the optimality of quicksort where he mentions ternary sorting, but this seems to be more related with ties than with my problem… It is of course highly formal in that I do not know of a physical device that would justify moving from binary to ternary comparisons.

Le Monde puzzle [#28]

Posted in R with tags , , , , on July 22, 2011 by xi'an

The puzzle of last weekend in Le Monde was about finding the absolute rank of x9 when given the relative ranks of x1,….,x8 and the possibility to ask for relative ranks of three numbers at a time. In R terms, this means being able to use

> rank(x[-9])
[1] 1 7 4 6 8 3 2 5
> rank(x[1:3])
[1] 1 3 2

or yet being able to sort the first 8 components of x

> x[-9]=sort(x[-9])
> rank(x[c(1,8,9)])
[1] 1 3 2

and then use rank() over triplets to position x9 in the list. If x9 is an extreme value, calling for its rank relative to x1 and x8  is sufficient to find its position. Else repeating with the rank relative to x2 and x7, then relative to x3 and x6  , etc., produces the absolute rank of x9 in at most 4 steps. However, if we first get the rank relative to x1 and x8,  then the rank relative to x4 and x5, we only need at most an extra step to find the absolute rank of x9 in thus at most 3 steps. Yet however, if we start with the rank relative to x3 and x6, we are left with a single rank call to determine the absolute rank of x9 since, whatever the position of x9 relative to x3 and x6, there are only two remaining numbers to compare x9 with. So we can find the absolute rank of x9 in exactly two calls to rank.

The second part of the puzzle is to determine for an unknown series x of 80 numbers the maximum number of calls to rank(c(x1,x2,x3)) to obtain rank(x). Of course, the dumb solution is to check all 82160 triplets, but I presume a sort of bubblesort would do better, even though Wikipedia tells me that bubble sort has worst-case and average complexity both О(n2), while quicksort has an average (if not worse-case) O(n log(n)) complexity…. If we follow the one-step procedure obtained in the first part to insert a new number, starting with three numbers whose relative rank is obtained with one  single call, we need in toto

1 + (9-3)*2 + ((1+3*8+2)-9)*3+(80-27)*4 =279

calls to rank since 80<1+3*(3*8+2)+2=81. (This is also the solution proposed in Le Monde, however there is no proof inserting one number at a time is the optimal solution.)

Follow

Get every new post delivered to your Inbox.

Join 598 other followers