Yet another puzzle which first part does not require R programming, even though it is a programming question in essence:
Given five real numbers x1,…,x5, what is the minimal number of pairwise comparisons needed to rank them? Given 33 real numbers, what is the minimal number of pairwise comparisons required to find the three largest ones?
I do not see a way out of considering the first question as the quickest possible sorting of a real sample. Using either quicksort or heapsort, I achieve sorting the 5 numbers in exactly 6 comparisons for any order of the initial sample. (Now, there may be an even faster way based on comparing partial sums first… I just do not see how!) Update: Oops! I realised I made my reasoning based on a reasonable case, the correct answer is indeed 7!!!
For the second part, let us start from the remark that 32 comparisons are needed to find the largest number, then at most 31 for the second largest, and at most 30 for the third largest (since we can take advantage of the partial ordering resulting from the determination of the largest number). This is poor. If I instead use a heap algorithm, I need O(n log{n}) comparisons to build this binary tree whose parents are always larger than their siblings, as in the above example. (I can produce a sort of heap structure, although non-binary, in an average 16×2.5=40 steps. And a maximum 16×3=48 steps.) The resulting tree provides the largest number (100 in the above example) and at least the second largest number (36 in the above). To get the third largest number, I first need a comparison between the one-before-last terms of the heap (19 vs. 36 in the above), and one or two extra comparisons (25 vs. 19 and maybe 25 vs. 1 in the above). (This would induce an average 1.5 extra comparison and a maximum 2 extra comparisons, resulting in a total of 41.5 average and 49.5 maximum comparisons with my sub-optimal heap construction.) Once again, using comparisons of sums may help in speeding up the process, for instance comparing numbers by groups of 3, but I did not pursue this solution…
If instead I try to adapt quicksort to this problem, I can have a dynamic pivot that always keep at most two terms above it, providing the three numbers as a finale result. Here is an R code to check its performances:
quick3=function(x){
comp=0
i=1
lower=upper=NULL
pivot=x[1]
for (i in 2:length(x)){
if (x[i]<pivot){ lower=c(lower,x[i])
}else{
upper=c(upper,x[i])
if (length(upper)>1) comp=comp+1}
comp=comp+1
if (length(upper)==3){
pivot=min(upper)
upper=sort(upper)[-1]
}}
if (length(upper)<3) upper=c(pivot,upper)
list(top=upper,cost=comp)
}
When running this R code on 10⁴ random sequences of 33 terms, I obtained the following statistics, I obtained the following statistics on the computing costs
</p>
<p style="text-align: justify;">> summary(costs)
Min. 1st Qu. Median Mean 3rd Qu. Max.
32.00 36.00 38.00 37.89 40.00 49.00
and the associated histogram represented above. Interestingly, the minimum is the number of comparisons needed to produce the maximum!
Reading the solution in Le Monde in the train to London and Bayes 250, I discovered that the author suggests a 7 comparison solution in the first case [compare A and B, C and D = 2 comparisons; if A>B and C>D, compare A and C and get, say A>C>D = 1 comparison; insert E in this ordering by comparing with C and then A or D = 2 comparisons, obtaining e.g. A>E>C>D; conclude by inserting B by first comparing with C then with D or E = 2 comparisons] and a 41 comparison solution in the second case. He is (or I am!) I was clearly mistaken in the first case while the quick3 algorithm does 41 or less most of the time (90%) but not always.