mad statistic
In the motivating toy example to our ABC model choice paper, we compare summary statistics, mean, median, variance, and… median absolute deviation (mad). The latest is the only one able to discriminate between our normal and Laplace models (as now discussed on Cross Validated!). When rerunning simulations to produce nicer graphical outcomes (for the revision), I noticed a much longer run time associated with the computation of the mad statistic. Here is a comparison for the computation of the mean, median, and mad on identical simulations:
> system.time(mmean(10^5)) user system elapsed 4.040 0.056 4.350 > system.time(mmedian(10^5)) user system elapsed 12.509 0.012 15.353 > system.time(mmad(10^5)) user system elapsed 23.345 0.036 23.458
Now, this is not particularly surprising: computing a median takes longer than computing a mean, even using quicksort!, hence computing two medians… Still, having to wait about six times longer for the delivery of a mad statistics is somehow…mad!
April 30, 2012 at 5:06 pm
Similar to Iain’s comment, but just to note that the deterministic O(n) time median algorithm (median of medians) is often slower than the expected O(n) time (worst case O(n^2)) algorithm (Hoare’s selection algorithm), which is a minor modification of quicksort (along the same lines as what Iain said).
If you are computing many medians of at least moderately sized data sets, e.g. at each iteration of an MCMC kernel, it may be worthwhile to do so using Hoare’s selection algorithm. I found in my MSc. thesis that this made a pretty big difference in computation time (I was computing a lot of medians at each step of a particle filter).
April 30, 2012 at 8:47 pm
Thanks to both Iain and Anthony for those remarks. Is there an R function that would go faster than median() and mad()?
May 2, 2012 at 3:43 pm
Unfortunately, I don’t know if anyone has implemented one of these in R.
April 30, 2012 at 4:08 pm
What are these functions, mmedian/mmad? Are they part of some R package?
April 30, 2012 at 4:50 pm
no, they are my functions, computing mean, median, and mad a certain number of times to get a stable system.time(.)….
April 30, 2012 at 1:32 pm
It is possible to find a median in O(n), rather than the quicksort’s O(n.log(n)): there’s no need to sort buckets that can’t contain the median. But I rarely bother with these fancy methods; sorting is hardly ever a bottleneck in any of my code.
If it really is taking lots of time for *huge* vectors, do we really need the exact median? One could get a decent estimate from a subsample. Not looking at all the data is cheaper than O(n) :-).