mad statistic

In the motivating toy example of our ABC model choice paper, we compare summary statistics: mean, median, variance, and… median absolute deviation (mad). The latter is the only one able to discriminate between our normal and Laplace models (as now discussed on Cross Validated!). When rerunning the simulations to produce nicer graphical outputs (for the revision), I noticed a much longer run time associated with the computation of the mad statistic. Here is a comparison of the computation of the mean, median, and mad on identical simulations:

> system.time(mmean(10^5))
   user  system elapsed
  4.040   0.056   4.350
> system.time(mmedian(10^5))
   user  system elapsed
 12.509   0.012  15.353
> system.time(mmad(10^5))
   user  system elapsed
 23.345   0.036  23.458

Now, this is not particularly surprising: computing a median takes longer than computing a mean, even with quicksort, and computing a mad means computing two medians… Still, having to wait about six times longer for the delivery of a mad statistic is somehow… mad!
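The factor of roughly two over a single median is consistent with the definition of the mad, which requires two median computations. The functions mmean, mmedian and mmad above are not shown in the post, so here is a hypothetical reconstruction of the experiment, assuming each simply repeats the statistic over fresh normal samples:

```r
# Hypothetical reconstruction: repeat a statistic over fresh samples
# so the timing is measurable (the actual mmean/mmedian/mmad wrappers
# are not given in the post).
mstat <- function(n, stat, reps = 100) {
  for (i in 1:reps) stat(rnorm(n))
  invisible(NULL)
}

# mad(x) is median(|x - median(x)|) up to a scale constant,
# hence two median computations per call:
x <- rnorm(10^5)
all.equal(mad(x, constant = 1), median(abs(x - median(x))))

system.time(mstat(10^5, mean))
system.time(mstat(10^5, median))
system.time(mstat(10^5, mad))
```

(Base R's `mad` rescales by 1.4826 by default so that it estimates the standard deviation under normality; `constant = 1` recovers the raw median absolute deviation.)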

6 Responses to “mad statistic”

  1. Similar to Iain’s comment, but just to note that the deterministic O(n) time median algorithm (median of medians) is often slower than the expected O(n) time (worst case O(n^2)) algorithm (Hoare’s selection algorithm), which is a minor modification of quicksort (along the same lines as what Iain said).

    If you are computing many medians of at least moderately sized data sets, e.g. at each iteration of an MCMC kernel, it may be worthwhile to do so using Hoare’s selection algorithm. I found in my MSc. thesis that this made a pretty big difference in computation time (I was computing a lot of medians at each step of a particle filter).

  2. What are these functions, mmedian/mmad? Are they part of some R package?

  3. Iain Murray Says:

It is possible to find a median in O(n), rather than quicksort's O(n log n): there is no need to sort buckets that cannot contain the median. But I rarely bother with these fancy methods; sorting is hardly ever a bottleneck in any of my code.

    If it really is taking lots of time for *huge* vectors, do we really need the exact median? One could get a decent estimate from a subsample. Not looking at all the data is cheaper than O(n) :-).
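The subsampling suggestion can be checked in a couple of lines; the estimate is noisy but cheap (the subsample size below is an arbitrary illustrative choice):

```r
set.seed(1)
x <- rnorm(10^6)
sub <- sample(x, 10^4)               # look at 1% of the data
c(exact = median(x), estimate = median(sub))
```

For a sample median on n points, the standard error shrinks like 1/sqrt(n), so a 10^4-point subsample already pins the median of a standard normal down to about ±0.01.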
