the birthday problem [X'idated]

The birthday problem (i.e. looking at the distribution of the birthdates in a group of n persons, assuming [wrongly] a uniform distribution of the calendar dates of those birthdates) is always a source of puzzlement [for me]! For instance, here is a recent post on Cross Validated:

I have 360 friends on facebook, and, as expected, the distribution of their birthdays is not uniform at all. I have one day with that has 9 friends with the same birthday. So, given that some days are more likely for a birthday, I’m assuming the number of 23 is an upperbound.

The figure 9 sounded unlikely, so I ran the following computation:

extreme=rep(0,360)
for (t in 1:10^5){
  i=max(diff((1:360)[!duplicated(sort(sample(1:365,360,rep=TRUE)))]))
  extreme[i]=extreme[i]+1
  }
extreme=extreme/10^5
barplot(extreme,xlim=c(0,30),names=1:360)

whose output shown on the above graph. (Actually, I must confess I first forgot the sort in the code, which led me to then believe that 9 was one of the most likely values and post it on Cross Validated! The error was eventually picked by one administrator. I should know better than trust my own R code!) According to this simulation, observing 9 or more people having the same birthdate has an approximate probability of 0.00032… Indeed, fairly unlikely!

Incidentally, this question led me to uncover how to print the above on this webpage. And to learn from the X’idated moderator whuber the use of tabulate. Which avoids the above loop:

> system.time(test(10^5)) #my code above
user  system elapsed
26.230   0.028  26.411
> system.time(table(replicate(10^5, max(tabulate(sample(1:365,360,rep=TRUE))))))
user  system elapsed
5.708   0.044   5.762

12 Responses to “the birthday problem [X'idated]”

  1. [...] problem generated by X’validated (on which I spent much too much time!): given an unbiased coin that produced M heads in the first M [...]

  2. Charalampos Chanialidis Says:

    I am not 100% sure so there is a chance of saying something that is completely wrong but I think that the code has a bug.
    Let’s say that:

    > x<-sort(sample(1:365,360,rep=TRUE)) 
    

    gives us sth like

    (1, 2, 3, ...,363,363,...,363)
    

    where the last 10 values is 363.
    That means that the !duplicated(x) in the last 10 positions will be
    T, F, F, …,F (one T and nine F's) where T is True and F is False.
    In that case the last value of diff((1:360)[!duplicated(x)]) will not be 10 (as it should) because the last difference it will compute will be from the T of the people who were born on the 363rd day of the year and the previous T.

    • Ok, thanks! You are right, if the largest number of common birthdates is the last birthdate of the series, the diff will not pick it up. This creates a small downward bias, mostly negligible. Now, I think the correction of the bug is to add 361 to the list of indices:

      max(diff(c((1:360)[!duplicated(sort(sample(1:365,360,rep=TRUE)))],361)))
      
  3. About computing the probability of observing 9 or more people having the same birthdate using Poisson approximation:

    the number of your friends having their birthday on one particular day in the year follows approximately a P(lambda) distribution, where lambda = 360/365; the probability that this number is 9 is approximately p9=exp(-l) l^9/9!, so the probability that none of the 365 days it 9 friends’ birthday is approximately (1-p9)^365 \approx 1-365p9 (and hardly different for 9 or more).

    Hence, the probability your are looking for is approximately 365p9 = 0.00033, quite close to your Monte-Carlo estimate.

    Note that all these approximations can be justified (and quantified) using Chen-Stein’s bound for Poisson approximation (which does not require independence).

  4. Actually, I think someone commented about the Galton-Watson process in relation with this post, but that I [or wordpress] mistakenly removed the comment. Please comment again!

  5. Actually one can assume that (while agreeing with the point Rick Wicklin is making) the distribution of birthdates in a group of n people can be modeled by the uniform distribution. Hence your sample(1:365,360) code.
    What is not uniformly distributed is the number of subjects in a group of n persons sharing the same month and day of birth.
    The same happens with a series of coin tosses actually:
    table(replicate(10^5, max(tabulate(sample(1:2,100,rep=TRUE)))))

  6. Charalampos Chanialidis Says:

    I think that you put one parenthesis more at the end of the third line.
    It should be rep=TRUE)))])) not rep=TRUE))))])).
    That said, I am still trying to figure it out.

    • Thanks, Charalampos, corrected! The !duplicated(…) part is selecting the (ordered) dates that have not yet appeared, then (1:360)[...] translates this into indices, and diff(..) gives the spread between two different and consecutive birthdates, hence the result…

    • Charalampos Chanialidis Says:

      Played around a bit in R and figured it out earlier, but thanks a lot for explaining it.

  7. Bill Huber Says:

    Nice work! Apropos the comment about trusting R code, I would like to share my experience with this problem. It began with a *theoretical calculation* (using independent Poisson distributions for the counts of individual birthdays) which, although not perfectly accurate (due to lack of independence), quickly provided a good sense of what a correct answer should look like. To support that calculation I ran (fast, easy) simulations on two different platforms (Mathematica and R). Remember, the software (usually!) works correctly; if it errs, the problem lies in what the programmer told it to do. The combination of theoretical calculations supported by simulations (rather than either one alone) can provide a high level of confidence and trustworthiness in the results.

  8. To answer the question: If you assume uniformity then the estimates of probability are lower bounds for the (real) case of a nonuniform distribution of birthdays. A readable, elementary, proof is Munford, 1977, TAS, 31(3), p. 119. For a discussion and presentation of real data for US births in 2002, see http://blogs.sas.com/content/iml/2011/09/09/the-most-likely-birthday-in-the-us/

    For an interesting variation, imagine a room that contains k people. What is the probability that two people in the room share the same initials (first and last)? This is described, with a heatmap graphic, at http://blogs.sas.com/content/iml/2011/01/14/two-letter-initials-which-are-the-most-common/

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 557 other followers