By Gary R. Moser, Director of Institutional Research and Planning, California State University Maritime Academy
During an IR practitioner’s typical work day (if there even is such a thing), one is unlikely to invent problems to solve just for fun; there are enough real problems to work on as it is. However, I am a believer in the importance of “Toy Models” (TMs) as a way to gain fluency with tools and methods that one might not otherwise encounter.
Toy models are great for getting your feet wet with new statistical methods, or as a way to develop a deeper, more intuitive understanding of ideas you’re already familiar with. They even have a place in our “real” work. For example, they can be used as a way to get a reality check on data that seem questionable. Not only is this a great way to learn new skills, but the ability to develop useful TMs is a skill unto itself that you should practice.
Toy Models have several useful properties that greatly accelerate learning:
They can be complex enough to be useful, but simple enough to avoid typical problems encountered with real data, such as missing values or sampling problems.
TMs can be built on data you make up yourself, or on a particularly well-suited existing data set.
TMs are low-stakes, which means you’ll be more willing to try new approaches or even try to work a problem multiple ways and compare the results. You can put it away and come back to it whenever you like.
They are interesting and fun; if not, you’re doing it wrong!
In this tip I’m going to practice working with dates, sequences, and various data objects in R. The specific software is not important here – try to replicate this example using your software of choice if it’s not R. The “problem” I’m going to create a simulation for is the very counter-intuitive probability that at least two people in a group will share a birthday (month/day), as a function of the group’s size.
For those unfamiliar with the R environment, I’ll be showing snippets of code from the script and snippets of the resulting output from the console.
First, I load the libraries that contain special functions I want to use:
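The original snippet isn’t reproduced here, so as a stand-in: everything below actually runs in base R (the auto-loaded stats package supplies pbirthday()), so no extra packages are strictly required. The seed value is my own choice, added so the random draws later on are repeatable.

```r
library(stats)  # loads automatically in R; named here only because pbirthday() lives in it
set.seed(2024)  # an assumed seed so the simulation below is reproducible
```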
Next, I calculate the probability that at least two people share a birthday in groups of size 20, 40, and 60.
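Base R’s stats package provides pbirthday() for exactly this calculation; a one-liner along these lines (the sapply wrapper is my own phrasing, not necessarily the original code) reproduces it:

```r
# P(at least two of n people share a birthday), with 365 equally likely
# birthdays (pbirthday's defaults: classes = 365, coincident = 2)
sapply(c(20, 40, 60), pbirthday)  # approx. 0.411, 0.891, 0.994
```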
...which produces the following probabilities (n=20: 41.1%, n=40: 89.1%, and n=60: 99.4%):
That’s very counter-intuitive! In a group of just 40 people there is an 89.1% chance that at least two people share a birthday. I believe the calculation is correct, but I just can’t resist simulating this with actual data to see for myself. Here, I create a vector of the dates that occur in a typical year and check to make sure it looks right:
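A minimal version of that step might look like this (I’m assuming a non-leap year, and the variable name days is my own):

```r
# one Date per day of a typical (non-leap) year
days <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")

length(days)  # 365
head(days)    # "2015-01-01" "2015-01-02" ...
```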
OK, this looks right:
Next, I initialize 3 “lists” (data objects to hold the sample data) then draw 10,000 samples of sizes 20, 40, and 60 and populate the lists with them:
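A sketch of that step, using replicate() in place of an explicit initialize-then-loop pattern, with variable names (days, samples20, and so on) that are my own guesses at the original:

```r
days <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")  # the birthday pool
n.samples <- 10000

# each list holds 10,000 samples drawn with replacement from the 365 dates
samples20 <- replicate(n.samples, sample(days, 20, replace = TRUE), simplify = FALSE)
samples40 <- replicate(n.samples, sample(days, 40, replace = TRUE), simplify = FALSE)
samples60 <- replicate(n.samples, sample(days, 60, replace = TRUE), simplify = FALSE)
```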
Let’s see the first sample of n=20 out of 10,000 samples:
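Inspecting the first element of the list does the trick; re-creating the setup here so the snippet stands on its own (names are mine):

```r
days <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")
samples20 <- replicate(10000, sample(days, 20, replace = TRUE), simplify = FALSE)

samples20[[1]]  # a vector of 20 Date values -- the first simulated "group"
```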
Now, I check for “duplicates” (i.e., people having the same birthday) within each sample. If there is a duplicate within a sample, a value of 1 is returned for that sample; if not, a value of 0. Finally, I average the resulting values (the sum divided by 10,000) to find the proportion of samples containing a shared birthday:
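One way to express that check, shown here for the n=20 list (anyDuplicated() is real base R; the helper and names are my own, and the setup is repeated so the snippet runs on its own):

```r
days <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")
samples20 <- replicate(10000, sample(days, 20, replace = TRUE), simplify = FALSE)

# anyDuplicated() returns 0 when all birthdays are distinct, so this
# yields 1 for a sample with a shared birthday and 0 otherwise
has.dup <- sapply(samples20, function(s) as.integer(anyDuplicated(s) > 0))
mean(has.dup)  # proportion of the 10,000 samples with a duplicate; should land near 0.41
```

The same two lines, repeated for samples40 and samples60, give the other two proportions.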
Sure enough, the approximate predicted proportions are what we observe:
…and a quick-and-dirty visualization to accompany it is shown below:
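The original figure isn’t reproduced here; a quick-and-dirty base-R stand-in (function and variable names are my own) compares the simulated proportions against pbirthday()’s theoretical values:

```r
days <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")

# simulated proportion of groups of size n containing a shared birthday
sim.prop <- function(n, reps = 10000)
  mean(replicate(reps, anyDuplicated(sample(days, n, replace = TRUE)) > 0))

sizes <- c(20, 40, 60)
props <- sapply(sizes, sim.prop)

barplot(props, names.arg = paste("n =", sizes), ylim = c(0, 1),
        ylab = "Proportion with a shared birthday")
# default barplot bar centers are at 0.7, 1.9, 3.1; dots mark the theory
points(c(0.7, 1.9, 3.1), sapply(sizes, pbirthday), pch = 19)
```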
That was really satisfying, and along the way I refreshed my memory on how to do a few things and even picked up a few new tricks. I hope the next time you find yourself craving a better understanding of some principle or method you’ll take the opportunity to model it for yourself!