CoolData blog

9 July 2010

How to infer age, when all you have is a name

Filed under: Coolness, External data, Non-university settings, Predictor variables — kevinmacdonell @ 6:02 am

I rarely post on a Friday, let alone a Friday in the middle of summer, but today’s cool idea is somewhat half-baked. Its very flakiness suits the day and the weather. Actually, I think it has potential, but I’m interested to know what others think.

For those of us in higher-ed fundraising, ‘age’ or ‘class year’ is a key predictor variable. Not everyone has this information in their databases, however. What if you could impute a “best guess” age based on a piece of data that you do have: first name?

Names go in and out of fashion. You may have played around with this cool tool for visualizing baby-name trends. My own first name, Kevin, peaked in popularity in the 1970s and has been on a downward slide ever since (chart here). I was born in 1969, so that’s pretty close. My father’s name, Leo, has not been popular since the 1920s (he was born in 1930), but is having a slight comeback in recent years (chart here).

As for female names, my mother’s name, Yvonne, never ranked in the top 1,000 in any time period covered by this visualization tool, so I’ll use my niece’s name: Katelyn. She was born in 2005. This chart shows that two common spellings of her name peaked around that year. (The axis labeling is a bit wonky — you’ll have to hover your cursor over the display to get a good read on the timing of the peak.)

You can’t look up every first name one by one, obviously, so you’ll need a data set from another source that relates the relative frequency of each first name to age. That sort of thing might be available in census data. But knowing somebody with access to a higher-ed database might be the easiest way.

I’ve run a query on our database, pulling just three fields for more than 87,000 alumni: ID (to ensure I have unique records), First Name, and Age. (Other databases will have only Class Year; we’re fortunate in that we’ve got birth dates for nearly every living alum.) Here are a few sample rows, with ID number blanked out:

From here, it’s a pretty simple matter to copy the data into stats software (or Excel) to compute counts and median ages for each first name. Amazingly, just six first names account for 10% of all living, contactable alumni! (In order: John, David, Michael, Robert, James, and Jennifer.)
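A concentration check like that is easy to sketch in Python. The snippet below is only an illustration: the function name `share_of_top_names` and the sample list are made up, and trimming and title-casing the names is an assumption of this sketch, not something the original query did.

```python
from collections import Counter


def share_of_top_names(first_names, k=6):
    """Return the k most common names and their combined share of all records."""
    # Normalize so "john", "John " and "JOHN" count as one name (an assumption).
    counts = Counter(name.strip().title() for name in first_names)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no names supplied")
    top = counts.most_common(k)
    return top, sum(count for _, count in top) / total


# Made-up sample, not the alumni data described in the post.
sample = ["John"] * 3 + ["david", "David "] + ["Zelda", "Max", "Leo", "Yvonne", "Kevin"]
top, share = share_of_top_names(sample, k=2)
print(top, f"{share:.0%}")
```

With real data you would pass in the whole First Name column and leave `k` at 6 to reproduce the kind of figure quoted above.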

On the other hand, a lot of first names are unique in the database, or nearly so. To simplify things a bit, I calculated median ages only for names represented five or more times in the database. These 1,380 first names capture the vast majority of alumni.
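If you would rather script this step than use stats software or Excel, here is a minimal sketch using only the Python standard library. The function name `median_age_by_name`, the name normalization, and the sample rows are all assumptions for illustration; only the "five or more" threshold comes from the text above.

```python
from collections import defaultdict
from statistics import median


def median_age_by_name(records, min_count=5):
    """Map each first name to (count, median age), keeping names seen min_count+ times."""
    ages = defaultdict(list)
    for first_name, age in records:
        # Trim and title-case so spelling variants in case/spacing are pooled (an assumption).
        ages[first_name.strip().title()].append(age)
    return {
        name: (len(values), median(values))
        for name, values in ages.items()
        if len(values) >= min_count
    }


# Made-up rows for illustration only; IDs are not needed once records are unique.
rows = [("Kevin", 39), ("kevin", 41), ("Kevin ", 43), ("Kevin", 45), ("Kevin", 50), ("Leo", 80)]
summary = median_age_by_name(rows)
print(summary)
```

In this toy sample the lone "Leo" is dropped because it appears fewer than five times, which is the same simplification described above.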

The ten “oldest” names in the database are listed in the chart below, in descending order by median age. Have a look at these venerable handles. Of these, only Max has staged a rebound in recent years (according to the Baby Names visualizer).

And here are the ten “youngest names,” in ascending order by median age. It’s an interesting coincidence that the very youngest name is Katelyn, the name of my five-year-old niece. One or two (such as Jake) were popular many years ago, and at least one has flipped gender from male to female (Whitney). Most of the others are new on the scene.
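Pulling out the “oldest” and “youngest” lists is just a sort on median age. A rough sketch follows; the function name `rank_names` is hypothetical, and the placeholder names and medians are invented for the example rather than taken from the charts.

```python
def rank_names(median_ages, n=10):
    """Return (oldest, youngest): the n names with the highest and lowest median ages."""
    by_age = sorted(median_ages.items(), key=lambda item: item[1])
    youngest = by_age[:n]        # ascending by median age
    oldest = by_age[::-1][:n]    # descending by median age
    return oldest, youngest


# Hypothetical medians, not the values behind the post's charts.
medians = {"Name A": 71, "Name B": 24, "Name C": 40, "Name D": 55}
oldest, youngest = rank_names(medians, n=2)
print(oldest)
print(youngest)
```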

The real test is, do these median ages actually provide reasonable estimates of age for people who aren’t in the database?

I’m not in the database (as an alum). There are 371 Kevins in the database, and their median age is 43. I turned 41 in May, so that’s very good.

My father is also not an alum. The 26 Leos in the database have a median age of 50, which is unfortunately 30 years too young. Let’s call that one a ‘miss’.

My mother’s ‘predicted’ age is off by half that, 15 years, which is not too bad.

Here’s how my three siblings’ names fare: Angela (predicted 36, actual 39 — very good), Paul (predicted 48, actual 38 — fair), and Francis (predicted 60, actual 36 — poor). Clearly there’s an issue with Francis, which according to the Baby Names chart tool was popular many decades ago but not when my brother was named. In other words, results for individuals may vary!
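To put a number on “results may vary,” the family examples above can be summarized as absolute errors. This is just a quick sketch: the predicted and actual ages are the ones quoted in this post, except that Leo’s actual age of 80 is inferred from “30 years too young,” and my mother is left out because only the size of her error is given.

```python
from statistics import mean, median

# (name, predicted age from the median, actual age) as quoted in the post.
# Leo's actual age of 80 is inferred from "30 years too young".
cases = [
    ("Kevin", 43, 41),
    ("Leo", 50, 80),
    ("Angela", 36, 39),
    ("Paul", 48, 38),
    ("Francis", 60, 36),
]

abs_errors = [abs(predicted - actual) for _, predicted, actual in cases]
mae = mean(abs_errors)
median_abs_error = median(abs_errors)
print(f"mean absolute error: {mae:.1f} years")
print(f"median absolute error: {median_abs_error} years")
```

On these five names the mean absolute error works out to 13.8 years and the median to 10 years. Five family members is far too small a sample to conclude much, but the same calculation on a proper holdout set would be a fairer test.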

So let’s say you’re a non-profit without access to age data for your database constituents. How does this help you? Well, it doesn’t, at least not directly. You will need to find a data partner at a university who will prepare a file of first names and median ages for you, just as I’ve done above. When you import that file into your model, you can match up records by first name and voilà, you’ve got a variable that gives you a rough estimate of age. (Sometimes very rough, but it’s better than nothing.)
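The matching step could look roughly like the sketch below. The function name `estimate_ages` and the constituent rows are made up; the medians for Kevin and Angela are the ones quoted earlier in this post. Matching on trimmed, lower-cased names is an assumption, and unmatched names are left as missing rather than guessed.

```python
def estimate_ages(constituents, name_medians):
    """Attach a median-age estimate to each (record_id, first_name) pair.

    Names with no match in the lookup get None rather than a guess.
    """
    # Case- and whitespace-insensitive lookup keyed by first name (an assumption).
    lookup = {name.strip().lower(): age for name, age in name_medians.items()}
    return [
        (record_id, first_name, lookup.get(first_name.strip().lower()))
        for record_id, first_name in constituents
    ]


# Medians for Kevin and Angela are quoted in the post; the constituent rows are made up.
name_medians = {"Kevin": 43, "Angela": 36}
constituents = [(1, "kevin"), (2, "ANGELA "), (3, "Eldred")]
matched = estimate_ages(constituents, name_medians)
print(matched)
```

Leaving unmatched records as missing keeps the rare names from being assigned a misleading age; how to treat those records is a modelling decision this sketch does not make.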

This is only an idea of mine. I don’t know if anyone has actually done this, so I’d be interested to hear from others. Here are a few additional thoughts:

  • There shouldn’t be any privacy concern — all you want is a list of first names and their median ages, NOT IDs or last names — but be sure to get all necessary approvals.
  • To anticipate your question, no, I won’t provide you my own file. I think you’d be much better off getting a names file from a university in your own city or region, which will provide a more accurate reflection of the ethnic flavour of your constituency.
  • I used “First name”, but of course universities collect first, middle and last names, and the formal first name might not be the preferred moniker. If the university database has a “Preferred first name” field that is fully populated, that might be a better option for matching with your first-name field.
  • Again, there might be more accessible sources of name- or age-related data out there. This idea just sounded fun to try!
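On the preferred-name point, one simple way to handle it is to fall back to the formal first name whenever the preferred name is blank. This is only a sketch: `name_for_matching` is a hypothetical helper, and the example names are invented.

```python
def name_for_matching(first_name, preferred_name=None):
    """Pick the name to match on: the preferred name if present, else the formal first name."""
    for candidate in (preferred_name, first_name):
        if candidate and candidate.strip():
            return candidate.strip().lower()
    return None  # nothing usable on this record


# Invented examples.
print(name_for_matching("Katherine", "Kate"))
print(name_for_matching("Leo", "  "))
```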

4 Comments »

  1. Personally, I’d be nervous about using that approach. Maybe also try creating a random prediction for each person, bounded within one standard deviation of your population’s mean age.

    Then compare the overall variance of the prediction errors (predicted minus actual) between the two processes.

    Another approach that definitely works involves using social relationships to infer likely age ranges based on interaction patterns. By doing it right, I’ve been able to get within a typical error of around ±2 years.

    Comment by Evan — 12 July 2010 @ 12:06 am

    • Good test idea. Another thought I had would be to compare the correlations between LT giving and actual age vs. LT giving and predicted age, within my own data. Your test is better. I think I already know what to expect from my own test.

      And cool approach to inferring age. Curious about where social relationship data would be available from, for the audience of this blog … many universities have in-house online communities, but universities already have age data. And Facebook does a better job for most alumni so in-house communities are not terribly well populated.

      Comment by kevinmacdonell — 12 July 2010 @ 7:09 am

      • You’re right, odds are good that it might not translate very effectively for your readers – it’s more applicable in telco / banking. However, I hope it’s still useful to know in the event that someone gets the chance to apply it! 🙂

        Comment by Evan — 15 July 2010 @ 2:03 am

  2. […] Here’s a cool idea: A person’s first name can be indicative of his or her probable age. Most first names have varied widely in popularity over the years, and you can take advantage of that fact. Someone named Eldred is probably not a 20-something, while someone named Britney is probably not a retiree. Finding out what they probably ARE is something I’ve written about here: How to infer age when all you have is a name. […]

    Pingback by Putting an age-guessing trick to the test « CoolData blog — 21 February 2012 @ 5:54 am

