This question came to me recently via email: What’s a good way to estimate the age of database constituents, when a birth date is missing? The person who asked me wanted to use ‘age’ in some predictive models for giving, but was missing a lot of birth date data.

This is an interesting problem, and finding an answer to it has practical implications. Age is probably the most significant predictor in most giving models. It might be negative in a donor-acquisition model, but positive in almost any other type (renewal, upgrade, major giving). For those of us in higher ed, ‘year of graduation’ is a good proxy for age just as it is. But if you want to include non-degreed alumni (without an ‘expected year of graduation’), friends of the university who are not spouses (you can guess spouse ages somewhat accurately), or other non-graduates, or if you work for a nonprofit or business that has only partial age data, then you might need to get creative.

Here’s a cool idea: A person’s first name can be indicative of his or her probable age. Most first names have varied widely in popularity over the years, and you can take advantage of that fact. Someone named Eldred is probably not a 20-something, while someone named Britney is probably not a retiree. Finding out what they probably ARE is something I’ve written about here: How to infer age when all you have is a name.

It’s simple. If you have age data for at least a portion of your database:

- Pull all first names of living individuals from your database, with their ages.
- Calculate the average (or median) age for each first name. (Example: The median age of the 371 Kevins in our database is 43.) This is a job for stats software.
- For any individual who is missing an age, assign them the average (or median) age of people with the same first name.
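If you don’t have a stats package handy, the three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure I actually ran: the function names and the sample records are made up, and I’m assuming names are matched case-insensitively and that a name with no known ages simply stays unfilled.

```python
from collections import defaultdict
from statistics import median

def median_age_by_name(records):
    """Map each first name (lower-cased) to the median of its known ages."""
    ages = defaultdict(list)
    for first_name, age in records:
        if age is not None:
            ages[first_name.strip().lower()].append(age)
    return {name: median(vals) for name, vals in ages.items()}

def fill_missing_ages(records, lookup):
    """Replace a missing age with the name-based guess (None if the name is unseen)."""
    filled = []
    for first_name, age in records:
        if age is None:
            age = lookup.get(first_name.strip().lower())
        filled.append((first_name, age))
    return filled

# Made-up example data: (first name, age or None)
people = [
    ("Kevin", 41), ("Kevin", 43), ("Kevin", 50),
    ("Britney", 24), ("Britney", 28),
    ("Kevin", None), ("Eldred", None),
]
lookup = median_age_by_name(people)
result = fill_missing_ages(people, lookup)
print(result[5])  # the Kevin with no age gets the median Kevin age
print(result[6])  # no other Eldreds to learn from, so still missing
```

Swapping `median` for `mean` gives the average-based version of the same guess.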

When I wrote my first post on this topic, I put the idea out there but didn’t actually test it. The method sounds approximate and unreliable, but I never checked, because I have no personal need for guessing ages: I’ve got birth dates for nearly every living alum.

Today I will address that omission.

I pulled a data file of about 104,000 living alumni, excluding any for whom we don’t have a birth date. All I requested was ID, First Name, and Age. (I also requested the sum of lifetime giving for each record, but I’ll get to that later.) I pasted these variables into a stats package (Data Desk), and then split the file into random halves of about 52,000 records each. I used only the first half to calculate the average age for each unique first name, rounding the average to the nearest whole number.

I then turned my attention to the ‘test’ half of the file. I tagged each ID with a ‘guessed age’, based on first name, as calculated using the first half of the file.
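For anyone who wants to repeat this kind of split-half test, here is a rough Python sketch of the idea. I did the real work in Data Desk, so treat this purely as an illustration: the function names, the fixed random seed, and the synthetic records are all assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import mean

def split_halves(records, seed=0):
    """Shuffle a copy of the records and cut it into two halves."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def average_age_by_name(train):
    """Average age per first name, rounded to a whole number.

    Note: Python's round() sends exact .5 cases to the nearest even integer.
    """
    ages = defaultdict(list)
    for _id, name, age in train:
        ages[name].append(age)
    return {name: round(mean(vals)) for name, vals in ages.items()}

def tag_guesses(test, lookup):
    """Attach a guessed age to each test record (None if the name was not in training)."""
    return [(rid, name, age, lookup.get(name)) for rid, name, age in test]

# Synthetic records: (ID, first name, age)
names = ["Kevin", "Britney", "Eldred", "Mary"]
records = [(i, names[i % 4], 20 + (i * 7) % 50) for i in range(100)]

train, test = split_halves(records, seed=42)
lookup = average_age_by_name(train)   # learned from the first half only
tagged = tag_guesses(test, lookup)    # applied to the held-out half
```

The point of the split is that the guesses for the test half never see the test half’s own ages.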

How did the guessed ages compare with people’s real ages?

I guessed the correct age for 3.5% of people in the test sample. That may not sound great, but I didn’t expect to be exactly right very often: I expected to be in the ballpark. In 17.5% of cases, I was correct to within plus or minus two years. In 37.6% of cases, I was correct to within plus or minus five years. And in 63.5% of cases, I was correct to within plus or minus 10 years. That’s the limit of what I would consider a reasonable guess. For what it’s worth, in order to reach 80% of cases, I would need to expand the acceptable margin of error to plus or minus 15 years, and a span of 30 years is a bit too broad to consider “close”.

I also calculated median age, just in case the median would be a better guess than the average. This time, I guessed the correct age in 3.7% of cases, just a little better than with the average, and that small edge held as I widened the margin of error. In 18.5% of cases, I was correct to within plus or minus two years. In 38.8% of cases, I was correct to within plus or minus five years. And in 64.1% of cases, I was correct to within plus or minus 10 years. So not much of a difference in accuracy between the two types of guesses.
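The “within plus or minus N years” figures are just the share of records whose absolute error is no bigger than N. A small sketch of that calculation, with made-up pairs of ages, might look like the following. One assumption to flag: records with no guess at all (a first name that never appeared in the training half) are skipped here, which may or may not suit your data.

```python
def share_within(pairs, margin):
    """Fraction of (actual, guessed) pairs with |actual - guessed| <= margin.

    Pairs with no guess (None) are left out of the calculation.
    """
    usable = [(a, g) for a, g in pairs if g is not None]
    if not usable:
        return 0.0
    hits = sum(1 for a, g in usable if abs(a - g) <= margin)
    return hits / len(usable)

# Made-up (actual age, guessed age) pairs
pairs = [(43, 43), (30, 32), (61, 55), (25, 40), (70, 58)]

for margin in (0, 2, 5, 10):
    print(f"within +/- {margin} years: {share_within(pairs, margin):.0%}")
```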

Here’s a chart showing the distribution of errors for the test half of the alumni sample (Actual Age minus Guessed Age), based on the median calculation:

The distribution seems slightly right-skewed, but in general a guess is about as likely to be “too old” as “too young.” Some errors are extreme, but they are relatively few in number. That is mostly because people live only so long, which sets a natural limit on how wrong I can be.

Accuracy would be nice, but a variable doesn’t need to be very accurate to be usable in a predictive model. Many inputs are not measured accurately, but we would never toss them out for that reason, if they were independent and had predictive power. Let’s see how a guessed-age variable compares to a true-age variable in a regression analysis. Here is the half of the sample for whom I used “true age”:

The dependent variable is ‘lifetime giving’ (log-transformed), and the sole predictor is ‘age’, which accounts for almost 15% of the variability in LTG (as we interpret the R squared statistic). It’s normal for age to play a huge part in any model trained on lifetime giving.
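If you’d like to run the same kind of one-predictor check without a stats package, R squared for a simple regression is just the squared correlation between the predictor and the outcome. Here is a minimal Python sketch. The data are made up, and the `log10(giving + 1)` transform is my assumption for the example (adding 1 so that non-donors don’t produce a log of zero); use whatever transform you normally use.

```python
import math

def r_squared(x, y):
    """R squared for a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Made-up ages and lifetime giving totals
ages = [25, 35, 45, 55, 65]
lifetime_giving = [0, 50, 20, 500, 1000]

# Assumed transform: log10 of (giving + 1)
log_ltg = [math.log10(g + 1) for g in lifetime_giving]

print(f"R squared: {r_squared(ages, log_ltg):.3f}")
```

Running it once with true ages and once with guessed ages gives the same sort of side-by-side comparison described here.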

Now we want to see the “test” half, for whom we only guessed at constituents’ ages. Here is a regression using guessed ages (based on the average age). The variable is named “avg age new”:

This tells me that a guessed age isn’t nearly as reliable as the real thing, which is not a big surprise. The model fit has dropped to an R squared of only .05 (5%). Still, that’s not bad. As well, the p-value is very small, which suggests the variable is significant, and not some random effect. It’s a lot better than having nothing at all.

Finally, for good measure, here’s another regression, this time using median age as the predictor. The result is practically the same.

If I had to use this trick, I probably would. But will it help *you*? That depends. What is significant in my model might not be in yours, and to be honest, with the large sample I have here, achieving “significance” isn’t that hard. If three-quarters of the records in your database are missing age data, this technique will give only a very fuzzy approximation of age and probably won’t be all that useful. If only one-quarter are missing, then I’d say go for it: This trick will perform much better than simply plugging in a constant value for all missing ages (which would be one lazy approach to missing data needed for a regression analysis).

Give it a try, and have fun with it.

**P.S.:** A late-coming additional thought. What if you compare these results with simply plugging in the average or median age for the sample? Using the sample average (46 years old):

- Exact age: correct 2.2% of the time (compared to 3.5% for the first-name trick)
- Within +/- 2 years: correct 11.1% of the time (compared to 17.5%)
- Within +/- 5 years: correct 24.4% of the time (compared to 37.6%)
- Within +/- 10 years: correct 46.5% of the time (compared to 63.5%)

Plugging in the median instead hardly makes a difference in age-guessing accuracy. So, the first-name trick would seem to be an improvement.
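A quick way to run this comparison on your own data is to score the constant guess and the name-based guess against the same set of true ages. The sketch below uses invented numbers throughout (including the training ages, which I picked so that they happen to average 46), so it shows only the mechanics, not my results.

```python
from statistics import mean

def hit_rate(actual, guesses, margin):
    """Share of guesses that land within +/- margin years of the actual age."""
    hits = sum(1 for a, g in zip(actual, guesses) if abs(a - g) <= margin)
    return hits / len(actual)

# Invented data for illustration only
training_ages = [22, 38, 47, 55, 68]          # ages used to compute the constant
actual_ages = [24, 31, 45, 52, 68, 73]        # true ages in the test group
name_guesses = [27, 35, 44, 60, 66, 70]       # guesses from the first-name trick

sample_average = round(mean(training_ages))   # one constant guess for everyone
constant_guesses = [sample_average] * len(actual_ages)

for margin in (0, 2, 5, 10):
    by_name = hit_rate(actual_ages, name_guesses, margin)
    by_constant = hit_rate(actual_ages, constant_guesses, margin)
    print(f"+/- {margin}: first name {by_name:.0%}, constant {by_constant:.0%}")
```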

The Social Security Administration publishes first age frequency by birth year from 1880-2010 in machine-readable format. Another option is multivariate imputation: predicting birth year based on first name from your house file, first name from the SSA, giving, and other available data.

Comment by heuristicandrew — 6 March 2012 @ 5:30 pm

Do you mean first “name” frequency? Cool.

Comment by kevinmacdonell — 6 March 2012 @ 5:32 pm

Oops, yes. Try this URL http://www.ssa.gov/oact/babynames/limits.html

Comment by heuristicandrew — 6 March 2012 @ 5:34 pm