CoolData blog

2 May 2013

New twists on inferring age from first name

Filed under: Analytics, Coolness, Data Desk, Fun — Tags: , , , — kevinmacdonell @ 6:14 am

Not quite three years ago I blogged about a technique for estimating the age of your database constituents when you don’t have any relevant data such as birth date or class year. It was based on the idea that many first names are typically “young” or “old.” I expanded on the topic in a followup post: Putting an age-guessing trick to the test. Until now, I’ve never had a reason to guess someone’s age — alumni data is pretty well supplied in that department. This very month, though, I have not one but two major modeling projects to work on that involve constituents with very little age data present. I’ve worked out a few improvements to the technique which I will share today.

First, here’s the gist of the basic idea. Picture two women, named Freda and Katelyn. Do you imagine one of them as older than the other? I’m guessing you do. From your own experience, you know that a lot of young women and girls are named Katelyn, and that few if any older women are. Even if you aren’t sure about Freda, you would probably guess she’s older. If you plug these names into babynamewizard.com, you’ll see that Freda was a very popular baby name in the early 1900s, but fell out of the Top 1000 list sometime in the 1980s. On the other hand, Katelyn didn’t enter the Top 1000 until the 1970s and is still popular.

To make use of this information you need to turn it into data. You need to acquire a lot of data on the frequency of first names and how young or old they tend to be. If you work for a university or other school, you’re probably in luck: You might have a lot of birth dates for your alumni or, failing that, you have class years which in most cases will be a good proxy for age. This will be the source you’ll use for guessing the age of everyone else in your database — friends, parents and other person constituents — who don’t have ages. If you have a donor database that contains no age data, you might be able to source age-by-first name data somewhere else.

Back to Freda and Katelyn … when I query our database I find that the average age of constituents named Freda is 69, while the average age for Katelyn is 25. For the purpose of building a model, for anyone named Freda without an age, I will just assume she is 69, and for anyone named Katelyn, 25. It’s as simple as creating a table with two columns (First name and Average age), and matching this to your data file via First Name. My table has more than 13,500 unique first names. Some of these are single initials, and not every person goes by their first name, but that doesn’t necessarily invalidate the average age associated with them.

I’ve tested this method, and it’s an improvement over plugging missing values with an all-database average or median age. For a data set that has no age data at all, it should provide new information that wasn’t there before — information that is probably correlated with behaviours such as giving.

Now here’s a new wrinkle.

In my first post on this subject, I noted that some of the youngest names in our database are “gender flips.” Some of the more recent popular names used to be associated with the opposite gender decades ago. This seems to be most prevalent with young female names: Ainslie, Isadore, Sydney, Shelly, Brooke. It’s harder to find examples going in the other direction, but there are a few, some of them perhaps having to do with differences in ethnic origin: Kori, Dian, Karen, Shaune, Mina, Marian. In my data I have close to 600 first names that belong to members of both sexes. When I calculate average age by First Name separately for each sex, some names end up with the exact same age for male and female. These names have an androgynous quality to them: Lyndsay, Riley, Jayme, Jesse, Jody. At the other extreme are the names that have definitely flipped gender, which I’ve already given examples of … one of the largest differences being for Ainslie. The average male named Ainslie is 54 years older than the average female of the same name. (In my data, that is.)

These differences suggest an improvement to our age-inferring method: Matching on not just First Name, but Sex as well. Although only 600 of my names are double-gendered, they include many popular names, so that they actually represent almost one-quarter of all constituents.

Now here’s another wrinkle.

When we’re dealing with constituents who aren’t alumni, we may be missing certain personal information such as Sex. If we plan to match on Sex as well as First Name, we’ve got a problem. If Name Prefix is present, we can infer from whether it’s Mr., Ms., etc., but unless the person doing the data entry was having an off day, this shouldn’t be an avenue available to us — it should already be filled in. (If you know it’s “Mrs.,” then why not put in F for Sex?) For those records without a Sex recorded (or have a Sex of ‘N’), we need to make a guess. To do so, we return to our First Names query and the Sex data we do have.

In my list of 600 first names that are double-gendered, not many are actually androgynous. We have females named John and Peter, and we have males named Mary and Laura, but we all know that given any one person named John, chances are we’re talking about a male person. Mary is probably female. These may be coding errors or they may be genuine, but in any case we can use majority usage to help us decide. We’ll sometimes get it wrong — there are indeed boys named Sue — but if you have 7,000 Johns in your database and only five of them are female, then let’s assume (just for the convenience of data mining*) that all Johns are male.

So: Query your database to retrieve every first name that has a Sex code, and count up the instance of each. The default sex for each first name is decided by the highest count, male or female. To get a single variable for this, I subtract the number of females from the number of males for each first name. Since the result is positive for males and negative for females, I call it a “Maleness Score” — but you can do the reverse and call it a Femaleness Score if you wish! Results of zero are considered ties, or ‘N’.

At this point we’ve introduced a bit of circularity. For any person missing Age and Sex, first we have to guess their sex based on the majority code assigned to that person’s first name, and then go back to the same data to grab the Age that matches up with Name and Sex. Clearly we are going to get it very wrong for a lot of records. You can’t expect these guesses to hold up as well as true age data. Overall, though, there should be some signal in all that noise … if your model believes that “Edgar” is male and 72 years of age, and that “Brittany” is female and 26, well, that’s not unreasonable and it’s probably not far from the truth.

How do we put this all together? I build my models in Data Desk, so I need to get all these elements into my data file as individual variables. You can do this any way that works for you, but I use our database querying software (Hyperion Brio). I import the data into Brio as locally-saved tab-delimited files and join them up as you see below. The left table is my modeling data (or at least the part of it that holds First Name), and the two tables on the right hold the name-specific ages and sexes from all the database records that have this information available. I left-join each of these tables on the First Name field.

age_tablesWhen I process the query, I get one row per ID with the fields from the left-hand table, plus the fields I need from the two tables on the right: the so-called Maleness Score, Female Avg Age by FName, Male Avg Age by Fname, and N Avg Age by Fname. I can now paste these as new variables into Data Desk. I still have work to do, though: I do have a small amount of “real” age data that I don’t want to overwrite, and not every First Name has a match in the alumni database. I have to figure out what I have, what I don’t have, and what I’m going to do to get a real or estimated age plugged in for every single record. I write an expression called Age Estimated to choose an age based on a hierarchical set of IF statements. The text of my expression is below — I will explain it in plain English following the expression.

if len('AGE')>0 then 'AGE'

else if textof('SEX')="M" and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="M" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="M" and len('F avg age by Fname')>0 then 'F avg age by Fname'

else if textof('SEX')="F" and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="F" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="F" and len('M avg age by Fname')>0 then 'M avg age by Fname'

else if textof('SEX')="N" and 'Maleness score'>0 and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'<0 and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'=0 and len('N avg age by Fname')>0 then 'N avg age by Fname'

else if len('N avg age by Fname')>0 then 'N avg age by Fname'
else if len('F avg age by Fname')>0 then 'F avg age by Fname'
else if len('M avg age by Fname')>0 then 'M avg age by Fname'

else 49

Okay … here’s what the expression actually does, going block by block through the statements:

  1. If Age is already present, then use that — done.
  2. Otherwise, if Sex is male, and the average male age is available, then use that. If there’s no average male age, then use the ‘N’ age, and if that’s not available, use the female average age … we can hope it’s better than no age at all.
  3. Otherwise if Sex is female, and the average female age is available, then use that. Again, go with any other age that’s available.
  4. Otherwise if Sex is ‘N’, and the Fname is likely male (according to the so-called Maleness Score), then use the male average age, if it’s available. Or if the first name is probably female, use the female average age. Or if the name is tied male-female, use the ‘N’ average age.
  5. Otherwise, as it appears we don’t have anything much to go on, just use any available average age associated with that first name: ‘N’, female, or male.
  6. And finally, if all else fails (which it does for about 6% of my file, or 7,000 records), just plug in the average age of every constituent in the database who has an age, which in our case is 49. This number will vary depending on the composition of your actual data file — if it’s all Parents, for example, then calculate the average of Parents’ known ages, excluding other constituent types.

When I bin the cases into 20 roughly equal groups by Estimated Age, I see that the percentage of cases that have some giving history starts very low (about 3 percent for the youngest group), rises rapidly to more than 10 percent, and then gradually rises to almost 18 percent for the oldest group. That’s heading in the right direction at least. As well, being in the oldest 5% is also very highly correlated with Lifetime Giving, which is what we would expect from a donor data set containing true ages.

est_age_vingt

This is a bit of work, and probably the gain will be marginal a lot of the time. Data on real interactions that showed evidence of engagement would be superior to age-guessing, but when data is scarce a bit of added lift can’t hurt. If you’re concerned about introducing too much noise, then build models with and without Estimated Age, and evaluate them against each other. If your software offers multiple imputation for missing data as a feature, try checking that out … what I’m doing here is just a very manual form of multiple imputation — calculating plausible values for missing data based on the values of other variables. Be careful, though: A good predictor of Age happens to be Lifetime Giving, and if your aim is to predict Giving, I should think there’s a risk your model will suffer from feedback.

* One final note …

Earlier on I mentioned assuming someone is male or female “just for the convenience of data mining.”  In our databases (and in a conventional, everyday sense too), we group people in various ways — sex, race, creed. But these categories are truly imperfect summaries of reality. (Some more imperfect than others!) A lot of human diversity is not captured in data, including things we formerly thought of as clear-cut. Sex seems conveniently binary, but in reality it is multi-category, or maybe it’s a continuous variable. (Or maybe it’s too complex for a single variable.) In real life I don’t assume that when someone in the Registrar’s Office enters ‘N’ for Sex that the student’s data is merely missing. Because the N category is still such a small slice of the population I might treat it as missing, or reapportion it to either Male or Female as I do here. But that’s strictly for predictive modeling. It’s not a statement about transgendered or differently gendered people nor an opinion about where they “belong.”

21 February 2012

Putting an age-guessing trick to the test

Filed under: Alumni, Predictor variables — Tags: , , , , — kevinmacdonell @ 5:53 am

This question came to me recently via email: What’s a good way to estimate the age of database constituents, when a birth date is missing? The person who asked me wanted to use ‘age’ in some predictive models for giving, but was missing a lot of birth date data.

This is an interesting problem, and finding an answer to it has practical implications. Age is probably the most significant predictor in most giving models. It might be negative in a donor-acquisition model, but positive in almost any other type (renewal, upgrade, major giving). For those of us in higher ed, ‘year of graduation’ is a good proxy for age just as it is. But if you want to include non-degreed alumni (without an ‘expected year of graduation’), friends of the university who are not spouses (you can guess spouse ages somewhat accurately), or other non-graduates, or if you work for a nonprofit or business that has only partial age data, then you might need to get creative.

Here’s a cool idea: A person’s first name can be indicative of his or her probable age. Most first names have varied widely in popularity over the years, and you can take advantage of that fact. Someone named Eldred is probably not a 20-something, while someone named Britney is probably not a retiree. Finding out what they probably ARE is something I’ve written about here: How to infer age when all you have is a name.

It’s simple. If you have age data for at least a portion of your database:

  1. Pull all first names of living individuals from your database, with their ages.
  2. Calculate the average (or median) age for each first name. (Example: The median age of the 371 Kevins in our database is 43.) This is a job for stats software.
  3. For any individual who is missing an age, assign them the average (or median) age of people with the same first name.

When I wrote my first post on this topic, I put the idea out there but didn’t actually test it. It sounds approximate and unreliable, but I didn’t test it because I have no personal need for guessing ages: I’ve got birth dates for nearly every living alum.

Today I will address that omission.

I pulled a data file of about 104,000 living alumni, excluding any for whom we don’t have a birth date. All I requested was ID, First Name, and Age. (I also requested the sum of lifetime giving for each record, but I’ll get to that later.) I pasted these variables into a stats package (Data Desk), and then split the file into random halves of about 52,000 records each. I used only the first half to calculate the average age for each unique first name, rounding the average to the nearest whole number.

I then turned my attention to the ‘test’ half of the file. I tagged each ID with a ‘guessed age’, based on first name, as calculated using the first half of the file.

How did the guessed ages compare with peoples’ real ages?

I guessed the correct age for 3.5% of people in the test sample. That’s may not sound great, but I didn’t expect to be exactly right very often: I expected to be in the ballpark. In 17.5% of cases, I was correct to within plus or minus two years. In 37.6% of cases, I was correct to within plus or minus five years. And in 63.5% of cases, I was correct to within plus or minus 10 years. That’s the limit of what I would consider a reasonable guess. For what it’s worth, in order to reach 80% of cases, I would need to expand the acceptable margin of error to plus or minus 15 years — a span of 30 years is a bit too broad to consider “close”.

I also calculated median age, just in case the median would be a better guess than the average. This time, I guessed the correct age in 3.7% of cases — just a little better than when I used the average, which was also true as I widened the margin of error. In 18.5% of cases, I was correct to within plus or minus two years. In 38.8% of cases, I was correct to within plus or minus five years. And in 64.1% of cases, I was correct to within plus or minus 10 years. So not much of a difference in accuracy between the two types of guesses.

Here’s a chart showing the distribution of errors for the test half of the alumni sample (Actual Age minus Guessed Age), based on the median calculation:

The distribution seems slightly right-skewed, but in general a guess is about as likely to be “too old” as “too young.” Some errors are extreme, but they are relatively few in number. That has more to do with the fact that people live only so long, which sets a natural limit on how wrong I can be.

Accuracy would be nice, but a variable doesn’t need to be very accurate to be usable in a predictive model. Many inputs are not measured accurately, but we would never toss them out for that reason, if they were independent and had predictive power. Let’s see how a guessed-age variable compares to a true-age variable in a regression analysis. Here is the half of the sample for whom I used “true age”:

The dependent variable is ‘lifetime giving’ (log-transformed), and the sole predictor is ‘age’, which accounts for almost 15% of the variability in LTG (as we interpret the R squared statistic). It’s normal for age to play a huge part in any model trained on lifetime giving.

Now we want to see the “test” half, for whom we only guessed at constituents’ ages. Here is a regression using guessed ages (based on the average age). The variable is named “avg age new”:

This tells me that a guessed age isn’t nearly as reliable as the real thing, which is not a big surprise. The model fit has dropped to an R squared of only .05 (5%). Still, that’s not bad. As well, the p-value is very small, which suggests the variable is significant, and not some random effect. It’s a lot better than having nothing at all.

Finally, for good measure, here’s another regression, this time using median age as the predictor. The result is practically the same.

If I had to use this trick, I probably would. But will it help you? That depends. What is significant in my model might not be in yours, and to be honest, with the large sample I have here, achieving “significance” isn’t that hard. If three-quarters of the records in your database are missing age data, this technique will give only a very fuzzy approximation of age and probably won’t be all that useful. If only one-quarter are missing, then I’d say go for it: This trick will perform much better than simply plugging in a constant value for all missing ages (which would be one lazy approach to missing data needed for a regression analysis).

Give it a try, and have fun with it.

P.S.: A late-coming additional thought. What if you compare these results with simply plugging in the average or median age for the sample? Using the sample average (46 years old):

  • Exact age: correct 2.2% of the time (compared to 3.5% for the first-name trick)
  • Within +/- 2 years: correct 11.1% of the time (compared to 17.5%)
  • Within +/- 5 years: correct 24.4% of the time (compared to 37.6%)
  • Within +/- 10 years: correct 46.5% of the time (compared to 63.5%)

Plugging in the median instead hardly makes a difference in age-guessing accuracy. So, the first-name trick would seem to be an improvement.

1 February 2010

Surveys and missing data

Filed under: Model building, Pitfalls, Surveying — Tags: , , , , — kevinmacdonell @ 9:24 pm

Survey data is valuable for predicting mass behaviours. Inevitably, though, some pieces will be missing. (Photo by Photo Monkey, used by Creative Commons license. Click photo for more.)

In a previous post I talked about the great predictive power of survey responses. Today I’ll explain what to do about one of the roadblocks you’ll encounter – missing data.

The problem of missing data is a big issue in statistics, and a number of techniques are available for dealing with it. The ideas I offer here may or may not meet the high standards of a statistician, but they do offer some more or less reasonable solutions.

Let’s say you’re interested in creating only one variable from the survey, an indicator variable which records whether a person participated or not. The mere presence or absence of this data point will probably be predictive. You could code all responders as ‘1’, and everyone else as ‘0’. This would work well if a large portion of your sample received an invite.

Alternately, you could code responders as ‘1’, non-responders as ‘-1’, and put the “uninvited,” everyone who didn’t receive an invitation, into the neutral middle zone by giving them a zero. To avoid negative numbers, just add one to each of these values; it may seem strange to reward the uninvited a ‘1’, but what you’re trying to do here is see if they differ from the people who actually had a chance to participate and chose not to: An action taken, in a negative direction.

Use only one or the other of these two variables, whichever one ‘works’. Test both against mean and median lifetime giving (or whatever your dependent variable is). If the three-level variable shows a nice linear relationship with your DV – with low giving for the non-responders, higher giving for the uninvited, and highest giving for the responders – then use that variable in your regression.

Statistically sound? Perhaps not. But if the alternative is tossing out a potentially valuable predictor, I don’t see the harm.

That covers missing data for the simple fact of participation / non-participation in a survey. You can go much deeper than that. A typical survey of alumni will yield many potential predictor variables. If your survey is getting at attitudes about your institution, or about giving or volunteering, or attending events, responses to individual questions can be powerfully predictive. Again, if you’re using regression, all you need to do is find a logical way to re-express the response as a number.

For example, you can recode yes/no questions as 1/0 indicator variables. If the responses to a question are categorical in nature (for a question such as, “What is your mother’s nationality?”), you may wish to test indicator variables for the various responses, and along with that, an indicator variable for “Did not answer the question.” In such a case, missing data may have its own underlying pattern (i.e. it is non-random), and may correlate with the value you’re trying to predict.

Sometimes the data is already expressed as a number. Surveys often use Likert scale questions, in which responders are asked to rate their level of agreement to a statement. Typical responses might range from 1 (“strongly disagree”) to 5 or 6 (“strongly agree”). Likert scales are, strictly speaking, ordinal in nature, not continuous: There’s no reason to believe that the “distance” between ‘1’ and ‘2’ is the same as the “distance” between ‘2’ and ‘3’. However, I accept them as a logical ranking, perfectly suitable for a regression independent variable.

(Incidentally, a survey designer will use an even-numbered scale if he wants to disallow neutral responses, a practice which gets an answer but sometimes causes frustrated survey-takers to skip questions.)

So: rich numerical data, ready to plug into our analysis, but only one problem: Missing data. Some people skip questions, some fail to take the survey, others were never invited in the first place. This time, you can’t just plug a zero into every empty space. Innocent non-responders would come off as a very negative bunch, completely throwing off your predictor. But again, if you don’t have some type of number present for that variable for all cases in your dataset, they’ll be excluded from the regression analysis. At the risk of oversimplifying, I would say you’ve got three options, with three levels of sophistication:

  1. Substitution of a neutral value.
  2. Substitution of a mean value.
  3. Imputation.

1. Neutral-value substitution. This is the method I used the first time I incorporated a lot of Likert-scale type data into a predictive model. It was very simple. Every person with a missing value for a given question received a value falling perfectly halfway between “strongly disagree” and “strongly agree.” For a scale of 1 to 6, that value is 3.5. Of course, ‘3.5’ was not a possible choice on the survey itself, which forced respondents to commit to a slightly negative or positive response, but that didn’t mean I couldn’t use the middle value to replace our unknown values.

There is one problem with this method, though … if you think about it, what’s so ‘neutral’ about 3.5? If you took all the actual responses and calculated the average, it might be significantly higher or lower than 3.5. Let’s say actual respondents had an average response of 5 for a particular question. If we code everyone else as 3.5, that’s characterizing them as negative, in relation to the respondents. We may have no basis for doing so.

2. Mean-value substitution. The problem I describe above can be addressed by mean-value substitution, which is the method I perhaps should have used. If the average response for your actual respondents is 2.67, then substitute 2.67 for all your missing values. If it’s 5, use 5. (If your response data is not Likert-scale in nature, but rather contains extreme values, use the median value for the variable rather than the average value.)

3. Imputation. This term is used to describe a variety of related methods for guessing the “most likely” missing value based on the values found in other variables. These methods include some advanced options made available in software such as SAS and SPSS.

The third option may be regarded as the best from a statistical point of view. Alas, I have not used these more advanced techniques. I can only speak from my experience with the first two. For now at least, I accept the drawbacks of substituting the population mean for missing data (one drawback being a gross underestimation of variance), in order for me to quickly and easily tap the power of survey data in my models.

What would YOU do?

Create a free website or blog at WordPress.com.