CoolData blog

2 May 2013

New twists on inferring age from first name

Filed under: Analytics, Coolness, Data Desk, Fun — Tags: , , , — kevinmacdonell @ 6:14 am

Not quite three years ago I blogged about a technique for estimating the age of your database constituents when you don’t have any relevant data such as birth date or class year. It was based on the idea that many first names are typically “young” or “old.” I expanded on the topic in a followup post: Putting an age-guessing trick to the test. Until now, I’ve never had a reason to guess someone’s age — alumni data is pretty well supplied in that department. This very month, though, I have not one but two major modeling projects to work on that involve constituents with very little age data present. I’ve worked out a few improvements to the technique which I will share today.

First, here’s the gist of the basic idea. Picture two women, named Freda and Katelyn. Do you imagine one of them as older than the other? I’m guessing you do. From your own experience, you know that a lot of young women and girls are named Katelyn, and that few if any older women are. Even if you aren’t sure about Freda, you would probably guess she’s older. If you plug these names into, you’ll see that Freda was a very popular baby name in the early 1900s, but fell out of the Top 1000 list sometime in the 1980s. On the other hand, Katelyn didn’t enter the Top 1000 until the 1970s and is still popular.

To make use of this information you need to turn it into data. You need to acquire a lot of data on the frequency of first names and how young or old they tend to be. If you work for a university or other school, you’re probably in luck: You might have a lot of birth dates for your alumni or, failing that, you have class years which in most cases will be a good proxy for age. This will be the source you’ll use for guessing the age of everyone else in your database — friends, parents and other person constituents — who don’t have ages. If you have a donor database that contains no age data, you might be able to source age-by-first name data somewhere else.

Back to Freda and Katelyn … when I query our database I find that the average age of constituents named Freda is 69, while the average age for Katelyn is 25. For the purpose of building a model, for anyone named Freda without an age, I will just assume she is 69, and for anyone named Katelyn, 25. It’s as simple as creating a table with two columns (First name and Average age), and matching this to your data file via First Name. My table has more than 13,500 unique first names. Some of these are single initials, and not every person goes by their first name, but that doesn’t necessarily invalidate the average age associated with them.

I’ve tested this method, and it’s an improvement over plugging missing values with an all-database average or median age. For a data set that has no age data at all, it should provide new information that wasn’t there before — information that is probably correlated with behaviours such as giving.

Now here’s a new wrinkle.

In my first post on this subject, I noted that some of the youngest names in our database are “gender flips.” Some of the more recent popular names used to be associated with the opposite gender decades ago. This seems to be most prevalent with young female names: Ainslie, Isadore, Sydney, Shelly, Brooke. It’s harder to find examples going in the other direction, but there are a few, some of them perhaps having to do with differences in ethnic origin: Kori, Dian, Karen, Shaune, Mina, Marian. In my data I have close to 600 first names that belong to members of both sexes. When I calculate average age by First Name separately for each sex, some names end up with the exact same age for male and female. These names have an androgynous quality to them: Lyndsay, Riley, Jayme, Jesse, Jody. At the other extreme are the names that have definitely flipped gender, which I’ve already given examples of … one of the largest differences being for Ainslie. The average male named Ainslie is 54 years older than the average female of the same name. (In my data, that is.)

These differences suggest an improvement to our age-inferring method: Matching on not just First Name, but Sex as well. Although only 600 of my names are double-gendered, they include many popular names, so that they actually represent almost one-quarter of all constituents.

Now here’s another wrinkle.

When we’re dealing with constituents who aren’t alumni, we may be missing certain personal information such as Sex. If we plan to match on Sex as well as First Name, we’ve got a problem. If Name Prefix is present, we can infer from whether it’s Mr., Ms., etc., but unless the person doing the data entry was having an off day, this shouldn’t be an avenue available to us — it should already be filled in. (If you know it’s “Mrs.,” then why not put in F for Sex?) For those records without a Sex recorded (or have a Sex of ‘N’), we need to make a guess. To do so, we return to our First Names query and the Sex data we do have.

In my list of 600 first names that are double-gendered, not many are actually androgynous. We have females named John and Peter, and we have males named Mary and Laura, but we all know that given any one person named John, chances are we’re talking about a male person. Mary is probably female. These may be coding errors or they may be genuine, but in any case we can use majority usage to help us decide. We’ll sometimes get it wrong — there are indeed boys named Sue — but if you have 7,000 Johns in your database and only five of them are female, then let’s assume (just for the convenience of data mining*) that all Johns are male.

So: Query your database to retrieve every first name that has a Sex code, and count up the instance of each. The default sex for each first name is decided by the highest count, male or female. To get a single variable for this, I subtract the number of females from the number of males for each first name. Since the result is positive for males and negative for females, I call it a “Maleness Score” — but you can do the reverse and call it a Femaleness Score if you wish! Results of zero are considered ties, or ‘N’.

At this point we’ve introduced a bit of circularity. For any person missing Age and Sex, first we have to guess their sex based on the majority code assigned to that person’s first name, and then go back to the same data to grab the Age that matches up with Name and Sex. Clearly we are going to get it very wrong for a lot of records. You can’t expect these guesses to hold up as well as true age data. Overall, though, there should be some signal in all that noise … if your model believes that “Edgar” is male and 72 years of age, and that “Brittany” is female and 26, well, that’s not unreasonable and it’s probably not far from the truth.

How do we put this all together? I build my models in Data Desk, so I need to get all these elements into my data file as individual variables. You can do this any way that works for you, but I use our database querying software (Hyperion Brio). I import the data into Brio as locally-saved tab-delimited files and join them up as you see below. The left table is my modeling data (or at least the part of it that holds First Name), and the two tables on the right hold the name-specific ages and sexes from all the database records that have this information available. I left-join each of these tables on the First Name field.

age_tablesWhen I process the query, I get one row per ID with the fields from the left-hand table, plus the fields I need from the two tables on the right: the so-called Maleness Score, Female Avg Age by FName, Male Avg Age by Fname, and N Avg Age by Fname. I can now paste these as new variables into Data Desk. I still have work to do, though: I do have a small amount of “real” age data that I don’t want to overwrite, and not every First Name has a match in the alumni database. I have to figure out what I have, what I don’t have, and what I’m going to do to get a real or estimated age plugged in for every single record. I write an expression called Age Estimated to choose an age based on a hierarchical set of IF statements. The text of my expression is below — I will explain it in plain English following the expression.

if len('AGE')>0 then 'AGE'

else if textof('SEX')="M" and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="M" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="M" and len('F avg age by Fname')>0 then 'F avg age by Fname'

else if textof('SEX')="F" and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="F" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="F" and len('M avg age by Fname')>0 then 'M avg age by Fname'

else if textof('SEX')="N" and 'Maleness score'>0 and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'<0 and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'=0 and len('N avg age by Fname')>0 then 'N avg age by Fname'

else if len('N avg age by Fname')>0 then 'N avg age by Fname'
else if len('F avg age by Fname')>0 then 'F avg age by Fname'
else if len('M avg age by Fname')>0 then 'M avg age by Fname'

else 49

Okay … here’s what the expression actually does, going block by block through the statements:

  1. If Age is already present, then use that — done.
  2. Otherwise, if Sex is male, and the average male age is available, then use that. If there’s no average male age, then use the ‘N’ age, and if that’s not available, use the female average age … we can hope it’s better than no age at all.
  3. Otherwise if Sex is female, and the average female age is available, then use that. Again, go with any other age that’s available.
  4. Otherwise if Sex is ‘N’, and the Fname is likely male (according to the so-called Maleness Score), then use the male average age, if it’s available. Or if the first name is probably female, use the female average age. Or if the name is tied male-female, use the ‘N’ average age.
  5. Otherwise, as it appears we don’t have anything much to go on, just use any available average age associated with that first name: ‘N’, female, or male.
  6. And finally, if all else fails (which it does for about 6% of my file, or 7,000 records), just plug in the average age of every constituent in the database who has an age, which in our case is 49. This number will vary depending on the composition of your actual data file — if it’s all Parents, for example, then calculate the average of Parents’ known ages, excluding other constituent types.

When I bin the cases into 20 roughly equal groups by Estimated Age, I see that the percentage of cases that have some giving history starts very low (about 3 percent for the youngest group), rises rapidly to more than 10 percent, and then gradually rises to almost 18 percent for the oldest group. That’s heading in the right direction at least. As well, being in the oldest 5% is also very highly correlated with Lifetime Giving, which is what we would expect from a donor data set containing true ages.


This is a bit of work, and probably the gain will be marginal a lot of the time. Data on real interactions that showed evidence of engagement would be superior to age-guessing, but when data is scarce a bit of added lift can’t hurt. If you’re concerned about introducing too much noise, then build models with and without Estimated Age, and evaluate them against each other. If your software offers multiple imputation for missing data as a feature, try checking that out … what I’m doing here is just a very manual form of multiple imputation — calculating plausible values for missing data based on the values of other variables. Be careful, though: A good predictor of Age happens to be Lifetime Giving, and if your aim is to predict Giving, I should think there’s a risk your model will suffer from feedback.

* One final note …

Earlier on I mentioned assuming someone is male or female “just for the convenience of data mining.”  In our databases (and in a conventional, everyday sense too), we group people in various ways — sex, race, creed. But these categories are truly imperfect summaries of reality. (Some more imperfect than others!) A lot of human diversity is not captured in data, including things we formerly thought of as clear-cut. Sex seems conveniently binary, but in reality it is multi-category, or maybe it’s a continuous variable. (Or maybe it’s too complex for a single variable.) In real life I don’t assume that when someone in the Registrar’s Office enters ‘N’ for Sex that the student’s data is merely missing. Because the N category is still such a small slice of the population I might treat it as missing, or reapportion it to either Male or Female as I do here. But that’s strictly for predictive modeling. It’s not a statement about transgendered or differently gendered people nor an opinion about where they “belong.”

24 April 2012

Data I want to play with

Filed under: Data, Fun — Tags: — kevinmacdonell @ 5:23 am

Guest post by Marianne M. Pelletier, Director of Advancement Research and Data Support, Cornell University

In my present job, I deal with a whole lot of data – over 2,000 fields of data on gifts, names, addresses, relationships, segmenting codes, dates, attributes, interests, contacts, you name it. Yet getting to play in this playground as a donor modeler only leaves me lusting for other kinds of data to play with, so much that my hobbies often lead me to places where data lives so I can fool with it. This short article is my wish list, whether or not I’ll ever get to mine any of it.

Horse Races are tracked to the umpteenth degree by handicappers. Buy a copy of the Daily Racing Form and you’ll see more statistics presented than you can read in a week. DRF also has a web page where you can download even more statistics – tracking the horses’ pedigree generations back in time and the jockey’s entire career, ride by ride. So what do I do? I spend some Sundays diligently typing key statistics into a homemade database, along with the race results, to see if I can find the regression formula that would make me more money than just following the program picks. The answer? So far, on maiden sprints on dirt, the horse that had the fastest workout is most likely to win. For every other kind of race, I’m still wishing to buy the data in a format I can manipulate instead of having to type it.

Speaking of gambling, I’d give my remaining eye tooth to play in Harrah’s data. Harrah is an incredibly good marketing firm, from offering me a free weekend to their new casino in some remote place to being the only game in town that offers $10 craps all weekend long. Imagine if you will getting to download affinity player card data and tracking where a person wanders in the casino – how many mix slots with table play? How many are single game players? What if the casino moved the buffet closer to Keno? What’s the best game to put right inside the valet parking entrance? Do the longer, red craps tables make one bet more or lose more? Or play longer? What is the average time for a player at a blackjack table? What if she’s drinking alcohol? What if she’s an awards card member? What if the player is male? What if the dealer is the same gender as the player? I’d be a kid in a candy store to get a contract to work data like that.

On the other side of the coin, what is the effect of parking availability on local business? Wouldn’t it be fun to figure out the dependent variable on that? Ithaca recently changed its parking rates from the first hour free to charging for every hour. Was it that or the longstanding recession that caused local businesses to disappear? Or is the turnover normal? Would I have to study when the students are in town vs. when they are gone? Would local businesses share their profit numbers with me?

And then there’s the whole thing about the best time of year to go to Disney World. I’d want to offer Disney a study of some kind (like, which ride should go next to the Small World ride?) in order to get data on when I’m most likely to enjoy good weather, a maximum number of rides open, and the fewest number of screaming children and strollers under my feet.

And speaking of flying somewhere, I’d love for Delta to hire me to study when people want to fly somewhere. All that Expedia/Travelocity search data – does anyone use it? After all, what if airlines could arrange that people in Boston can fly midmorning but people in New York can fly at night? What if there were one extra flight at 11:00 am from somewhere that would double an airline’s traffic because of the ripple effect? I’d love to be the one who discovers that.

Lastly, who can resist wishing to forecast forex? The currency exchange market is very likely very well tested by experts, but not by me. What if I could predict the day of week and time of day that the Euro drifts off against the dollar? I’d place my bet once a week and then go off to the casino. Or Disney. Or shopping. Oh, bother! It all looks like there’s data teeming everywhere, everywhere, and I’m only going to live so long.

1 February 2012

Where’s your institution on the Culture of Analytics Ladder?

Filed under: Fun, predictive analytics, Why predictive modeling? — Tags: , , — kevinmacdonell @ 2:21 pm

I’m laying on the couch with a bad head cold, and there’s a mix of snow and rain in the forecast. Time to curl up with my laptop and a cup of tea. I’ve got a question for you!

Not long ago I asked you to give me examples of institutions you’re aware of that are shining examples of institution-wide data-driven decision making. I was grateful for the responses, but no single institution was named twice. A few people offered an opinion about how their own organizations size up, which I found interesting.

So let’s explore that a bit more with a quick and anonymous poll: Where do you think your non-profit organization or institution fits on the Culture of Analytics Ladder? (That’s CoAL for short … but no need to be so formal. I totally made this up under the influence of cold medication.) Don’t over think it. Just pick whatever stage you feel your org or institution occupies.

The categories may seem a bit vague. If it’s any help, by “analysis” or “analytics” I am referring to the process of sifting through large quantities of data in search of patterns that lead to insights, primarily about your constituents. I am NOT referring to reporting. In fact I want you to ignore a lot of the day-to-day processes that involve data but are not really “analysis,” including: data entry, gift accounting, appeal segmentation, reporting on historical results, preparation of financials, and so on.

I am thinking more along the lines of modelling for the prediction of behaviours (which group of constituents are most likely to engage in such-and-so a behaviour?), prediction of future results (i.e., forecasting), open-ended exploration of constituent data in search of “clusters”, and and any other variety of data work that would be called on to make a decision about what to do in the future, as opposed to documenting what happened in the past. I am uncertain whether A/B split testing fits my definition of analysis, but let’s be generous and say that it does.

A couple of other pointers:

  • If you work for, say, a large university advancement department and aren’t sure whether analytics is used in other departments such as student admissions or recruitment, then answer just for your department. Same thing if you work for a regional office of a large non-profit and aren’t sure about the big picture.
  • If you have little or no in-house expertise, but occasionally hire a vendor to produce predictive modelling scores, then you might answer “6″ — but only if those scores are actually being well used.

Here we go.

17 November 2011

Add life to numbers with Excel colour scales

Filed under: Excel, Fun, Visualization — kevinmacdonell @ 7:21 am

For telling a story with data nothing beats a good visualization, which for most of us means a chart, usually created in Excel. Sometimes the situation seems to call for including the data in table format as well. You may include it after the chart, or throw it into an appendix; either way, you’re creating redundancy: The table and the chart are telling the same story.

The table offers accurate detail, but the person who’s trying to find the story in the data needs to study the table for a while to find it, even if there are only very few rows and columns. The chart is readily accessible (or should be), but probably trades away accurate detail to gain simplicity. Today I’ll show you that sometimes you can get the best of both in a single graphic.

Let’s start with a table that looks a bit complicated. This is copied from a study published on this blog by Peter Wylie and John Sammis (Is the online behaviour of your alums worth exploring?) For this example, it’s not essential that you understand what the chart is about. The important thing is to attempt to recognize patterns in the table. You can probably spot them, but only after studying the numbers and comparing values across cells, both up and down and from side to side. It’s not exactly “at a glance” comprehension.

The paper didn’t offer a chart for this table, but were we to express the data visually, we might have used something like the chart below. It’s a bit complex because of all the lines, but we can readily distinguish some key things about the Class Year deciles (that’s what “CY Decile” stands for). First of all, they’re arranged in order, from top to bottom, by age. The oldest decile starts high and stays up there, while the two youngest deciles languish at the bottom. All the rest are stacked in between. I haven’t told you what they are high IN, but let’s not worry about that for now.

Whether the chart is helpful or not, it’s still takes up extra room: One table, one chart, same data. What if we could COMBINE the two types of information — numerical and visual — into a single entity? Fortunately, newer versions of Excel offer the ability to include visual cues within the data table itself using conditional formatting. Here is the original table with colour added. Now what do you see?

That’s right, when cell are shaded according to the values they contain, the patterns are obvious. The lower the CY Decile value and the higher the Visits value, the greater is the number in the cell. Our eye is immediately drawn to the intersection of row ‘CY Decile 1′ and the column ’8 or more visits’. And as it turns out, this is a very special cell.

How so? Well, this table was created using data from a university that tracks how many times its alumni have visited its website via links sent out in university emails — the columns break down alumni by number of web visits. As well, for this analysis, all alumni were divided up into ten equal groups (deciles) by Class Year, the oldest being CY Decile 1, and the youngest CY Decile 10.

The values in the cells are actually percentages — the percentage of alumni in each group who have Lifetime Giving of $10,000 or more. No one should be surprised that the oldest alumni have the highest percentages, but the truly awesome finding is that frequent web visits are also associated with high levels of lifetime giving. (Anyone who has this data at their institution and fails to make use of it for prospect identification is obviously not interested in raising scads of money. Consider firing them.)

As I said, our eye is immediately drawn to the darkest cell, the 42.9% value. It’s almost like that cell is the “hottest.” Which leads me to the next version of the chart: a “heat map”. Sometimes we want to highlight both extremes, the highest AND the lowest values. It’s very intuitive to think of high values as “hot” and low values as “cool,” and colour them appropriately, like so:

Adding shading is quick and easy. Start by clicking and dragging to select the cells of your table that hold the values you want to jazz up. On the Home tab, click on the Conditional Formatting icon, and select Color Scales. Here you’ve got all sorts of options, one of which is the cold-to-hot scheme. Play with it to your heart’s content: Some shadings might be more appropriate for your data than others.

I can see this being useful for at-a-glance exploration of large tables in which patterns may be difficult to detect. For presenting your data to others, though, adding colour scales makes sense only when the physical layout of the cell values contains a visual pattern that you want to exploit to tell a story. Otherwise it’s just visual junk. The object is not just to make our charts pretty, but to make the data sing.

8 September 2011

Analytics goes to the movies

Filed under: Fun — Tags: , , — kevinmacdonell @ 5:25 am

I’m not a big baseball fan, but I am an analytics fan, so I’m curious to see how the concept is portrayed in the movie Moneyball, set to hit theatres on Sept. 23. The movie is based on the book of the same name by Michael Lewis. It tells the story of Billy Beane (played by Brad Pitt), the general manager of the Oakland A’s in the late 1990s, who needed to field a winning team on a meagre payroll.

You might already know about this analytics success story. Beane and his assistant general manager Paul DePodesta had an idea: There had to be champion players out there whom everyone else had overlooked; if no one was looking for them, they would be affordable. Their prospecting tool would be statistical analysis.

For example, their analysis showed that new stats such as on-base percentage and slugging percentage were better predictors of offensive success than the tried-and-true qualities valued by traditional buyers of talent. The analytical approach paid off, and it was soon copied by other major league baseball teams.

I don’t know if the movie will have much to do with analytics, but the trailer (click here to watch on YouTube), does seem to portray a geeky practitioner trying sell crusty, tradition-bound decision makers on the value of his craft. I like that story.

Better than that, though, is a line spoken by the fictional character Peter Brand (based on DePodesta and played by Jonah Hill): “Your goal shouldn’t be to buy players,” Brand says to Beane. “Your goal should be to buy wins. In order to buy wins, you need to buy runs.”

It’s a statement about success with statistical modeling: You have to start by properly framing the question. The rest is just technique.

The analytical approach doesn’t pay off for professional sports like it used to, because now everyone does it. (The Oakland A’s are not doing very well these days.) But analytics will always pay off in the nonprofit world because in the hunt for potential donors we still mostly compete against ourselves — we either identify our friends and supporters effectively, or we don’t.

17 December 2010

Chart frequency of words and terms from books with Google

Filed under: Coolness, Free stuff, Fun — Tags: , — kevinmacdonell @ 11:46 am

Here’s another interesting bauble from the nerds at Google. The Books Ngram Viewer allows you to plot the frequency of words and phrases that appeared in books published in the past few hundred years. Google estimates they’ve scanned and OCR’d more than 10 percent of all the books ever published, and this plotter is based on a sample of that data.

This “most excellent time-wasting tool” was blogged about by Alexis Madrigal, a senior editor for, in his post, The Decline of Man (as a Word), in which he shows how the word “man” has fared against “woman”. (Not well.) As Madrigal observes, this may not serve a legitimate research purpose, but it sure is fun.

Here’s a sample. I’ve searched for the term “database”, and set the years to search as 1950 to 2008. The y-axis shows the percentage of all the terms contained in Google’s sample of books written in English that are “database” for those years. As you can see, the word didn’t emerge in published sources before the early 1970s. (Click image for full size.)

The tool also allows you to plot the progress of one term against another. If you plot “database” against “data base”, you’ll see that the two-word term enjoyed a short life before the single word took over. I’ve been interested in the use of the word “gift” instead of “donation,” but the plot of those two words isn’t very informative due, I guess, to the many connotations of the word “gift.” Instead I plotted “charitable gift” and “charitable donation” to put the words in context, and came up with this chart. The concept of giving seems to have had quite a heyday up until around 1835, and “donation” was firmly in the lead. By 1880, though, it was all about the gift. (Click image for full size.)

That got me thinking about how well “philanthropy” has done through the years. Mentions before 1750 are rare, so I plotted from then to the present, and once again the first half the 19th century seems to have been relatively more preoccupied with the idea than later on. (Although, of course, who knows what data this is really based on. As I said, it’s fun, but I wouldn’t want to base a thesis on it without knowing more about the underlying data.)

Hmm – this IS fun. What if we plot poverty vs. religion vs. education? This doesn’t tell us what people were giving to, but it does give a glimpse into what they were writing about. “Poverty” has stayed relatively constant since 1750, but look at how “religion” has declined as “education” has risen. One line crosses the other right at 1909. Also interesting is that the trend started reversing direction about 10 years ago.

And finally, this chart plots “data mining” and two variations of “fundraising“. Data mining takes off as a published term in the early 1990s, and the term “fund raising” has merged into the single word, “fundraising.”

All sorts of fun. Try some for yourself! I’d be interested in hearing about any cool combos you come up with that relate to analytics and/or fundraising.

Older Posts »

The Silver is the New Black Theme Blog at


Get every new post delivered to your Inbox.

Join 975 other followers