CoolData blog

23 February 2011

Data mining and non-alumni NPOs

Filed under: Non-university settings — kevinmacdonell @ 12:43 pm

A year ago I wrote a blog post called Predictive modeling for non-university organizations, in which I suggested that non-profit organizations might learn more about segmenting their databases from for-profit enterprises than from institutions of higher education. I may not have come out and said it, but I was doubtful that serious data mining was even possible for non-profits: no alumni, no non-donor data, not enough data in general.

Well, that was a year ago, and I was wrong. Since then, I have worked with a large non-profit to build a predictive model that is every bit as good as one built from higher-ed alumni data. You can learn more about that experience, and what non-profits can do to get into predictive modeling, on Blackbaud’s Prospect Research Blog. I’ve written a guest post called Who says you need alumni? Check it out!

9 July 2010

How to infer age, when all you have is a name

Filed under: Coolness, External data, Non-university settings, Predictor variables — kevinmacdonell @ 6:02 am

I rarely post on a Friday, let alone a Friday in the middle of summer, but today’s cool idea is somewhat half-baked. Its very flakiness suits the day and the weather. Actually, I think it has potential, but I’m interested to know what others think.

For those of us in higher-ed fundraising, ‘age’ or ‘class year’ is a key predictor variable. Not everyone has this information in their databases, however. What if you could sort of impute a “best guess” age, based on a piece of data that you do have: First name?

Names go in and out of fashion. You may have played around with this cool tool for visualizing baby-name trends. My own first name, Kevin, peaked in popularity in the 1970s and has been on a downward slide ever since (chart here). I was born in 1969, so that’s pretty close. My father’s name, Leo, has not been popular since the 1920s (he was born in 1930), but is having a slight comeback in recent years (chart here).

As for female names, my mother’s name, Yvonne, never ranked in the top 1,000 in any time period covered by this visualization tool, so I’ll use my niece’s name: Katelyn. She was born in 2005. This chart shows that two common spellings of her name peaked around that year. (The axis labeling is a bit wonky — you’ll have to hover your cursor over the display to get a good read on the timing of the peak.)

You can’t look up every first name one by one, obviously, so you’ll need a data set from another source that relates the relative frequencies of names to age data. That sort of thing might be available in census data. But knowing somebody with access to a higher-ed database might be the easiest way.

I’ve performed a query on our database, pulling just three fields for more than 87,000 alumni: ID (to ensure I have unique records), First Name, and Age. (Other databases will have only Class Year; we’re fortunate in that we’ve got birth dates for nearly every living alum.) Here are a few sample rows, with ID number blanked out:

From here, it’s a pretty simple matter to copy the data into stats software (or Excel) to compute counts and median ages for each first name. Amazingly, just six first names account for 10% of all living, contactable alumni! (In order: John, David, Michael, Robert, James, and Jennifer.)

On the other hand, a lot of first names are unique in the database, or nearly so. To simplify things a bit, I calculated median ages only for names represented five or more times in the database. These 1,380 first names capture the vast majority of alumni.
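For readers who would rather script this step than use stats software or Excel, here is a minimal Python sketch of the same calculation: count each first name, take the median age, and keep only names that appear at least five times. The function name, the (first name, age) input layout, and the sample rows are assumptions made for illustration; they are not taken from the actual database.

```python
from collections import defaultdict
from statistics import median


def median_age_by_name(records, min_count=5):
    """Return {first_name: (count, median_age)} for names seen at least min_count times.

    records is an iterable of (first_name, age) pairs (a hypothetical layout).
    """
    ages = defaultdict(list)
    for first_name, age in records:
        # Normalize case and whitespace so "kevin " and "Kevin" are counted together.
        ages[first_name.strip().title()].append(age)
    return {
        name: (len(values), median(values))
        for name, values in ages.items()
        if len(values) >= min_count
    }


# Made-up rows for illustration only.
sample = [("Kevin", 41), ("Kevin", 43), ("kevin ", 45), ("Leo", 50), ("Leo", 52)]

# A lower threshold is used here only because the sample is tiny.
print(median_age_by_name(sample, min_count=3))  # prints {'Kevin': (3, 43)}
```

The threshold of five is the one described above; with a smaller file you might lower it, at the cost of noisier medians.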

The ten “oldest” names in the database are listed in the chart below, in descending order by median age. Have a look at these venerable handles. Of these, only Max has staged a rebound in recent years (according to the Baby Names visualizer).

And here are the ten “youngest names,” in ascending order by median age. It’s an interesting coincidence that the very youngest name is Katelyn — my five-year-old niece. One or two (such as Jake) were popular many years ago, and at least one has flipped gender from male to female (Whitney). Most of the others are new on the scene.

The real test is, do these median ages actually provide reasonable estimates of age for people who aren’t in the database?

I’m not in the database (as an alum). There are 371 Kevins in the database, and their median age is 43. I turned 41 in May, so that’s very good.

My father is also not an alum. The 26 Leos in the database have a median age of 50, which is unfortunately 30 years too young. Let’s call that one a ‘miss’.

My mother’s ‘predicted’ age is off by half that, 15 years, which is not too bad.

Here’s how my three siblings’ names fare: Angela (predicted 36, actual 39 — very good), Paul (predicted 48, actual 38 — fair), and Francis (predicted 60, actual 36 — poor). Clearly there’s an issue with Francis, which according to the Baby Names chart tool was popular many decades ago but not when my brother was named. In other words, results for individuals may vary!
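To put a number on "results may vary", the handful of examples above can be summarized as absolute errors. This is only a sketch using the predicted and actual ages quoted in this post. Leo's actual age of 80 is inferred from "30 years too young", and my mother is left out because her predicted and actual ages are not given. Five people is far too few to judge the method.

```python
from statistics import mean, median

# (name, predicted age, actual age), as quoted in the post.
# Leo's actual age (80) is inferred from "30 years too young".
cases = [
    ("Kevin", 43, 41),
    ("Leo", 50, 80),
    ("Angela", 36, 39),
    ("Paul", 48, 38),
    ("Francis", 60, 36),
]

# Absolute error in years for each person.
abs_errors = {name: abs(predicted - actual) for name, predicted, actual in cases}
mean_err = mean(list(abs_errors.values()))
median_err = median(list(abs_errors.values()))

print(abs_errors)
print("mean absolute error:", mean_err)
print("median absolute error:", median_err)
```

For a real test you would want to hold back a sample of records with known ages and run the same comparison on those.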

So let’s say you’re a non-profit without access to age data for your database constituents. How does this help you? Well it doesn’t — not directly. You will need to find a data partner at a university who will prepare a file for you, just as I’ve done above. When you import the data into your model, you can match up records by first name and voila, you’ve got a variable that gives you a rough estimate of age. (Sometimes very rough — but it’s better than nothing.)
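The matching step might look something like the sketch below. The lookup values are the median ages quoted earlier in this post. The constituent records, the field names ("first_name", "est_age"), and the function itself are hypothetical. Names missing from the lookup get None rather than a guess.

```python
def estimate_ages(constituents, lookup):
    """Return copies of the constituent records with an 'est_age' field added.

    lookup maps a normalized first name to a median age; unmatched names get None.
    """
    results = []
    for person in constituents:
        # Normalize the same way the lookup keys were built.
        key = person["first_name"].strip().title()
        results.append({**person, "est_age": lookup.get(key)})
    return results


# Median ages by first name, as quoted in the post.
lookup = {"Kevin": 43, "Leo": 50, "Angela": 36, "Paul": 48, "Francis": 60}

# Hypothetical non-profit constituents.
constituents = [
    {"id": 1, "first_name": "kevin"},
    {"id": 2, "first_name": "Angela "},
    {"id": 3, "first_name": "Zebedee"},
]

for row in estimate_ages(constituents, lookup):
    print(row)
```

Keeping unmatched names as missing, rather than filling in an overall average, lets you decide later how your model should treat them.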

This is only an idea of mine. I don’t know if anyone has actually done this, so I’d be interested to hear from others. Here are a few additional thoughts:

  • There shouldn’t be any privacy concern — all you want is a list of first names and their median ages, NOT IDs or last names — but be sure to get all necessary approvals.
  • To anticipate your question, no, I won’t provide you my own file. I think you’d be much better off getting a names file from a university in your own city or region, which will provide a more accurate reflection of the ethnic flavour of your constituency.
  • I used “First name”, but of course universities collect first, middle and last names, and the formal first name might not be the preferred moniker. If the university database has a “Preferred first name” field that is fully populated, that might be a better option for matching with your first-name field.
  • Again, there might be more accessible sources of name- or age-related data out there. This idea just sounded fun to try!

22 April 2010

Introducing the nonprofit data collective

Filed under: Non-university settings — kevinmacdonell @ 8:20 am

Yesterday I gave a conference presentation to a group of fundraisers, all but one of whom work for non-university nonprofits. Many have databases that are small, do not capture the right kinds of information to develop a model, or are unfit in any number of ways. But this group seemed highly attentive to what I was talking about, understood the concepts, and a few were eager to improve the quality of their data – and from there get into data mining someday.

The questions were all spot-on. One person asked how many database records one needed as a minimum for predictive modeling. I don’t know if there’s a pat answer for that one, but in any case I think my answer was discouraging. If you’re below a certain size threshold, you may not have any need for modeling at all. But the fact is, if you want to model mass behaviour, you need a lot of data.

So here’s a thought. What if a bunch of small- to mid-sized charities were to somehow get together and agree to submit their data to a centralized database? Before you object, hear me out.

Each charity would fund part of the salary of a database administrator and data-entry person, in proportion to its share of the donor records in the data pool. The first benefit is that data would be entered and stored according to strict quality-control guidelines. Any time a charity required an address file for a particular mailing according to certain selection criteria, they’d just ask for it. The charity could focus on the delivery of their mission, not mucking around with a database they don’t fully know how to use.

The next benefit is scale. The records of donors to charities with related missions can be pooled for the purpose of building stronger predictive models than any one charity could hope to do on its own. Certain costs, such as list acquisition, could be shared and the benefits apportioned out. Some cross-promotion between causes could also occur, if charities found that to have a net benefit.

Maybe charities would not choose to cede control of their data. Maybe there are donor privacy concerns that I’m overlooking. It’s just an idea. My knowledge of the nonprofit sector outside of universities is limited – does anyone know of an example of this idea in use today?

P.S. (18 Feb 2011): This post by Jim Berigan on the Step by Step Fundraising blog is a step in the right direction: 5 Reasons You Should Collaborate with Another Non-profit in 2011.

11 March 2010

Predictive modeling for non-university organizations

Filed under: Non-university settings — kevinmacdonell @ 6:57 pm

Sometimes I think I have it too easy. Those of us working in post-secondary education advancement have so much biographical and other data to work with! There’s class year, abundant contact information, often employment information, and all the data generated by an individual’s willing engagement with a chosen institution and peer group. There is little need to collect anything external to provide additional predictors. (In fact, the external info I’ve been able to pair up with my data has had zero predictive value, in comparison.)

Not so, presumably, for those of you who work for other non-profits. Not only is your data more fragmentary, but you’re probably missing a key element: non-donors. You can’t distinguish the characteristics of a donor from those of a non-donor if both animals are not fully present in your database. (In a university database, a majority of constituents, usually, are non-donors.)

I admit, I have no idea what a donor database for, say, a hospital foundation looks like. If it’s strictly a DONOR database, then it’s got no non-donors in it. If it’s a PATIENT database (to which you append donor info), then obviously you’re in better shape. Still, given the understandable privacy restrictions that must be in place, predicting donor behaviour must be a bigger challenge.

I am not surprised, though, that some of the leaders in the field of predictive modeling work in hospital fundraising. I recently had the pleasure of having a long phone conversation with Kate Chamberlin at Memorial Sloan-Kettering Cancer Center, in New York. Her team of analysts (yes! her team! the idea makes one swoon) are not only proving that predictive modeling has a place outside of the alumni-type database, but they’re excelling at it.

The Chronicle of Philanthropy ran an excellent piece back in January, A New York Cancer Center Uses Technology to Predict Who Will Give, which begins: “Almost every charity’s pool of donors includes plenty of people who have both the means and the inclination to make a far bigger gift than they ever did in the past. The trick, of course, is to figure out just which people will make the leap. To that end, Memorial Sloan-Kettering Cancer Center, in New York, has become… “

(And that’s it, unless you have a subscription. Unfortunately, I’m linking to it a little late.)

Sure, they’re big. But they’re successful because they’re smart. And they share their smarts at conferences. My advice to university data miners who plan to attend a conference this year: Seek out the non-university presenters and see what they have to say. Chances are they’ve got some of the most creative approaches, because they have to.

As for non-university readers who are drawn to what university advancement departments are doing: I wonder if you would be better off looking to the private sector for inspiration, rather than to us university types?

I’m thinking of big retailers with data-rich selling environments such as an online bookstore, or a hardware chain with a strong loyalty-reward program. Most people in the database, for example, must have bought at least one book at some point in the past. Just as an arts-based nonprofit has no non-donors in their database, a bookseller would have no non-customers. The bookseller would attempt to predict who is most likely to make more purchases, versus who will buy once and abandon their account, and target accordingly. The store could withhold expensive incentives from the high scorers (since they will keep buying anyway), and from the low scorers (since they’re unlikely to respond in numbers), and target the middle instead.
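As a rough illustration of "target the middle", here is a sketch that ranks customers by a model score and keeps only the middle third. The equal-thirds cut, the function name, and the scores are all assumptions made for the example. A real retailer would presumably set its cutoffs by testing.

```python
def pick_middle_third(scores):
    """Return customer ids ranked by score, with the bottom and top thirds removed.

    scores maps a customer id to a propensity score (higher = more likely to buy).
    """
    ranked = sorted(scores, key=scores.get)  # lowest score first
    cut = len(ranked) // 3                   # size of each tail to drop
    return ranked[cut:len(ranked) - cut]


# Made-up scores for six hypothetical customers.
scores = {"a": 0.05, "b": 0.20, "c": 0.45, "d": 0.55, "e": 0.80, "f": 0.95}

print(pick_middle_third(scores))  # prints ['c', 'd']
```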

Obviously, no charity would leave their high scorers untended that way – there is always some higher goal to reach. And of course there are big differences. But in terms of technique, maybe there are congruities to explore.

Chronicle of Philanthropy: A New York Cancer Center Uses Technology to Predict Who Will Give

The story was posted January 7 and was available free for a while, but you will need a subscription now.

