CoolData blog

15 January 2013

The cautionary tale of Mr. S. John Doe

A few years ago I met with an experienced Planned Giving professional who had done very well over the years without any help from predictive modeling, and was doing me the courtesy of hearing my ideas. I showed this person a series of charts. Each chart showed a variable and its association with the condition of being a current Planned Giving expectancy. The ultimate goal would have been to consolidate these predictors together as a score, in order to discover new expectancies in that school’s alumni database. The conventional factors of giving history and donor loyalty are important, I conceded, but other engagement-related factors are also very predictive: student activities, alumni involvement, number of degrees, event attendance, and so on.

This person listened politely and was genuinely interested. And then I went too far.

One of my charts showed that there was a strong association between being a Planned Giving expectancy and having a single initial in the First Name field. I noted that, for some unexplained reason, having a preference for a name like “S. John Doe” seemed to be associated with a higher propensity to make a bequest. I thought that was cool.

The response was a laugh. A good-natured laugh, but still — a laugh. “That sounds like astrology!”

I had mistaken polite interest for a slam-dunk, and in my enthusiasm went too far out on a limb. I may have inadvertently caused the minting of a new data-mining skeptic. (Eventually, the professional retired after completing a successful career in Planned Giving, and having managed to avoid hearing much more about predictive modeling.)

At the time, I had hastened to explain that what we were looking at were correlations — loose, non-causal relationships among various characteristics, some of them non-intuitive or, as in this case, seemingly nonsensical. I also explained that the linkage was probably due to other variables (age and sex being prime candidates). Just because it’s without explanation doesn’t mean it’s not useful. But I suppose the damage was done. You win some, you lose some.

Although some of the power (and fun) of predictive modeling rests on the sometimes non-intuitive and unexplained nature of predictor variables, I now think it’s best to frame any presentation to a general audience in terms of what they think of as “common sense”. Limiting, yes. But safer. Unless you think your listener is really picking up what you’re laying down, keep it simple, keep it intuitive, and keep it grounded.

So much for sell jobs. Let’s get back to the data … What ABOUT that “first-initial” variable? Does it really mean anything, or is it just noise? Is it astrology?

I’ve got this data set in front of me: all alumni with at least some giving in the past ten years. I see that 1.2% of all donors have a first initial at the front of their name. When I look at the subset of the records that are current Planned Giving expectancies, I see that 4.6% have a single-initial first name. In other words, Planned Giving expectancies are almost four times as likely as all other donors to have a name that starts with a single initial. The data file is fairly large (more than 17,000 records) and the difference is statistically significant.
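For anyone who wants to run the same kind of check, here is a minimal sketch of a pooled two-proportion z-test using only the Python standard library. The counts are hypothetical, picked only to land near the percentages quoted above; they are not the real figures from my file, and the function name `two_proportion_z` is my own. With only a couple of dozen "hits" in the smaller group, the normal approximation is rough, so treat the result as a sanity check rather than a verdict.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (p1, p2, z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return p1, p2, z, p_value


# Hypothetical counts, chosen only to land near the percentages quoted above.
expectancies_with_initial, expectancies = 22, 480
others_with_initial, others = 198, 16_520

p1, p2, z, p_value = two_proportion_z(
    expectancies_with_initial, expectancies, others_with_initial, others
)
print(f"expectancies: {p1:.1%}, other donors: {p2:.1%}, ratio: {p1 / p2:.1f}x")
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```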

What can explain this? When I think of a person whose first name is an initial and who tends to go by their middle name, the image that comes to mind is that of an elderly male with a higher than average income — like a retired judge, say. For each of the variables Age and Male, there is in fact a small positive association with having a one-character first name. Yet, when I account for both ‘Age’ and ‘Male’ in a regression analysis, the condition of having a leading initial is still significant and still has explanatory power for being a Planned Giving expectancy.

I can’t think of any other underlying reasons for the connection with Planned Giving. Even when I continue to add more and more independent variables to the regression, this strange predictor hangs in there, as sturdy as ever. So, it’s certainly interesting, and I usually at least look at it while building models.

On the other hand … perhaps there is some justification for the verdict of “astrology” (that is, “nonsense”). The data set I have here may be large, but the number of Planned Giving expectancies is fewer than 500, and 4.6% of 500 is only 23 records. Regardless of whether p ≤ 0.0001, it could still be just one of those things. I’ve also learned that complex models are not necessarily better than simple ones, particularly when trying to predict something hard like Planned Giving propensity. A quirky variable that suggests no potential causal pathway makes me wary of the possibility of overfitting the noise in my data and missing the signal.

Maybe it’s useful, maybe it’s not. Either way, whether I call it “cool” or not will depend on who I’m talking to.

13 November 2012

Making a case for modeling

Guest post by Peter Wylie and John Sammis

Before you wade too far into this piece, let’s be sure we’re talking to the right person. Here are some assumptions we’re making about you:

  • You work in higher education advancement and are interested in analytics. However, you’re not a sophisticated stats person who throws around terms like regression and cluster analysis and neural networks.
  • You’re convinced that your alumni database (we’ll leave “parents” and “friends” for a future paper) holds a great deal of information that can be used to pick out the best folks to appeal to — whether by mail, email, phone, or face-to-face visits.
  • Your boss and your boss’s bosses are, at best, less convinced than you are about this notion. At worst, they have no real grasp of what analytics (data mining and predictive modeling) are. And they may seem particularly susceptible to sales pitches from vendors offering expensive products and services for using your data – products and services you feel might cause more problems than they will solve.
  • You’d like to find a way to bring these “boss” folks around to your way of thinking, or at least move them in the “right” direction.

If we’ve made some accurate assumptions here, great. If we haven’t, we’d still like you to keep reading. But if you want to slip out the back of the seminar room, not to worry. We’ve done it ourselves more times than you can count.

Okay, here’s something you can try:

1. Divide the alums at your school into ten roughly equal size groups (deciles) by class year. Table 1 is an example from a medium sized four year college.

Table 1: Class Years and Counts for Ten Roughly Equal Size Groups (Deciles) of Alumni at School A

2. Create a very simple score:


This score can assume three values: “0”, “1”, or “2”. A “0” means the alum has neither an email nor a home phone listed in the database. A “1” means the alum has either an email listed in the database or a home phone listed in the database, but not both. A “2” means the alum has both an email and a home phone listed in the database.

3. Create a table that contains the percentage of alums who have contributed at least $1,000 lifetime to your school for each score level for each class year decile. Table 2 is an example of such a table for School A.

Table 2: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School A


4. Create a three dimensional chart that conveys the same information contained in the table. Figure 1 is an example of such a chart for School A.
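If you (or your technical helper) would rather script steps 1 to 3 than build them by hand, the sketch below shows one way it might look in Python. The field names (`class_year`, `email`, `home_phone`, `lifetime_giving`) and the function names are assumptions made for illustration, not anything from the schools' databases, and the decile cut is done by position in the sorted list, so a single class year can straddle two groups.

```python
from collections import defaultdict


def simple_score(has_email, has_home_phone):
    """0 = neither listed, 1 = exactly one listed, 2 = both listed."""
    return int(bool(has_email)) + int(bool(has_home_phone))


def assign_deciles(alums, n_groups=10):
    """Sort by class year (oldest first) and cut into roughly equal groups.

    Decile 1 holds the oldest classes. Cutting by position can split a
    class year across two groups; a real version might snap to year boundaries.
    """
    ordered = sorted(alums, key=lambda a: a["class_year"])
    n = len(ordered)
    return [(i * n_groups // n + 1, alum) for i, alum in enumerate(ordered)]


def pct_at_or_above(alums, threshold=1000):
    """Percent of alums giving >= threshold lifetime, keyed by (decile, score)."""
    counts = defaultdict(lambda: [0, 0])  # (decile, score) -> [givers, total]
    for decile, alum in assign_deciles(alums):
        key = (decile, simple_score(alum["email"], alum["home_phone"]))
        counts[key][1] += 1
        if alum["lifetime_giving"] >= threshold:
            counts[key][0] += 1
    return {key: 100 * g / t for key, (g, t) in counts.items()}
```

Changing `threshold` lets you try giving levels other than $1,000 without touching anything else.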

In the rest of this piece we’ll be showing tables and charts from seven other very diverse schools that look quite similar to the ones you’ve just seen. At the end, we’ll step back and talk about the importance of what emerges from these charts. We’ll also offer advice on how to explain your own tables and charts to colleagues and bosses.

If you think the above table and chart are clear, go ahead and start browsing through what we’ve laid out for the other seven schools. However, if you’re not completely sure you understand the table and the chart, see if the following hypothetical questions and answers help:

Question. “Okay, I’m looking at Table 2 where it shows 53% for alums in Decile 1 who have a score of 2. Could you just clarify what that means?”

Answer. “That means that 53% of the oldest alums at the school who have both a home phone and an email listed in the database have given at least $1,000 lifetime to the school.”

Question. “Then … that means if I look to the far left in that same row where it shows 29% … that means that 29% of the oldest alums at the school who have neither a home phone nor an email listed in the database have given at least $1,000 lifetime to the school?”

Answer. “Exactly.”

Question. “So those older alums who have a score of 2 are way better givers than those older alums who have a score of 0?”

Answer. “That’s how we see it.”

Question. “I notice that in the younger deciles, regardless of the score, there are a lot of 0 percentages or very low percentages. What’s going on there?”

Answer. “Two things. One, most younger alums don’t have the wherewithal to make big gifts. They need years, sometimes many years, to get their financial legs under them. The second thing? Over the last seven years or so, we’ve looked at the lifetime giving rates of hundreds and hundreds of four-year higher education institutions. The news is not good. In many of them, well over half of the solicitable alums have never given their alma maters a penny.”

Question. “So, maybe for my school, it might be good to lower that giving amount to something like ‘has given at least $500 lifetime’ rather than $1,000 lifetime?”

Answer. “Absolutely. There’s nothing sacrosanct about the thousand dollar level that we chose for this piece. You can certainly lower the amount, but you can also raise the amount. In fact, if you told us you were going to try several different amounts, we’d say, ‘Fantastic!’”

Okay, let’s go ahead and have you browse through the rest of the tables and charts for the seven schools we mentioned earlier. Then you can compare your thoughts on what you’ve seen with what we think is going on here.

(Note: After looking at a few of the tables and charts, you may find yourself saying, “Okay, guys. Think I got the idea here.” If so, go ahead and fast forward to our comments.)

Table 3: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School B


Table 4: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School C

Table 5: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School D

Table 6: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School E

Table 7: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School F

Table 8: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School G

Table 9: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School H

Definitely a lot of tables and charts. Here’s what we see in them:

  • We’ve gone through the material you’ve just seen many times. Our eyes have always been drawn to the charts; we use the tables for back-up. Even though we’re data geeks, we almost always find charts more compelling than tables. That is most certainly the case here.
  • We find the patterns in the charts across the seven schools remarkably similar. (We could have included examples from scores of other schools. The patterns would have looked the same.)
  • The schools differ markedly in terms of giving levels. For example, the alums in School C are clearly quite generous in contrast to the alums in School F. (Compare Figure 3 with Figure 6.)
  • We’ve never seen an exception to one of the obvious patterns we see in these data: The longer alums have been out of school, the more money they have given to their school.
  • The “time out of school” pattern notwithstanding, we continue to be taken by the huge differences in giving levels (especially among older alums) across the levels of a very simple score. School G is a prime example. Look at Figure 7 and look at Table 8. Any way you look at these data, it’s obvious that alums who have even a score of “1” (either a home phone listed or an email listed, but not both) are far better givers than alums who have neither listed.

Now we’d like to deal with an often advanced argument against what you see here. It’s not at all uncommon for us to hear skeptics say: “Well, of course alumni on whom we have more personal information are going to be better givers. In fact we often get that information when they make a gift. You could even say that amount of giving and amount of personal information are pretty much the same thing.”

We disagree for at least two reasons:

Amount of personal information and giving in any alumni database are never the same thing. If you have doubts about our assertion, the best way to dispel those doubts is to look in your own alumni database. Create the same simple score we have for this piece. Then look at the percentage of alums for each of the three levels of the score. You will find plenty of alums who have a score of 0 who have given you something, and you will find plenty of alums with a score of 2 who have given you nothing at all.

We have yet to encounter a school where the IT folks can definitively say how an email address or a home phone number got into the database for every alum. Why is that the case? Because email addresses and home phone numbers find their way into alumni databases in a variety of ways. Yes, sometimes they are provided by the alum when he or she makes a gift. But there are other ways. To name a few:

  • Alums (givers or not) can provide that information when they respond to surveys or requests for information to update directories.
  • There are forms that alums fill out when they attend a school sponsored event that ask for this kind of information.
  • There are vendors who supply this kind of information.
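The check we suggested for the first reason (look at each score level in your own database and see who has and hasn't given) can be sketched in a few lines of Python. The field names `email`, `home_phone`, and `lifetime_giving` are placeholders we made up; substitute whatever your export calls them.

```python
from collections import Counter


def giving_by_score(alums):
    """For each score level, return (count, % of all alums, % who ever gave)."""
    totals, givers = Counter(), Counter()
    for alum in alums:
        # Simple score: one point each for an email and a home phone on file.
        score = int(bool(alum["email"])) + int(bool(alum["home_phone"]))
        totals[score] += 1
        if alum["lifetime_giving"] > 0:
            givers[score] += 1
    n = len(alums)
    return {
        s: (totals[s], 100 * totals[s] / n, 100 * givers[s] / totals[s])
        for s in sorted(totals)
    }
```

If personal information and giving really were "the same thing," the last number would be 0 for a score of 0 and 100 for a score of 2.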

Now here’s the kicker. Your reactions to everything you’ve seen in this piece are critical. If you’re going to go to a skeptical boss to try to make a case for scouring your alumni database for new candidates for major giving, we think you need to have several reactions to what we’ve laid out here:

1. “WOW!” Not, “Oh, that’s interesting.” It’s gotta be, “WOW!” Trust us on this one.

2. You have to be champing at the bit to create the same kinds of tables and charts that you’ve seen here for your own data.

3. You have to look at Table 2 (that we’ve recreated below) and imagine it represents your own data.

Table 2: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School A

Then you have to start saying things like:

“Okay, I’m looking at the third class year decile. These are alums who graduated between 1977 and 1983. Twenty-five percent of them with a score of 2 have given us at least $1,000 lifetime. But what about the 75% who haven’t yet reached that level? Aren’t they going to be much better bets for bigger giving than the 94% of those with a score of 0 who haven’t yet reached the $1,000 level?”

“A score that goes from 0 to 2? Really? What about a much more sophisticated score that’s based on lots more information than just email listed and home phone listed? Wouldn’t it make sense to build a score like that and look at the giving levels for that more sophisticated score across the class year deciles?”

If your reactions have been similar to the ones we’ve just presented, you’re probably getting very close to trying to make your case to the higher-ups. Of course, how you make that case will depend on who you’ll be talking to, who you are, and situational factors that you’re aware of and we’re not. But here are a few general suggestions:

Your first step should be making up the charts and figures for your own data. Maybe you have the skills to do this on your own. If not, find a technical person to do it for you. In addition to having the right skills, this person should think doing it would be cool and shouldn’t take forever to finish it.

Choose the right person to show our stuff and your stuff to. More and more we’re hearing people in advancement say, “We just got a new VP who really believes in analytics. We think she may be really receptive to this kind of approach.” Obviously, that’s the kind of person you want to approach. If you have a stodgy boss in between you and that VP, find a way around your boss. There are lots of ways to do that.

Do what mystery writers do; use the weapon of surprise. Whoever the boss you go to is, we’d recommend that you show them this piece first. After you know they’ve read it, ask them what they thought of it. If they say anything remotely similar to: “I wonder what our data looks like,” you say, “Funny you should ask.”

Whatever your reactions to this piece have been, we’d love to hear them.

13 June 2012

Finding predictors of future major givers

Guest post by Peter B. Wylie and John Sammis

For years a bunch of committed data miners (we’re just a couple of them) have been pushing, cajoling, exhorting, and nudging folks in higher education advancement to do one thing: Look as hard at their internal predictors of major giving as they look at outside predictors (like social media and wealth screenings). It seems all that drum beating has been having an effect. If you want some evidence of that, take a gander at the preconference presentations that will be given this August in Minneapolis at the APRA 25th Annual International Conference. It’s an impressive list.

So…what if you count yourself among the converted? That is, you’re convinced that looking at internal predictors of major giving is a good idea. How do you do that? How do you do that, especially if you’re not a member of that small group of folks who:

  • have a solid knowledge of applied statistics as used in both the behavioral sciences and “business intelligence?”
  • know a good bit about topics like multiple regression, logistic regression, factor analysis, and cluster analysis?
  • are practiced in the use of at least one stats application whether it’s SPSS, SAS, Data Desk, or R or some other open source option?
  • are actively doing data mining and predictive modeling on a weekly, if not daily basis?

The answer, of course, is that there is no single, right and easy way to look for predictors of major giving. What you’ll see in the rest of this piece is just one way we’ve come up with – one we hope you’ll find helpful.

Specifically, we’ll be covering two topics:

  • The fact that the big giving in most schools does not begin until people are well into their fifties, if not their sixties
  • A method for looking at variables in an alumni database that may point to younger alums who will eventually become very generous senior alums


Where The Big Money Starts

Here we’ll take you through the steps we followed to show that the big giving in most schools does not begin until alums are well into their middle years.

Step 1: The Schools We Used

We chose six very different schools (public and private, large and small) spread out across North America. For five of the schools, we had the entire alumni database to work with. With one school we had a random sample of more than 20,000 records.

Step 2: Assigning An Age to Every Alumni Record

Using Preferred class year, we computed an estimate of each alum’s age with this formula:

Age = 2012 – preferred class year + 22

Given the fact that many students graduate after the age of 22, it’s safe to assume that the ages we assigned to these alums are slight to moderate underestimates of their true ages.
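In code, the age formula is a one-liner. Here is a small Python sketch; the function and parameter names are ours, and the defaults simply restate the formula above (reference year 2012, assumed graduation age 22).

```python
def estimate_age(preferred_class_year, reference_year=2012, assumed_grad_age=22):
    """Estimated age = reference year - preferred class year + assumed age at graduation."""
    return reference_year - preferred_class_year + assumed_grad_age


# An alum with a preferred class year of 1975 gets an estimated age of 59.
print(estimate_age(1975))
```

Making the reference year a parameter means the same function works if you rerun the analysis in a later year.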

Step 3: Computing The Percentage of The Sum of Lifetime Dollars Contributed by Each Alum

For all the records in each database, we computed each alum’s percentage of the sum of lifetime dollars contributed by all solicitable alums (those who are living and reachable). To do this computation, we divided each alum’s lifetime giving by the sum of lifetime giving for the entire database and converted that value to a percentage.

For example, let’s assume that the sum of lifetime giving for the solicitable alums in a hypothetical database is $50 million. Table 1 shows both the lifetime giving and the percent of the sum of lifetime giving for three different records:

Table 1: Lifetime Giving and Percentage of The Sum of All Lifetime Giving for Three Hypothetical Alums

Just to be clear:

  • Record A has given no money at all to the school. That alum’s percentage is obviously 0.
  • Record B has given $39,500 to the school. That alum’s percentage is 0.079% of $50 million.
  • Record C has given $140,500 to the school. That alum’s percentage is 0.281% of $50 million.
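The arithmetic behind these three records is easy to reproduce. This is a small Python sketch using the hypothetical $50 million total and the three lifetime giving amounts from the example; the function name `pct_of_total` is ours.

```python
def pct_of_total(lifetime_giving, total_giving):
    """One alum's lifetime giving as a percentage of the database-wide sum."""
    return 100 * lifetime_giving / total_giving


TOTAL = 50_000_000  # hypothetical sum of lifetime giving for all solicitable alums
for label, amount in [("A", 0), ("B", 39_500), ("C", 140_500)]:
    pct = pct_of_total(amount, TOTAL)
    print(f"Record {label}: ${amount:,} is {pct:.3f}% of ${TOTAL:,}")
```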

Step 4: Computing The Percentage and The Cumulative Percentage of The Sum of Lifetime Dollars Contributed by Each of 15 Equal-Sized Age Groups of Alums

For each of the six schools, we divided all alums into 15 roughly equal-sized age groups. These groups ranged from alums in their early twenties to those who had achieved or passed the century mark.

To make this all clear we have used School A (whose alums have given a sum of $164,215,000) as an example. Table 2 shows the:

  • total amount of lifetime dollars contributed by each of these age groups in School A
  • the percentage of the $164,215,000 contributed by these groups
  • the cumulative percentage of the $164,215,000 contributed by alums up to and including a certain age group

Table 2: Lifetime Giving, Percent of Sum of Lifetime Giving, and Cumulative Percent of Sum of Lifetime Giving for Fifteen Equal-Size Age Groups In School A
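For readers who want to build a table like this from their own data, here is a rough Python sketch of the computation. It assumes each alum is a plain `(age, lifetime_giving)` pair and that total giving is greater than zero; the function name and the position-based way of cutting the groups are our own choices, not a description of the software we used.

```python
def cumulative_giving_by_age_group(alums, n_groups=15):
    """alums: list of (age, lifetime_giving) pairs.

    Returns one row per age group, youngest first:
    (youngest age, oldest age, group total, % of grand total, cumulative %).
    """
    ordered = sorted(alums)  # youngest first
    n = len(ordered)
    grand_total = sum(giving for _, giving in ordered)
    rows, running = [], 0.0
    for g in range(n_groups):
        # Slice by position so the groups are as equal in size as possible.
        group = ordered[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        total = sum(giving for _, giving in group)
        running += total
        rows.append((group[0][0], group[-1][0], total,
                     100 * total / grand_total, 100 * running / grand_total))
    return rows
```

The last column of the last row should always come out at 100, which makes a handy check that nothing was dropped.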

Here are some things that stand out for us in this table:

  • All alums 36 and younger have contributed less than 1% of the sum of lifetime giving.
  • For all alums under age 50 the cumulative amount given is just over 7% of the sum of lifetime giving.
  • For all alums under age 62 the cumulative amount given is less than 30% of the sum of lifetime giving.
  • For all alums under age 69 the cumulative amount given is slightly more than 40% of the sum of lifetime giving.
  • Well over 55% of the sum of lifetime giving has come in from alums who are 69 and older.

The big news in this table, of course, is that the lion’s share of money in School A has come in from alums who have long since passed the age of eligibility for collecting Social Security. Not a scintilla of doubt about that.

But what about all the schools we’ve looked at? Do they show a similar pattern of giving by age? To help you decide, we’ve constructed Figures 1-6 that provide the same information as you see in the rightmost column of Table 2: The cumulative percentage of all lifetime giving contributed by alums up to and including a certain age group.

Since Figure 1 below captures the same information you see in the rightmost column of Table 2, you don’t need to spend a lot of time looking at it.

But we’d recommend taking your time looking at Figures 2-6. Once you’ve done that, we’ll tell you what we see.

These are the details of what we see for Schools B-F:

  • School B: Alums 48 and younger have contributed less than 5% of the sum of lifetime giving. Alums 70 and older have contributed almost 40% of the sum.
  • School C: Alums 52 and younger have contributed less than 5% of the sum. Alums 70 and older have contributed more than 40% of the sum.
  • School D: Alums 55 and younger have contributed less than 30% of the sum. Alums 70 and older have contributed almost 45% of the sum.
  • School E: Alums 50 and younger have contributed less than 30% of the sum. Alums 61 and older have contributed more than 40% of the sum.
  • School F: Alums 50 and younger have contributed less than 20% of the sum. Alums 68 and older have contributed well over 50% of the sum.

The big picture? It’s the same phenomenon we saw with School A: The big money has come in from alums who are in the “third third” of their lives.

One Simple Way To Find Possible Predictors of The Big Givers on The Horizon

Up to this point we’ve either made our case or not that the big bucks don’t start coming in from alumni until they reach their late fifties or sixties. Great, but how do we go about identifying those alums in their forties and early fifties who are likely to turn into those very generous older alums?

It’s a tough question. In our opinion, the most rigorous scientific way to answer the question is to set up a longitudinal study that would involve:

  1. Identifying all the alums in a number of different schools who are in the forties and early fifties category.
  2. Collecting all kinds of data on these folks including giving history, wealth screening and other gift capacity information, biographic information, as well as a host of fields that are included in the databases of these schools like contact information, undergraduate activities, and on and on the list would go.
  3. Waiting about ten or fifteen years until these “youngsters” become “oldsters” and seeing which of all that data collected on them ends up distinguishing the big givers from everybody else.

Well, you’re probably saying something like, “Gentlemen, surely you jest. Who the heck is gonna wait ten or fifteen years to get the answers? Answers that may be woefully outdated given how fast society has been changing in the last twenty-five years?”

Yes, of course. So what’s a reasonable alternative? The idea we’ve come up with goes something like this: If we can find variables that differentiate current, very generous older alums from less generous alums, then we can use those same variables to find younger alums who “look like” the older generous alums in terms of those variables.

To bring this idea alive, we chose one school of the six that has particularly good data on their alums. Then we took these steps:

We divided alums 57 and older into ten roughly equal size groups (deciles) by their amount of lifetime giving. Figure 7 shows the median lifetime giving for these deciles.

Table 3 gives a bit more detailed information about the giving levels of these deciles, especially the total amount of lifetime giving.

Table 3: Sum of Lifetime Dollars and Median Lifetime Dollars for 10 Equal Sized Groups of Alums 57 and Older

We picked these eight variables to compare across the deciles:

  • number of alums who have a business phone listed in the database
  • number of alums who participated in varsity athletics
  • number of alums who were a member of a Greek organization as an undergraduate
  • number of alums who have an email address listed in the database
  • number of logins
  • number of reunions attended
  • number of years of volunteering
  • number of events attended

Before we take you through Figures 8-14, we should say that the method we’ve chosen to compare the deciles on these variables is not the way a stats professor or an experienced data miner/modeler would recommend you do the comparisons. That’s okay. We were aiming for clarity here.
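As a rough illustration of the comparison, here is a Python sketch that ranks alums 57 and older by lifetime giving, cuts them into ten groups, and averages one variable within each group. The field names (`age`, `lifetime_giving`, and whatever you pass as `variable`) are assumptions for the example. For yes/no variables stored as 0 or 1, the mean is simply the share of alums in that decile with a "yes."

```python
def decile_means(alums, variable, min_age=57, n_groups=10):
    """Rank alums aged min_age or older by lifetime giving, cut them into
    n_groups, and return the mean of `variable` per group (group 1 = lowest giving)."""
    older = sorted((a for a in alums if a["age"] >= min_age),
                   key=lambda a: a["lifetime_giving"])
    n = len(older)
    means = {}
    for g in range(n_groups):
        group = older[g * n // n_groups:(g + 1) * n // n_groups]
        if group:
            means[g + 1] = sum(a[variable] for a in group) / len(group)
    return means
```

A variable worth paying attention to should show means that climb fairly steadily from group 1 to group 10.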

Let’s go through the figures. We’ve laid them out in order from “not so hot” variables to “pretty darn good” variables.

It’s pretty obvious when you look at Fig. 8 that bigger givers, for the most part, are no more likely to have a business phone listed in the database than are poorer givers.

Varsity athletics? Yes, there’s a little bit of a trend here, but it’s not a very consistent trend. We’re not impressed.

This trend is somewhat encouraging. Good givers are more likely to have been a member of a Greek organization as an undergraduate than not so good givers. But we would not rate this one as a real good predictor.

Now we’re getting somewhere. Better givers are clearly more likely to have an e-mail address listed in the database than are poorer givers.

This one gets our attention. We’re particularly impressed with the difference in the number of logins for Decile 10 (really big givers) versus the number of logins for the lowest two deciles. At this school they should be paying attention to this variable (and they are).

This figure is pretty consistent with what we’ve found across many, many schools. It’s a good example of why we are always encouraging higher ed institutions to store reunion data and pay attention to it.

This one’s a no-brainer.

And this one’s a super no-brainer.

Where to Go from Here

After you read something like this piece, it’s natural to raise the question: “What should I do with this information?” Some thoughts:

  • Remember, we’re not assuming that you’re a sophisticated data miner/modeler. But we are assuming that you’re interested in looking at your data to help make better decisions about raising money.
  • Without using any fancy stats software and with a little help from your advancement services folks, you can do the same kind of analysis with your own alumni data as we’ve done here. You’ll run into a few roadblocks, but you can do it. We’re convinced of that.
  • Once you’ve done this kind of an analysis you can start looking at some of your alums who are in their forties and early fifties who haven’t yet jumped up to a high level of giving. The ones who look like their older counterparts with respect to logins, or reunion attendance, or volunteering (or whatever good variables you’ve found)? They’re the ones worth taking a closer look at.
  • You can take your analysis and show it to someone at a higher decision-making level than your own. You can say, “Right now, I don’t know how to turn all this stuff into a predictive model. But I’d like to learn how to do that.” Or you can say, “We need to get someone in here who has the skills to turn this kind of information into a tool for finding these people who are getting ready to pop up to a much higher level of giving.”
  • And after you have become comfortable with these initial explorations of your data we encourage you to consider the next step – predictive modeling based on those statistics terms we mentioned earlier. It is not that hard. Find someone to help you – your school has lots of smart people – and give it a try. The resulting scores will go a long way toward identifying your future big givers.

As always: We’d love to get your thoughts and reactions to all this.

21 February 2012

Putting an age-guessing trick to the test

Filed under: Alumni, Predictor variables. Posted by kevinmacdonell @ 5:53 am

This question came to me recently via email: What’s a good way to estimate the age of database constituents, when a birth date is missing? The person who asked me wanted to use ‘age’ in some predictive models for giving, but was missing a lot of birth date data.

This is an interesting problem, and finding an answer to it has practical implications. Age is probably the most significant predictor in most giving models. It might be negative in a donor-acquisition model, but positive in almost any other type (renewal, upgrade, major giving). For those of us in higher ed, ‘year of graduation’ is a good proxy for age just as it is. But if you want to include non-degreed alumni (without an ‘expected year of graduation’), friends of the university who are not spouses (you can guess spouse ages somewhat accurately), or other non-graduates, or if you work for a nonprofit or business that has only partial age data, then you might need to get creative.

Here’s a cool idea: A person’s first name can be indicative of his or her probable age. Most first names have varied widely in popularity over the years, and you can take advantage of that fact. Someone named Eldred is probably not a 20-something, while someone named Britney is probably not a retiree. Finding out what they probably ARE is something I’ve written about here: How to infer age when all you have is a name.

It’s simple. If you have age data for at least a portion of your database:

  1. Pull all first names of living individuals from your database, with their ages.
  2. Calculate the average (or median) age for each first name. (Example: The median age of the 371 Kevins in our database is 43.) This is a job for stats software.
  3. For any individual who is missing an age, assign them the average (or median) age of people with the same first name.
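
The three steps above can be sketched in a few lines of Python. This is only an illustration, not the Data Desk workflow used for this post: the record layout (a first name plus an age, or None when the birth date is missing), the sample data, and the function names median_age_by_name and guess_age are all assumptions made up for the example.

```python
from collections import defaultdict
from statistics import median

def median_age_by_name(records):
    """Step 2: median known age for each first name (case-insensitive)."""
    ages = defaultdict(list)
    for first_name, age in records:
        if age is not None:
            ages[first_name.strip().lower()].append(age)
    return {name: median(vals) for name, vals in ages.items()}

def guess_age(first_name, lookup, default=None):
    """Step 3: look up the guessed age; fall back to `default` for unseen names."""
    return lookup.get(first_name.strip().lower(), default)

# Step 1 (hypothetical data): first name, and age or None if unknown
records = [
    ("Kevin", 41), ("Kevin", 43), ("Kevin", 48),
    ("Eldred", 79), ("Eldred", 85),
    ("Britney", 24), ("Britney", None), ("Kevin", None),
]

lookup = median_age_by_name(records)
filled = [(n, a if a is not None else guess_age(n, lookup)) for n, a in records]
print(filled[-2:])  # the two records that were missing an age
```

Names that never appear with a known age get no guess at all here (the default is None), so some other fallback would still be needed for those records.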

When I wrote my first post on this topic, I put the idea out there but didn’t actually test it. It sounds approximate and unreliable, but I didn’t test it because I have no personal need for guessing ages: I’ve got birth dates for nearly every living alum.

Today I will address that omission.

I pulled a data file of about 104,000 living alumni, excluding any for whom we don’t have a birth date. All I requested was ID, First Name, and Age. (I also requested the sum of lifetime giving for each record, but I’ll get to that later.) I pasted these variables into a stats package (Data Desk), and then split the file into random halves of about 52,000 records each. I used only the first half to calculate the average age for each unique first name, rounding the average to the nearest whole number.

I then turned my attention to the ‘test’ half of the file. I tagged each ID with a ‘guessed age’, based on first name, as calculated using the first half of the file.

How did the guessed ages compare with people’s real ages?

I guessed the correct age for 3.5% of people in the test sample. That may not sound great, but I didn’t expect to be exactly right very often: I expected to be in the ballpark. In 17.5% of cases, I was correct to within plus or minus two years. In 37.6% of cases, I was correct to within plus or minus five years. And in 63.5% of cases, I was correct to within plus or minus 10 years. That’s the limit of what I would consider a reasonable guess. For what it’s worth, in order to reach 80% of cases, I would need to expand the acceptable margin of error to plus or minus 15 years, and a span of 30 years is a bit too broad to consider “close”.

I also calculated median age, just in case the median would be a better guess than the average. This time, I guessed the correct age in 3.7% of cases — just a little better than when I used the average, which was also true as I widened the margin of error. In 18.5% of cases, I was correct to within plus or minus two years. In 38.8% of cases, I was correct to within plus or minus five years. And in 64.1% of cases, I was correct to within plus or minus 10 years. So not much of a difference in accuracy between the two types of guesses.
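
Each of the accuracy figures above is simply the share of test records whose guess lands within a given number of years of the true age. A rough sketch of that tally follows; the (actual, guessed) pairs are made up, and the function name share_within is hypothetical.

```python
def share_within(pairs, tolerance):
    """Fraction of (actual, guessed) age pairs that differ by at most `tolerance` years."""
    hits = sum(1 for actual, guessed in pairs if abs(actual - guessed) <= tolerance)
    return hits / len(pairs)

# Hypothetical results from a test half: (actual age, guessed age)
pairs = [(43, 43), (30, 32), (61, 55), (25, 38), (70, 52)]

for tol in (0, 2, 5, 10):
    print(f"within +/- {tol} years: {share_within(pairs, tol):.0%}")
```

Running the same function once on the average-based guesses and once on the median-based guesses gives the side-by-side comparison described here.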

Here’s a chart showing the distribution of errors for the test half of the alumni sample (Actual Age minus Guessed Age), based on the median calculation:

The distribution seems slightly right-skewed, but in general a guess is about as likely to be “too old” as “too young.” Some errors are extreme, but they are relatively few in number. That has more to do with the fact that people live only so long, which sets a natural limit on how wrong I can be.

Accuracy would be nice, but a variable doesn’t need to be very accurate to be usable in a predictive model. Many inputs are not measured accurately, but we would never toss them out for that reason, if they were independent and had predictive power. Let’s see how a guessed-age variable compares to a true-age variable in a regression analysis. Here is the half of the sample for whom I used “true age”:

The dependent variable is ‘lifetime giving’ (log-transformed), and the sole predictor is ‘age’, which accounts for almost 15% of the variability in LTG (as we interpret the R squared statistic). It’s normal for age to play a huge part in any model trained on lifetime giving.

Now we want to see the “test” half, for whom we only guessed at constituents’ ages. Here is a regression using guessed ages (based on the average age). The variable is named “avg age new”:

This tells me that a guessed age isn’t nearly as reliable as the real thing, which is not a big surprise. The model fit has dropped to an R squared of only .05 (5%). Still, that’s not bad. As well, the p-value is very small, which suggests the variable is significant, and not some random effect. It’s a lot better than having nothing at all.
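
For anyone who wants to try this comparison outside a stats package: with a single predictor, R squared is just the squared correlation between the predictor and the outcome. The sketch below uses a handful of made-up records, so its numbers will look nothing like the ones reported here. The log(1 + giving) transform is also an assumption, since the post doesn’t say how non-donors were handled before taking logs.

```python
import math

def r_squared(x, y):
    """R squared of a simple (one-predictor) least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical data: true ages, name-based guesses, lifetime giving in dollars
true_age = [25, 34, 47, 58, 66, 72]
guessed_age = [38, 30, 52, 49, 60, 55]
giving = [0, 50, 400, 1500, 900, 5000]
log_giving = [math.log(1 + g) for g in giving]  # assumed transform

print(f"true age:    R^2 = {r_squared(true_age, log_giving):.2f}")
print(f"guessed age: R^2 = {r_squared(guessed_age, log_giving):.2f}")
```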

Finally, for good measure, here’s another regression, this time using median age as the predictor. The result is practically the same.

If I had to use this trick, I probably would. But will it help you? That depends. What is significant in my model might not be in yours, and to be honest, with the large sample I have here, achieving “significance” isn’t that hard. If three-quarters of the records in your database are missing age data, this technique will give only a very fuzzy approximation of age and probably won’t be all that useful. If only one-quarter are missing, then I’d say go for it: This trick will perform much better than simply plugging in a constant value for all missing ages (which would be one lazy approach to missing data needed for a regression analysis).

Give it a try, and have fun with it.

P.S.: A late-coming additional thought. What if you compare these results with simply plugging in the average or median age for the sample? Using the sample average (46 years old):

  • Exact age: correct 2.2% of the time (compared to 3.5% for the first-name trick)
  • Within +/- 2 years: correct 11.1% of the time (compared to 17.5%)
  • Within +/- 5 years: correct 24.4% of the time (compared to 37.6%)
  • Within +/- 10 years: correct 46.5% of the time (compared to 63.5%)

Plugging in the median instead hardly makes a difference in age-guessing accuracy. So, the first-name trick would seem to be an improvement.

16 January 2012

Address updates and affinity: Consider the source

Filed under: Correlation, Predictor variables, skeptics | kevinmacdonell @ 1:03 pm

Some of the best predictors in my models are related to the presence or absence of phone numbers and addresses. For example, the presence of a business phone is usually a highly significant predictor of giving. As well, a count of either phone or address updates present in the database is also highly correlated with giving.

Some people have difficulty accepting this as useful information. The most common objection I hear is that such updates can easily come from research and data appends, and are therefore not signals of affinity at all. And that would be true: Any data that exists solely because you bought it or looked it up doesn’t tell you how someone feels about your institution. (Aside from the fact that you had to go looking for them in the first place — which I’ve observed is negatively correlated with giving.)

Sometimes this objection comes from someone who is just learning data mining. Then I know I’m dealing with someone who’s perceptive. They obviously get it, to some degree — they understand there’s potentially a problem.

I’m less impressed when I hear it from knowledgeable people, who say they avoid contact information in their variable selection altogether. I think that’s a shame, and a signal that they aren’t willing to put in the work to a) understand the data they’re working with, or b) take steps to counteract the perceived taint in the data.

If you took the trouble to understand your data (and why wouldn’t you?), you’d find out soon enough if the variables are usable:

  • If the majority of phone numbers or business addresses or what-have-you are present in the database only because they came off donors’ cheques, then you’re right in not using them to predict giving. They’re not independent of giving and will harm your model. The telltale sign might be a correlation with the target variable that exceeds correlations for all your other variables.
  • If the information could have come to you any number of ways (with gift transactions being only one of them), then use with caution. That is, be alert if the correlation looks too good to be true. This is the most likely scenario, which I will discuss in detail shortly.
  • If the information could only have come from data appends or research, then you’ve got nothing much to worry about: The correlation with giving will be so weak that the variable probably won’t make it into your model at all. Or it may be a negative predictor, highlighting the people who allowed themselves to become lost in the first place. An exception to the “don’t worry” policy would be if research is conducted mainly to find past donors who have become lost — then there might be a strong correlation that will lead you astray.

An in-house predictive modeler will simply know what the case is, or will take the trouble to find out. A vendor hired to do the work may or may not bother — I don’t know. As far as my own models are concerned, I know that addresses and phone numbers come to us via a mix of voluntary and involuntary means: Via Phonathon, forms on the website, records research, and so on.

I’ve found that a simple count of all historical address updates for each alum is positively correlated with giving. But a line plot of the relationship between number of address updates and average lifetime giving suggests there’s more going on under the surface. Average lifetime giving goes up sharply for the first half-dozen or so updates, and then falls away just as sharply. This might indicate a couple of opposing forces: Alumni who keep us informed of their locations are more likely to be donors, but alumni who are perpetually lost and need to be found via research are less likely to be donors.

If you’re lucky, your database not only has a field in which to record the source of updates, but your records office is making good use of it. Our database happens to have almost 40 different codes for the source, applied to some 300,000 changes of address and/or phone number. Not surprisingly, some of these are not in regular use — some account for fewer than one-tenth of one percent of updates, and will have no significance in a model on their own.

For the most common source types, though, an analysis of their association with giving is very interesting. Some codes are positively correlated with giving, some negatively. In most cases, a variable is positive or negative depending on whether the update was triggered by the alum (positive), or by the institution (negative). On the other hand, address updates that come to us via Phonathon are negatively correlated with giving, possibly because by-mail donors tend not to need a phone call — if ‘giving’ were restricted to phone solicitation only, perhaps the association might flip toward the positive. Other variables that I thought should be positive were actually flat. But it’s all interesting stuff.

For every source code, a line plot of average LT giving and number of updates is useful, because the relationship is rarely linear. The relationship might be positive up to a point, then drop off sharply, or maybe the reverse will be true. Knowing this will suggest ways to re-express the variable. I’ve found that alumni who have a single update based on the National Change of Address database have given more than alumni who have no NCOA updates. However, average giving plummets for every additional NCOA update. If we have to keep going out there to find you, it probably means you don’t want to be found!
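
The numbers behind such a line plot are easy to produce: group alumni by how many updates they have from a given source, then average lifetime giving within each group. Here is a minimal sketch, assuming each alum is a dictionary with a lifetime_giving amount and a count of updates per source code. The field names, the “NCOA” and “WEB” codes, and the data are all invented for illustration.

```python
from collections import defaultdict

def avg_giving_by_update_count(alumni, source):
    """Average lifetime giving for each count of updates from one source code."""
    totals = defaultdict(lambda: [0.0, 0])  # update count -> [sum of giving, alumni]
    for alum in alumni:
        k = alum["updates"].get(source, 0)
        totals[k][0] += alum["lifetime_giving"]
        totals[k][1] += 1
    return {k: s / n for k, (s, n) in sorted(totals.items())}

# Hypothetical alumni records
alumni = [
    {"lifetime_giving": 100.0, "updates": {}},
    {"lifetime_giving": 300.0, "updates": {"WEB": 2}},
    {"lifetime_giving": 900.0, "updates": {"NCOA": 1}},
    {"lifetime_giving": 700.0, "updates": {"NCOA": 1, "WEB": 1}},
    {"lifetime_giving": 50.0, "updates": {"NCOA": 3}},
]

print(avg_giving_by_update_count(alumni, "NCOA"))
```

If the averages rise and then fall, as they do in this toy data, one possible re-expression is a simple flag such as “exactly one NCOA update” instead of the raw count.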

Classifying contact updates by source is more work, of course, and it won’t always pay off. But it’s worth exploring if your goal is to produce better, more accurate models.

14 September 2011

Is the online behaviour of your alums worth exploring?

By Peter Wylie and John Sammis

(Download a printer-friendly PDF version of this paper: Online behaviour of alums)

For a number of years John Sammis and I have been pushing colleges and universities to examine the data they (or their vendors) collect for alums who are members of their online communities. For example, we encourage them to look at very basic things like:

  • The number of e-mails an alum has opened since it’s been possible to get such data
  • The number of “click throughs” an alum has made to the website in response to an e-mail, an e-newsletter, and the like
  • The number of times an alum visits the website
  • The date and time of each visit

Why do we think they should be recording and examining these kinds of data? Because (based on some limited but compelling evidence) we think such data are related to how much and how often alums give to their alma maters as well as how engaged they are with these institutions (e.g., reunion attendance, volunteering, etc.). To ignore such data means leaving money on the table and losing a chance to spot alums who are truly interested in the school, even if they’ll never become major givers.

Frankly the response to our entreaties has been less than heartening:

  • “We don’t have an online community. If we get one, that’s probably a year or two away.”
  • “With the explosion of social media, we’re more interested in what we can learn about our alums through Facebook, LinkedIn, Twitter … I mean those are the sites our alums will be going to, not ours.”
  • “You want us to get record-by-record data from the vendor who maintains our site? Surely you jest. We’re lucky if they’ll send us decipherable summary data on email openings and click-throughs.”

But we’re nothing if not persistent. So what we’ve done here is put together some data from a four year higher education institution that has a pretty active online community. Granted, it’s only one school, but the data show a pronounced relationship between number of website visits and several different measures of alumni engagement and alumni giving.

We have to believe this school is not a glaring exception among the thousands of schools out there that have online communities. Our hope is that you’ll read through what we have to show and tell and conclude, “What the heck. Why don’t we take a similar look at our own data and see what we can see. Can’t hurt.”

Nope. Can’t hurt, and it might help – might help a lot.

A View of the Overall Distribution of Website Visits and the Distribution of Visits by Class Year

Table 1 shows that almost exactly two thirds of the alums have never visited the school’s website as an identifiable member of the school’s online community. The remaining third are roughly evenly divided among four categories: one visit; two to three visits; four to seven visits; and eight or more visits.

Table 1: Frequency and Percentage Distribution of Website Visits for More Than 40,000 Alums

As soon as we saw this distribution, we were quite sure it would vary a great deal depending how long people had been out of school. To confirm that hunch we divided all alums into ten roughly equal sized groups (i.e., into deciles).
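
For readers who want to try the same split, here is one way to sketch it in Python. It ranks records by class year and cuts the ranking into ten equal slices. This is an assumption about method, not a description of how the stats package did it, and records sharing a class year at a cut point can land in different deciles. The function name assign_deciles and the data are hypothetical.

```python
from statistics import median

def assign_deciles(class_years):
    """Decile for each record: 1 = earliest class years, 10 = most recent."""
    order = sorted(range(len(class_years)), key=lambda i: class_years[i])
    n = len(order)
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // n + 1
    return deciles

# Hypothetical class years for 60 alums, deliberately not in sorted order
class_years = [1951 + (i * 7) % 60 for i in range(60)]
deciles = assign_deciles(class_years)

# Count, median, minimum and maximum class year per decile
for d in range(1, 11):
    years = [y for y, dec in zip(class_years, deciles) if dec == d]
    print(d, len(years), median(years), min(years), max(years))
```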

Table 2: Count, Median Class Year, and Minimum and Maximum Class Years for All Alums Divided into Deciles

As you can see in Table 2, there are some very senior people in this alumni universe, and there are some very junior people. For example, the majority of folks in Decile 10 (CY 2006 – CY 2010) are probably in their 20s. What about Decile 1 (CY 1926 – CY 1958)? It’s a safe bet that these folks are all over 70, and we may have at least one centenarian in the group (which we think is pretty cool).

If you look at Table 3, you can see the percentage distribution of website visits for each Class Year Decile. However, the problem with that table (and most tables that have lots of information in them) is that (unless you’re a data geek like we are) it’s not something you want to spend a lot of time looking at. You’d probably prefer to look at a chart, a graphic display of the data. So what we’ve done here (and throughout this paper) is display the data graphically for the folks in Decile 1, Decile 5, and Decile 10 – very senior people, middle-aged people, and very young people.

Table 3: Percentage of Website visits by Class Year Decile


Clearly our hunch was right. The distribution of website visits is highly related to how long people have been out of school:

  • Over 90% of alums who graduated before 1959 (Decile 1) have not visited the website.
  • In the youngest group (Decile 10) only a bit over 25% of alums have not visited the site.
  • You have to look at Table 3 to see the trend, but notice how the “0 Visits” percentage drops for Deciles 7-10 (a span covering alums graduating in 1992 up to 2010): 68.9% down to 64.3% down to 46.5% down to 27.7%.


The Relationship between Number of Website Visits and Alumni Engagement

If you work in higher education advancement, you probably hear the term “alumni engagement” mentioned several times a week. It’s something lots and lots of folks are concerned about. And plenty of these folks are finding more and more ways to operationally define the term.

Here we’ve taken a very simple approach. We’ve looked at whether or not an alum had ever volunteered for the institution and whether or not an alum had ever attended a reunion.


Table 4 and Figures 4 to 6 show the relationship between number of website visits and volunteering. Just to be clear on what we’re laying out here, let’s go through some of the details of Table 4.

We’ll use Class year Decile 1 (alums who graduated between 1926 and 1958) as an example. Look at the alums in this Decile who have never visited the website; only 17.1% of them have ever volunteered. On the other hand, 42.9% of alums who have visited the website 8 or more times have volunteered. If you look at Figure 4, of course, you’ll see the same information depicted graphically.

Table 4: Percentage of Alums by Number of Website Visits for All Deciles Who Ever Volunteered

There are two facts that stick out for us in Table 4 and Figures 4 to 6:

  • Alums who have never visited the website are far less likely to have volunteered than those who have visited even once.
  • In general, there is a steady climb in the rate of volunteering as the number of website visits increases.
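
A table like Table 4 boils down to a small calculation: for the alums in one decile, bucket the number of website visits and take the percentage in each bucket who have the flag of interest. Here is a minimal sketch for a single decile. The row layout, field names and data are made up, and the buckets follow the five categories described for Table 1.

```python
BUCKETS = ["0", "1", "2-3", "4-7", "8+"]

def visit_bucket(visits):
    """Collapse a non-negative visit count into the five visit categories."""
    if visits <= 1:
        return str(visits)
    if visits <= 3:
        return "2-3"
    if visits <= 7:
        return "4-7"
    return "8+"

def pct_with_flag(rows, flag):
    """Percentage of alums in each visit bucket for whom `flag` is true."""
    tally = {b: [0, 0] for b in BUCKETS}  # bucket -> [flagged, total]
    for row in rows:
        cell = tally[visit_bucket(row["visits"])]
        cell[0] += 1 if row[flag] else 0
        cell[1] += 1
    # None marks a bucket with no alums in it
    return {b: (100.0 * yes / n if n else None) for b, (yes, n) in tally.items()}

# Hypothetical alums from one class year decile
rows = [
    {"visits": 0, "volunteered": False},
    {"visits": 0, "volunteered": False},
    {"visits": 0, "volunteered": False},
    {"visits": 0, "volunteered": True},
    {"visits": 1, "volunteered": False},
    {"visits": 1, "volunteered": True},
    {"visits": 5, "volunteered": True},
    {"visits": 12, "volunteered": True},
]

result = pct_with_flag(rows, "volunteered")
print(result)
```

The same function works for any yes/no field, such as reunion attendance or a recent gift, by passing a different flag name.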

Reunion Attendance

If you look through Table 5 and Figures 7 to 9, you’ll see a relationship between number of website visits and reunion attendance that’s very similar to what you saw between number of website visits and volunteering. The one exception would be for the youngest group of alums – those in Decile 10 who graduated between 2006 and 2010. These alums simply are too young to have attended a five year reunion. (Although it would appear that several of them found a way to make it back to school anyway – good for them.)

Table 5: Percentage of Alums by Number of Website Visits for All Deciles Who Ever Attended a Reunion

The Relationship between Number of Website Visits and Giving

There is no question that advancement offices are interested in alumni engagement. But if we’re realistic, we have to admit they tend to view engagement as mainly a step in the direction of one day becoming a donor. So let’s take a look at how number of website visits is related to alumni giving at this school.

We’ve created two sets of tables and figures to allow you to get a clear look at all this:

  • Table 6 and Figures 10 to 12 show the relationship between the number of website visits and giving over the past two fiscal years.
  • Table 7 and Figures 13-15 show the relationship between the number of website visits and lifetime giving of $10,000 or more.

Browse through all this material. After you’ve done that, we’ll tell you what we see.

Table 6: Percentage of Alums by Number of Website Visits for All Deciles Who Have Given Anything in the Last Two Fiscal Years

Table 7: Percentage of Alums by Number of Website Visits for All Deciles Who Have Given $10,000 or More Lifetime

Clearly, there is a lot of information contained in these tables and charts. But if we stand back from all that we see, the picture becomes clear. Regardless of how long alums have been out of school, those who have visited the website are better recent givers, and better major givers, than those who have not.

For example, let’s focus on alums who graduated before 1959 (Decile 1). Those who have visited the website at least 8 times are almost twice as likely to have given in the last two fiscal years as those who have never visited the site (75% versus 41.6%). If we look at giving of $10,000 or more lifetime for this same Decile, the difference is even more striking: 42.9% versus 12.5%.

Let’s jump down to Decile 10, the “youngsters” who graduated between 2006 and 2010. Understandably, almost none of these alums have given $10,000 or more lifetime. But look at Figure 12. For this group the relationship between number of website visits and giving over the last two fiscal years is striking:

  • 27.8% for those with 0 website visits gave during this period.
  • 35.1% for those with 1 visit gave during this period.
  • 38.1% for those with 2-3 visits gave during this period.
  • 43.1% for those with 4-7 visits gave during this period.
  • 50.9% for those with 8 or more visits gave during this period.

Where to Go from Here

Clearly, there is a strong relationship between this simple web metric (number of website visits) and alumni engagement and alumni giving at this particular school. If that’s the case, it’s reasonable to assume that the same sort of relationship holds true for other schools. If you agree with that assumption, then we think it’s more than worth your while to take a look at similar data at your own institution.

At this point you might decide:

“Look guys, this is all very interesting, but we simply don’t have the time, resources, or staff to do that. Maybe sometime in the future, when things are less hectic around here, we’ll take your advice. But not now.”

As much as we love this sort of analysis, we totally get a decision like that. We may be specialists, but we talk to enough people in advancement every week to realize you have a lot more on your minds than data mining and predictive modeling.

On the other hand, you might conclude that what we’ve done here is something you’d like to try to replicate, or improve on. If so, here’s what we’d recommend:

  1. Find out what kind of online data is available to you.
  2. Ask your technical folks to get those data into analyzable form for you.
  3. Do some simple analyses with the data.
  4. Share the results with colleagues you think would find it interesting.


1. Find Out What Kind of Data Is Available

Depending on how your shop is set up, this may take some persistence and digging. If it were us, we’d be trying to find out:

  • Has an alum ever opened an email that we’ve sent them? (In a lot of schools they don’t have to be a member of the online community for you to ascertain that.)
  • Have they ever opened an e-newsletter?
  • Have they ever clicked through to your website from an e-mail or e-newsletter?
  • Can you get counts for number of openings and number of click-throughs?

In all probability, you’ll be dealing with a vendor (either directly or through your IT folks) to get answers to these questions. Expect some pushback. A dialogue that goes like this would not be unusual:

YOU: Can I get the number of e-mails and e-newsletters that each of our alums has opened since the school has been sending out that kind of stuff?

VENDOR: We can certainly give you the number of e-mails and number of e-newsletters that were opened on each date that one was sent out.

YOU: That’s great, but that’s not what I’m looking for. I need to know, on a record-by-record basis, which alums opened the e-communication, and I need a total count for each alum for their total number of openings.

VENDOR: That’ll take some doing.

YOU: But you can do it?

VENDOR: I suppose.

YOU: Terrific!


2. Ask Your Technical Folks to Get The Data Into Analyzable Form.

What does “analyzable form” mean? To us that just means getting the data into spreadsheet format (probably Excel) where the first field will probably be the unique ID number you use to keep track of all the alums (and other constituents) in your fundraising database. For starters, we’d recommend something very simple. For example:

  • Field A: Unique ID number
  • Field B: Total amount of lifetime hard credit (for many alums, this value will be zero)
  • Field C: Total amount of hard credit for the last two fiscal years
  • Field D: Total number of e-mails or e-newsletters opened
  • Field E: Total number of click-throughs to your website from these e-mails and e-newsletters
  • Field F: Preferred class year of the alum

In our opinion, this kind of file should be very simple to build. In our experience, however, that is often not the case. (Why? How much time you got?)

Our frustrations with this sort of problem notwithstanding, keep pushing for the file. Be polite. Be diplomatic. And, above all, be persistent.


3. Do Some Simple Analyses with The Data.

There are any number of ways to analyze your data. Our bias would be to have you import the Excel file into a stats software package, and then do the analysis. (You can do it in Excel, but it’s a lot harder than if you use something like SPSS or Data Desk, which is our preference.)

If you can’t do this yourself, we’d recommend that you find someone on your team or on your campus to do it for you.  The right person, when you ask them if they can roughly replicate the tables and charts included in this paper, will say something like, “Sure,” “No problem,” “Piece of cake,” etc. If they don’t, keep looking.
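
If neither a stats package nor a helpful colleague is at hand, a first look is also possible with a short script. The sketch below assumes the file has been saved from Excel as CSV. The column names are invented stand-ins for Fields A to F, and the sample rows are made up. It loads the file, derives a few yes/no flags, and compares recent giving rates for alums who have opened at least one e-mail against those who have opened none.

```python
import csv
import io

# Hypothetical export; the column names are assumptions standing in for Fields A-F
sample = io.StringIO(
    "id,lifetime_credit,two_year_credit,opens,clicks,class_year\n"
    "1001,0,0,0,0,1998\n"
    "1002,250,50,12,3,1975\n"
    "1003,15000,500,30,9,1962\n"
    "1004,0,0,4,0,2008\n"
    "1005,75,0,0,0,1989\n"
)

def load_alumni(handle):
    """Read the export and derive simple yes/no flags for each alum."""
    rows = []
    for rec in csv.DictReader(handle):
        rows.append({
            "id": rec["id"],
            "gave_recently": float(rec["two_year_credit"]) > 0,
            "major_giver": float(rec["lifetime_credit"]) >= 10000,
            "opened_any": int(rec["opens"]) > 0,
            "class_year": int(rec["class_year"]),
        })
    return rows

def rate(rows, flag):
    """Share of rows for which `flag` is true (0.0 for an empty group)."""
    return sum(r[flag] for r in rows) / len(rows) if rows else 0.0

alumni = load_alumni(sample)
openers = [r for r in alumni if r["opened_any"]]
non_openers = [r for r in alumni if not r["opened_any"]]
print(f"gave recently, openers:     {rate(openers, 'gave_recently'):.0%}")
print(f"gave recently, non-openers: {rate(non_openers, 'gave_recently'):.0%}")
```

With a real file, the io.StringIO sample would be replaced by something like open("alumni.csv", newline=""), where the file name is again only a placeholder.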


4. Share The Results With Colleagues You Think Would Find It Interesting.

Sharing your results with colleagues should be stimulating and enjoyable. You know the folks you work with and have probably already got some in mind. But here are a few suggestions:

  • Look for people who think data driven decision making is important in higher education advancement.
  • Avoid people who are inclined to tell you why something can’t be done. Include people who enjoy finding ways around obstacles.
  • It’s okay to have one devil’s advocate in the group. More than one? That can be kind of frustrating.
  • If you can, get a vice president to join you. Folks at that level can help move things forward more easily than people at the director level, especially when it comes to “motivating” vendors to do things for you that they’d rather not do.

When you can, let us know how things go.
