CoolData blog

15 January 2013

The cautionary tale of Mr. S. John Doe

A few years ago I met with an experienced Planned Giving professional who had done very well over the years without any help from predictive modeling, and was doing me the courtesy of hearing my ideas. I showed this person a series of charts. Each chart showed a variable and its association with the condition of being a current Planned Giving expectancy. The ultimate goal would have been to consolidate these predictors together as a score, in order to discover new expectancies in that school’s alumni database. The conventional factors of giving history and donor loyalty are important, I conceded, but other engagement-related factors are also very predictive: student activities, alumni involvement, number of degrees, event attendance, and so on.

This person listened politely and was genuinely interested. And then I went too far.

One of my charts showed that there was a strong association between being a Planned Giving expectancy and having a single initial in the First Name field. I noted that, for some unexplained reason, having a preference for a name like “S. John Doe” seemed to be associated with a higher propensity to make a bequest. I thought that was cool.

The response was a laugh. A good-natured laugh, but still — a laugh. “That sounds like astrology!”

I had mistaken polite interest for a slam-dunk, and in my enthusiasm went too far out on a limb. I may have inadvertently caused the minting of a new data-mining skeptic. (Eventually, the professional retired after completing a successful career in Planned Giving, and having managed to avoid hearing much more about predictive modeling.)

At the time, I had hastened to explain that what we were looking at were correlations — loose, non-causal relationships among various characteristics, some of them non-intuitive or, as in this case, seemingly nonsensical. I also explained that the linkage was probably due to other variables (age and sex being prime candidates). Just because it’s without explanation doesn’t mean it’s not useful. But I suppose the damage was done. You win some, you lose some.

Although some of the power (and fun) of predictive modeling rests on the sometimes non-intuitive and unexplained nature of predictor variables, I now think it’s best to frame any presentation to a general audience in terms of what they think of as “common sense”. Limiting, yes. But safer. Unless you think your listener is really picking up what you’re laying down, keep it simple, keep it intuitive, and keep it grounded.

So much for sell jobs. Let’s get back to the data … What ABOUT that “first-initial” variable? Does it really mean anything, or is it just noise? Is it astrology?

I’ve got this data set in front of me: all alumni with at least some giving in the past ten years. I see that 1.2% of all donors have a first initial at the front of their name. When I look at the subset of the records that are current Planned Giving expectancies, I see that 4.6% have a single-initial first name. In other words, Planned Giving expectancies are almost four times as likely as all other donors to have a name that starts with a single initial. The data file is fairly large (more than 17,000 records) and the difference is statistically significant.
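For readers who want to check a difference like this themselves, here is a minimal sketch of a pooled two-proportion z-test in plain Python. The counts are illustrative stand-ins back-calculated from the rounded percentages above, assuming a hypothetical round figure of 500 expectancies; they are not the actual counts from my file. With only a couple of dozen records in the smaller group, an exact test may be the safer choice.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z-test. Returns (z, two-sided p-value)."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# ASSUMPTION: hypothetical counts chosen to match the rounded percentages
# (4.6% of 500 expectancies; about 1.2% of 17,000 donors overall).
pg_hits, pg_n = 23, 500
other_hits, other_n = 181, 16_500

z, p = two_proportion_z(pg_hits, pg_n, other_hits, other_n)
ratio = (pg_hits / pg_n) / (other_hits / other_n)
print(f"rate ratio = {ratio:.2f}, z = {z:.2f}, two-sided p = {p:.2g}")
```

Note that this compares expectancies against everyone else, rather than against all donors including the expectancies themselves, so the ratio it prints will not exactly match the rounded figures quoted above.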

What can explain this? When I think of a person whose first name is an initial and who tends to go by their middle name, the image that comes to mind is that of an elderly male with a higher than average income — like a retired judge, say. For each of the variables Age and Male, there is in fact a small positive association with having a one-character first name. Yet, when I account for both ‘Age’ and ‘Male’ in a regression analysis, the condition of having a leading initial is still significant and still has explanatory power for being a Planned Giving expectancy.
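I used regression to account for Age and Male. As a simpler stand-in for the same idea (not the method I actually used), the sketch below computes a Mantel-Haenszel pooled odds ratio: split the file into age-by-sex strata, build a 2x2 table in each, and pool them. If the pooled odds ratio stays well above 1, the leading initial is carrying information beyond age and sex. The function name and every count in the example are hypothetical, made up purely to show the mechanics.

```python
def mantel_haenszel_or(strata):
    """Pooled odds ratio across 2x2 tables.

    Each table is (a, b, c, d):
      a = initial & expectancy      b = initial & not expectancy
      c = no initial & expectancy   d = no initial & not expectancy
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty strata
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("odds ratio undefined: no informative strata")
    return num / den


# HYPOTHETICAL counts, one table per (age band, sex) stratum.
example_strata = {
    ("under 60", "female"): (1, 30, 40, 6000),
    ("under 60", "male"): (2, 45, 50, 5500),
    ("60 plus", "female"): (8, 40, 180, 2400),
    ("60 plus", "male"): (12, 66, 207, 2419),
}

pooled_or = mantel_haenszel_or(example_strata.values())
print(f"age- and sex-adjusted odds ratio: {pooled_or:.2f}")
```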

I can’t think of any other underlying reasons for the connection with Planned Giving. Even when I continue to add more and more independent variables to the regression, this strange predictor hangs in there, as sturdy as ever. So, it’s certainly interesting, and I usually at least look at it while building models.

On the other hand … perhaps there is some justification for the verdict of “astrology” (that is, “nonsense”). The data set I have here may be large, but the number of Planned Giving expectancies is less than 500, and 4.6% of 500 is only 23 records. Regardless of whether p ≤ 0.0001, it could still be just one of those things. I’ve also learned that complex models are not better than simple ones, particularly when trying to predict something hard like Planned Giving propensity. A quirky variable that suggests no potential causal pathway makes me wary of the possibility of overfitting the noise in my data and missing the signal.
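To put a rough number on "not very many records", here is a small sketch of a Wilson score interval for a proportion. It assumes exactly 500 expectancies, which is an upper bound rather than the true count, so the real interval would be somewhat wider. The function name is my own for this illustration.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# ASSUMPTION: 23 of 500, i.e. 4.6% of a hypothetical 500 expectancies.
low, high = wilson_interval(23, 500)
print(f"plausible range for the expectancy rate: {low:.1%} to {high:.1%}")
```

A range this wide is a reminder of how few records sit behind the headline percentage, even when the lower end stays above the rate for donors in general.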

Maybe it’s useful, maybe it’s not. Either way, whether I call it “cool” or not will depend on who I’m talking to.

2 March 2010

Fun, creative and lesser-known predictive variables

Filed under: Alumni, Predictor variables. Posted by kevinmacdonell @ 12:58 pm

[Image caption: Your next predictive variable will be found here. (Creative Commons license.)]

University offices record all kinds of things in their databases simply in order to run their own processes: mailing the alumni magazine, ticketing for events, coding mailing preferences and on and on. Finding novel predictors for your models requires talking to colleagues in your department (and around campus) about the database screens they use, and the things they track. Exploring these avenues can be rewarding and rather social as well!

Here are a few variables I’ve tested which might be lesser-known than the ones I’ve written about earlier. These aren’t likely to appear near the top of your list of variables that are most highly correlated with giving, but it certainly won’t hurt to throw some into a regression analysis. Some variables will be more or less valuable depending on what you’re trying to predict. Some of these are negative predictors; that’s hardly a bad thing, as negative predictors will help to further differentiate the prospect pool, allowing your best prospects to stand out from the crowd.

Here we go:

Does your institution have a records researcher? When mail is returned as undeliverable to the alumni office, this person is busy coding alumni as “lost”, which marks them for later research. These codes may persist in your database after the alum is found, or they might be replaced with another code. In either case, I’ve found that alumni who allow themselves to become lost are less likely to give. A great negative predictor.

Does your alumni magazine have a “green delivery” option? Some alumni opt to access their magazine exclusively by electronic means, as a PDF download perhaps. Mailing preferences are tracked in your database, and often any sort of stated preference is a predictor.

You may already be using ‘number of phonathon refusals’ as a variable, but does your calling program record the reasons for refusal? “Financial reasons” might be a negative predictor, but not all reasons have to be negative. I’ve found that alumni who refuse because they want to handle the donation on their own (for example, mail a cheque when and if they feel like it) are excellent donors. They’re just rather phone-averse.

What about cross-references? We record family relationships among alumni – even grandparent/grandchild and in-laws. I’ve found ‘number of cross-references’ to be a significant predictor.

Alumni who want to be excluded from affinity programs (credit cards, insurance etc.) may be coded in your database so they do not receive unwanted mailings for those products. A negative predictor.

There might be a weird variable or two lurking in people’s names. For certain models, I’ve found that having a first or middle name that consists of a single initial is a positive predictor. This is somewhat correlated with age, but even after adding ‘class year’ to my regression, this variable will still improve the fit of the model. As well, Peter Wylie has written about the character length of an entire name (Prefix, First, Middle, Last, Suffix) being a predictor. Try it.
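If you want to try the name variables, here is a minimal sketch of how they might be derived. The field names, the function names, and the choice to count only the characters in each non-empty name part (ignoring the spaces between parts) are all my assumptions for illustration; Peter Wylie's exact definition of name length may differ.

```python
import re

# One letter, optionally followed by a period: "S" or "S."
_SINGLE_INITIAL = re.compile(r"[A-Za-z]\.?")


def is_single_initial(name_part):
    """True if the name part is just one letter, with or without a period."""
    return bool(_SINGLE_INITIAL.fullmatch((name_part or "").strip()))


def name_features(prefix="", first="", middle="", last="", suffix=""):
    """Build indicator and length variables from the name fields."""
    parts = [p.strip() for p in (prefix, first, middle, last, suffix)
             if p and p.strip()]
    return {
        "first_is_initial": int(is_single_initial(first)),
        "first_or_middle_is_initial": int(
            is_single_initial(first) or is_single_initial(middle)
        ),
        # ASSUMPTION: length = total characters across non-empty parts.
        "name_length": sum(len(p) for p in parts),
    }


print(name_features(first="S.", middle="John", last="Doe"))
```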

A year or so ago, I figured out how to query the database to easily retrieve the number of address updates for each alum. This only works when your records personnel create a new address record every time, instead of replacing the previous record. If an alum keeps their alma mater informed of their whereabouts, they’re probably more engaged – and more likely to give (and attend events). Ditto for number of phone updates and number of employment updates.
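The counting itself is simple once you have the address history in hand. This sketch assumes a hypothetical extract with one row per address record, as (alum ID, effective date) pairs, and treats every record after the first as an update. Your own database's tables and columns will look different.

```python
from collections import Counter


def count_updates(history_rows):
    """Number of updates per alum: address records beyond the first one."""
    records_per_alum = Counter(alum_id for alum_id, _ in history_rows)
    return {alum_id: n - 1 for alum_id, n in records_per_alum.items()}


# HYPOTHETICAL extract: one row per address record ever created.
address_history = [
    ("A1", "2001-06-01"),
    ("A1", "2005-09-15"),
    ("A1", "2009-03-02"),
    ("B2", "2003-11-20"),
]

address_updates = count_updates(address_history)
print(address_updates)
```

The same function would work unchanged on phone or employment history, provided those records are also added rather than overwritten.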

The previous idea is related to “class notes” for the alumni magazine. Some universities enter alumni submissions into their database so they can run their notes as a report. We don’t, but I wish we did, because I know ‘number of notes’ would be a predictor.

This might be the tip of the iceberg. Think of all the other great sources of variables that result from normal daily processes (gift processing data, online social networking data, automated call centre data, survey data …), have those conversations with your colleagues, and figure out how to get your hands on those variables for testing.
