CoolData blog

15 January 2013

The cautionary tale of Mr. S. John Doe

A few years ago I met with an experienced Planned Giving professional who had done very well over the years without any help from predictive modeling, and was doing me the courtesy of hearing my ideas. I showed this person a series of charts. Each chart showed a variable and its association with the condition of being a current Planned Giving expectancy. The ultimate goal would have been to consolidate these predictors together as a score, in order to discover new expectancies in that school’s alumni database. The conventional factors of giving history and donor loyalty are important, I conceded, but other engagement-related factors are also very predictive: student activities, alumni involvement, number of degrees, event attendance, and so on.

This person listened politely and was genuinely interested. And then I went too far.

One of my charts showed that there was a strong association between being a Planned Giving expectancy and having a single initial in the First Name field. I noted that, for some unexplained reason, having a preference for a name like “S. John Doe” seemed to be associated with a higher propensity to make a bequest. I thought that was cool.

The response was a laugh. A good-natured laugh, but still — a laugh. “That sounds like astrology!”

I had mistaken polite interest for a slam-dunk, and in my enthusiasm went too far out on a limb. I may have inadvertently caused the minting of a new data-mining skeptic. (Eventually, the professional retired after a successful career in Planned Giving, having managed to avoid hearing much more about predictive modeling.)

At the time, I had hastened to explain that what we were looking at were correlations — loose, non-causal relationships among various characteristics, some of them non-intuitive or, as in this case, seemingly nonsensical. I also explained that the linkage was probably due to other variables (age and sex being prime candidates). Just because it’s without explanation doesn’t mean it’s not useful. But I suppose the damage was done. You win some, you lose some.

Although some of the power (and fun) of predictive modeling rests on the sometimes non-intuitive and unexplained nature of predictor variables, I now think it’s best to frame any presentation to a general audience in terms of what they think of as “common sense”. Limiting, yes. But safer. Unless you think your listener is really picking up what you’re laying down, keep it simple, keep it intuitive, and keep it grounded.

So much for sell jobs. Let’s get back to the data … What ABOUT that “first-initial” variable? Does it really mean anything, or is it just noise? Is it astrology?

I’ve got this data set in front of me: all alumni with at least some giving in the past ten years. I see that 1.2% of all donors have a first initial at the front of their name. When I look at the subset of the records that are current Planned Giving expectancies, I see that 4.6% have a single-initial first name. In other words, Planned Giving expectancies are almost four times as likely as donors in general to have a name that starts with a single initial. The data file is fairly large (more than 17,000 records) and the difference is statistically significant.
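
For readers who want to see what a check like this looks like, here is a minimal sketch of a two-proportion z-test using only Python's standard library. The counts are invented so that they roughly reproduce the percentages quoted above (the post gives only "more than 17,000" records and "less than 500" expectancies), and the function name is hypothetical. It compares expectancies against all other donors, and it is one common way to test such a difference, not necessarily the test behind the result reported here.

```python
import math

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two independent proportions."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a, p_b, z, p_value

# ASSUMED counts, for illustration only (not the real data set).
expectancies = 480                # "less than 500" in the post
expectancies_with_initial = 22    # roughly 4.6% of 480
other_donors = 16_520             # so the file totals 17,000 records
other_donors_with_initial = 182   # so 204 of 17,000 (1.2%) overall

p_exp, p_other, z, p_value = two_proportion_z_test(
    expectancies_with_initial, expectancies,
    other_donors_with_initial, other_donors,
)

print(f"Expectancies with a leading initial: {p_exp:.1%}")
print(f"Other donors with a leading initial: {p_other:.1%}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```

With only a couple of dozen records in the smallest cell, the normal approximation is on the edge of its comfort zone, so an exact test would be a reasonable cross-check.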

What can explain this? When I think of a person whose first name is an initial and who tends to go by their middle name, the image that comes to mind is that of an elderly male with a higher-than-average income, like a retired judge, say. For each of the variables ‘Age’ and ‘Male’, there is in fact a small positive association with having a one-character first name. Yet, when I account for both ‘Age’ and ‘Male’ in a regression analysis, the condition of having a leading initial is still significant and still has explanatory power for being a Planned Giving expectancy.
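
The idea of "accounting for" age and sex can be illustrated without a full regression. The sketch below swaps in a simpler technique, a Mantel-Haenszel stratified odds ratio: split the donors into age-by-sex groups, then pool the within-group odds ratios. This is not the regression described above, and every count in it is made up purely to show the mechanics; the age bands and function names are likewise assumptions.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a single 2x2 table."""
    return (a * d) / (b * c)

def mantel_haenszel_odds_ratio(tables):
    """Pooled odds ratio across strata (Mantel-Haenszel estimator)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator

# INVENTED counts per stratum, for illustration only. Each tuple is:
# (initial & expectancy, initial & not expectancy,
#  no initial & expectancy, no initial & not expectancy)
strata = {
    "younger, female": (1, 19, 39, 4941),
    "younger, male": (3, 42, 57, 4898),
    "older, female": (6, 44, 144, 3306),
    "older, male": (12, 77, 218, 3193),
}

# Crude odds ratio: collapse all strata into one table, ignoring age and sex.
crude_table = tuple(sum(column) for column in zip(*strata.values()))
crude = odds_ratio(*crude_table)

# Adjusted odds ratio: compare like with like inside each stratum, then pool.
adjusted = mantel_haenszel_odds_ratio(strata.values())

print(f"Crude odds ratio:    {crude:.2f}")
print(f"Adjusted odds ratio: {adjusted:.2f}")
```

In these invented numbers the adjusted odds ratio is smaller than the crude one but still well above 1, which is the pattern described above: age and sex explain some of the association, but not all of it.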

I can’t think of any other underlying reasons for the connection with Planned Giving. Even when I continue to add more and more independent variables to the regression, this strange predictor hangs in there, as sturdy as ever. So, it’s certainly interesting, and I usually at least look at it while building models.

On the other hand … perhaps there is some justification for the verdict of “astrology” (that is, “nonsense”). The data set I have here may be large, but the number of Planned Giving expectancies is fewer than 500, and 4.6% of 500 is only about 23 records. Regardless of whether p ≤ 0.0001, it could still be just one of those things. I’ve also learned that complex models are not better than simple ones, particularly when trying to predict something hard like Planned Giving propensity. A quirky variable that suggests no potential causal pathway makes me wary of the possibility of overfitting the noise in my data and missing the signal.
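
To get a feel for how little a couple of dozen records can pin down, here is a minimal sketch of a 95% Wilson score interval around that 4.6%. It assumes 23 of 500 expectancies have a leading initial; the post says only that there are fewer than 500 expectancies, so 500 is an upper bound and the real interval would be somewhat wider. The function name is hypothetical.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score confidence interval for a proportion (z=1.96 gives ~95%)."""
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# ASSUMED counts: 4.6% of an upper-bound 500 expectancies.
hits, n = 23, 500
low, high = wilson_interval(hits, n)

print(f"Observed share: {hits / n:.1%}")
print(f"95% Wilson interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the plausible range spans several percentage points, so the "almost four times" figure should be read as a rough estimate. The interval also says nothing about whether the pattern would hold up in another school's data.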

Maybe it’s useful, maybe it’s not. Either way, whether I call it “cool” or not will depend on who I’m talking to.
