CoolData blog

7 June 2010

Beyond the simple score

Filed under: John Sammis, Model building, Peter Wylie — Tags: — kevinmacdonell @ 7:28 am

For anyone needing a non-technical introduction to data mining for fundraising, there is no better book than Peter Wylie’s “Data Mining for Fundraisers.” His simple-score method is easy to do and easy to explain. But it’s not the method I use. Today I want to talk about the limitations of the simple-score method.

Actually, what I view as a “limitation” today was the very thing that attracted me to his book, and to data mining, in the first place: The accessibility of the writing, and the simplicity of the method. It all seems obvious to me now, but back then, when the concepts were new, I needed to read the book through several times in order to completely get it. When I got it, I was able to describe it, and to convince others in my organization that we needed to do it. Had the concepts been more abstruse, I would not have learned anything. Even if I had, I would have had limited success selling data mining to anyone. In fact, Wylie’s book laid the foundation for everything else that came after.

These are the main limitations, however, as I see them today:

1. Too few score levels

If you’ve got six predictor variables, you’ll end up with seven score levels (counting zero as a possible score). If you want to segment the pools in your Annual Fund, that might be adequate, but just barely. You could introduce more variables, of course, but there are a couple of things standing in your way. First, very few predictor variables will pass the foolproof threefold test that Wylie prescribes in his book; in order to boost the number of usable variables you may have to bend the “rules” and accept a predictor even if it does not pass all three tests. However, there is a limit to how many variables you can accept: You will quickly cause problems for yourself due to limitation number 2 …
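To make the arithmetic concrete, here is a minimal sketch of the simple-score idea: each predictor is a yes/no flag worth exactly one point, so six positive predictors can only ever produce the seven score levels 0 through 6. The field names below are hypothetical examples, not Wylie's.

```python
# Hypothetical positive predictors, each a yes/no flag worth +1.
POSITIVE_PREDICTORS = [
    "email_present",
    "business_phone_present",
    "event_attendee",
    "married",
    "employer_present",
    "job_title_present",
]

def simple_score(record: dict) -> int:
    """Sum one point for each positive predictor that is true."""
    return sum(1 for flag in POSITIVE_PREDICTORS if record.get(flag))

# Six flags means the score can only take values 0 through 6:
# seven segments, no matter how many constituents you have.
alum = {"email_present": True, "married": True, "event_attendee": False}
print(simple_score(alum))  # 2
```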

2. Subjective weightings

In the simple-score method, all positive predictors have a score value of 1, and all negative predictors have a value of -1. In other words, it assumes that all predictors are of equal value. We know that this cannot be true, but ignore it for the sake of simplicity. The foolproof nature of the tests we run for choosing our predictors does protect us from accepting a trivial predictor into the model, but the risks we run are twofold: One, we end up over- or under-counting the significance of individual predictors, and two, we may end up double-counting certain effects.

For example, the two variables “employer present” and “job title present” are obviously closely related, but both might pass the three tests with flying colours. They aren’t perfectly alike, so should we use them both in the model? Knowing the data, we would probably choose to use only one of the two. But what about other related variables that aren’t so obvious? “Marital status is single” will be highly correlated with “Class Year”, and “Job title present” will also be correlated with “Business phone present.” The simple-score method offers no way to account for these interactions among variables.
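One way to catch this kind of overlap before it double-counts in the model is to measure the correlation between each pair of candidate predictors. For two binary flags this is the phi coefficient (Pearson correlation applied to 0/1 data); a high value is a signal to keep only one of the pair. A sketch, with made-up data for illustration:

```python
import math

def phi(x, y):
    """Phi coefficient between two lists of 0/1 values."""
    n = len(x)
    n11 = sum(1 for a, b in zip(x, y) if a and b)          # both present
    n10 = sum(1 for a, b in zip(x, y) if a and not b)      # only x
    n01 = sum(1 for a, b in zip(x, y) if not a and b)      # only y
    n00 = n - n11 - n10 - n01                              # neither
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Invented records: employer and job title usually arrive together.
employer  = [1, 1, 1, 0, 0, 1, 0, 0]
job_title = [1, 1, 0, 0, 0, 1, 0, 0]
print(round(phi(employer, job_title), 2))  # 0.77 -- strongly related
```

With a correlation this high, the two flags carry largely the same information, and a simple score that counts both is effectively weighting that one underlying trait double.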

3. Limited fun for the data geek

There’s a bit of the thrill of the hunt in identifying some cool new predictor variable. A person can get quite creative in coming up with new ideas for things to test. (In a previous post, I listed 85 potential predictor variables I have tested for use in my models.) As I said, the simple-score method keeps one mostly out of trouble by disallowing variables that aren’t obviously predictive, but the limitation can get a little boring for the intrepid data explorer. Yes, the core variables that do three-quarters of the predictive work are limited to perhaps eight or ten, the same ones you’ll make use of in a simple-score model; but there is real fun in squeezing out just a little more insight by discovering those subtler variables hiding in your data.

Conclusion

This is not a judgment on the book — only a reminder that more advanced techniques lie beyond. After all, it was Peter Wylie (and John Sammis) who taught me how to use multiple linear regression to create the kinds of models I wanted. If the simple-score method answers your needs right now, as it did for me years ago, then use it!
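For readers curious what that next step looks like in practice, here is a bare-bones sketch of multiple linear regression: instead of assigning every predictor a flat +1 or -1, the regression lets the data itself weight each predictor. This is not Wylie and Sammis's worked method, just a minimal illustration with an invented toy dataset.

```python
import numpy as np

# Rows = constituents; columns = hypothetical binary predictors
# (say: email present, event attendee, employer present).
X = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)
y = np.array([500.0, 250.0, 100.0, 0.0, 300.0, 50.0])  # e.g. lifetime giving

# Add an intercept column and solve the least-squares problem.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coefs)  # intercept, then one fitted weight per predictor
```

The fitted coefficients replace the subjective ±1 weightings of limitation number 2, and because all predictors are fit together, the regression also adjusts for the correlated-variable problem of limitation number 1's overflow into number 2.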

1. So here’s a question… when you research someone and look up their contact info, do you have a separate field in your database for that? (as opposed to plugging the info into the regular contact fields). I have yet to see all that many databases that were designed with the forethought to have one field for user-provided info, and one for researched info. While Wylie’s simple score is, like you said, a great starting place, I think that any organization that’s been doing prospect research for a while will have “tainted” the data. Maybe he’s written something more recently, but the last I saw, Wylie wasn’t really pointing out that caveat. Granted, it should be common sense, because he clearly states his logic (user-provided information indicates that the user *wanted* to provide it. The mere act of having them respond is a big deal when most people just toss mailers in the trash). I’ve recently come across two smaller nonprofits, however, that didn’t think about this, and a goodly chunk of their database had been completed by interns populating their mailing list details via Google lookups.

Comment by Jen Olomon — 7 June 2010 @ 9:54 am

• This is a common objection, but not one that I really understand. If the contact information data is “tainted,” as you say, then the correlation with Giving will be affected. With heavy research activity, the correlation will be diluted over time because the connection between the presence or absence of data and Giving will be broken. And if the correlation isn’t significant, then you just don’t use contact info data as a predictor — end of story. There really is no risk that non-voluntary predictors will end up being used in your model.

This research is a good thing! We need fresh contact info coming in to keep our mail and calling pools full, because they are continuously draining due to normal attrition. That the activity might rob our models is less of a concern, in my opinion.

We can, as you suggest, have the best of both: If the database is sophisticated enough, yes, there should be some extra coding for “source”. In a university database one might typically find a “research” code to identify alumni found by the alumni office after having had their mail returned undeliverable. The presence of this code indicates that the alum had allowed himself or herself to become lost, and not surprisingly that tends to be negatively correlated with giving. Knowing this, you can make the distinction between researched data and voluntarily provided data, and get two great predictor variables — one negative, the other positive.

But in large databases, the percentage of phone numbers, etc. that had to be researched is NOT a majority of records, and the correlation with giving is quite significant overall. If “source” codes are not available, or time-consuming to obtain for some reason, then the mere presence or absence of the data will work just fine.

Small nonprofits with contact info populated from researched or purchased sources, well, that’s a whole other thing. Yes, they really do need to track other things if they want to gauge affinity. In those cases, having the data is probably unrelated to donor engagement.

Does that make sense?

Comment by kevinmacdonell — 7 June 2010 @ 10:23 am