
Your data might be old, but that doesn't mean it isn't predictive. (Image used via Creative Commons licence.)
Do you sometimes exclude variables from your model because you feel the data is just too old to be useful? I wouldn’t be too quick. For some data at least, there’s no expiry date when it comes to predictive modeling.
I’ve heard of some modelers using old wealth-screening data and getting good results. It may be too out of date for the Major Gift people to use, but if there’s still a correlation with giving, it does your model no harm to make use of it. Just be sure the data you’re using is capacity-related and not itself a predictive model, or existing donors will score high and high-likelihood non-donors will be submerged.
Old and out-of-date contact information works just fine. I always remind people that a phone number doesn’t have to be valid to be predictive. At some point, that alum provided that number (or email, or cell phone number), and the fact you have it at all is more important than whether you can still reach someone with it. For email in particular, probably a significant portion of your information is useless from a communication point of view, but will still be helpful in prediction. I am not talking about lost alumni — I never include those in my models. I mean alumni that we assume to be contactable.
I test whether the presence of, say, a business phone number is correlated with giving, but I also test the COUNT of business phone numbers. In order to do this, your database must retain the history of previous numbers. When an update is made, ideally a new record is created instead of overwriting the old one. This allows one to query for the NUMBER of update records — which I’ve often found correlates with giving. We know all those previous numbers were disconnected years ago, but their presence indicates a history of ongoing engagement.
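Here is a minimal sketch of that test in plain Python, assuming a history table that keeps one row per business phone record ever stored. The table layout, the alum IDs, and the `is_donor` flag are all made up for illustration; substitute whatever your own database provides.

```python
from collections import Counter
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(var_x * var_y)

# Hypothetical history table: one row per business phone record ever stored,
# including numbers that were disconnected years ago.
phone_history = [
    ("A1", "555-0101"), ("A1", "555-0102"), ("A1", "555-0103"),
    ("A2", "555-0201"),
    ("A4", "555-0401"), ("A4", "555-0402"),
]
# Hypothetical target: 1 if the alum has ever given, else 0.
is_donor = {"A1": 1, "A2": 0, "A3": 0, "A4": 1, "A5": 0}

counts = Counter(alum_id for alum_id, _ in phone_history)
alumni = sorted(is_donor)
phone_count = [counts.get(a, 0) for a in alumni]    # COUNT of records
has_phone = [1 if c > 0 else 0 for c in phone_count]  # simple presence flag
donor = [is_donor[a] for a in alumni]

r_flag = pearson(has_phone, donor)
r_count = pearson(phone_count, donor)
print(f"presence flag r = {r_flag:.2f}, record count r = {r_count:.2f}")
```

With only five invented alumni the numbers themselves mean nothing. The point is the shape of the comparison: build both the flag and the count from the same history table, then see which one correlates more strongly with giving in your own data.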
What about event attendance? We might reasonably assume that an alum who attended a campus event a decade ago and never returned is far less likely to give than someone who visited just last year. Some schools have attendance data going back many years. Is any of that still relevant? My answer is “probably.” I once got to study Homecoming attendance data for a university that had done a good job of recording it in the database going back 10 years. I already knew that Homecoming attendance was predictive of giving for this university, but I was surprised to discover that one-time, long-ago attendance was just as predictive as recent attendance.
This may not hold true across institutions. You may want to break historical event attendance data into separate year categories to see whether they vary in their correlation with your predicted value. If they don’t differ significantly, then your most powerfully predictive variable will probably be a simple count of the number of events attended: repeat attendance, in my experience, is predictive in the extreme.
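One way to run that comparison is sketched below. It assumes attendance is stored as one (alum, year) row per event. The three-year cutoff for "recent", the reference year, and all of the data are my own placeholders, not findings from any real institution.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

REFERENCE_YEAR = 2024  # assumed "current" year for this sketch
RECENT_WINDOW = 3      # assumed cutoff, in years, for "recent" attendance

# Hypothetical attendance records: (alum_id, year attended). Invented data.
attendance = [
    ("A1", 2015), ("A1", 2019), ("A1", 2023),
    ("A2", 2014),
    ("A3", 2023),
    ("A5", 2016), ("A5", 2022),
]
# Hypothetical target: 1 if the alum has ever given, else 0.
is_donor = {"A1": 1, "A2": 1, "A3": 0, "A4": 0, "A5": 1, "A6": 0}

def band(year):
    """Label an attendance year as 'recent' or 'older'."""
    return "recent" if REFERENCE_YEAR - year <= RECENT_WINDOW else "older"

alumni = sorted(is_donor)
features = {a: {"recent": 0, "older": 0, "total": 0} for a in alumni}
for alum_id, year in attendance:
    features[alum_id][band(year)] = 1  # flag: attended at least once in band
    features[alum_id]["total"] += 1    # count: all events ever attended

donor = [is_donor[a] for a in alumni]
correlations = {
    name: pearson([features[a][name] for a in alumni], donor)
    for name in ("recent", "older", "total")
}
for name, r in correlations.items():
    print(f"{name:>6}: r = {r:.2f}")
```

If the band-level correlations come out close to one another on your real data, the `total` column is the one to carry forward into the model.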
The enemy of relevance when it comes to data isn’t how old it is, but how incomplete or biased it is. For example, if you have good data on involvement on athletic teams up until 1985, and then nothing after that — that’s a problem. In that case, your variables for athletic involvement will be more informative about how old your alumni are than how engaged they might be. If you build a model that is restricted to older alumni, you’ll be fine, but if you include the entire database, ‘athletics’ will be highly correlated with ‘age’ and may add little or no predictive value.
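A quick way to spot this kind of gap is to look at how often the variable is populated in each class decade. The sketch below assumes each alum record carries a class year and a yes/no athletics flag; the records are invented to mimic a database where athletics coding stopped in the mid-1980s.

```python
from collections import defaultdict

# Hypothetical records: (class_year, has_athletics_code). Invented data.
alumni = [
    (1972, True), (1978, False),
    (1981, True), (1984, True), (1988, False),
    (1993, False), (1999, False),
    (2006, False), (2012, False),
]

def coverage_by_decade(records):
    """Return {decade: share of alumni in that decade with the flag set}."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for class_year, has_flag in records:
        decade = class_year // 10 * 10
        totals[decade] += 1
        flagged[decade] += int(has_flag)
    return {d: flagged[d] / totals[d] for d in sorted(totals)}

coverage = coverage_by_decade(alumni)
empty_decades = [d for d, share in coverage.items() if share == 0]

for decade, share in coverage.items():
    print(f"{decade}s: {share:.0%} coded as athletes")
print("decades with no athletics data:", empty_decades)
```

A run of decades at zero after a stretch of healthy coverage does not prove the data was never recorded, but it is a strong hint that the variable is tracking age rather than engagement.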
What can you do? I see three strategies for addressing issues with older data, each one being appropriate in different situations.
- Leave it alone.
- Input the data.
- Impute the data.
We should leave the data alone when we know that the alum is the one who is primarily responsible for the presence or absence of data. All alumni who are not lost have more or less the same opportunity to provide us their contact and employment information. When all alumni have equal ability to influence some specific data point, the absence of data at that point is not a problem, but rather indicative of an attitude. No intervention is required. (A complicating factor is contact information that is purchased and appended — you should consult the ‘source’ code, if you have it, to distinguish between alum-provided data and data from other sources. The same might go for contact information that has been researched — but probably the number of records that have been researched is small in comparison with the entire database, and not a significant confounding factor.)
We should input the data when we know it to exist outside the database, when it is based on simple historical fact, and when it is practical to do so. An example would be student involvement in athletics. Unless capturing this information is someone’s explicit responsibility, the data will often be spotty; some class years will be covered and others won’t. Someone has this information (it’s probably in a file cabinet in the Athletics Office or, as a last resort, there’s always the yearbook); it just hasn’t been entered into the database. It’s a project, maybe a big project, but it might be quite doable with the help of a student or two. Would you go to this trouble just for the sake of predictive modeling? No. The risk is that the variable would still not be predictive. However, it isn’t hard to see that having the data will prove useful someday, perhaps for a special appeal directed at former student athletes. (An Alumni Records office with this sort of forward-looking, project-oriented mindset is a joy to work with!)
If no data has ever been entered at all, and entering it retroactively isn’t a realistic goal, then why not just start tracking it from this point forward? It may be a long time before it becomes useful for data mining, and you’ll be long gone by then, but remember that our work rests on the shoulders of employees who have gone before, people who never heard of data mining but intuited that this or that category of data would someday prove useful.
And finally, we should impute the data when a variable is useful for prediction but excludes some sector of the alumni population through no fault of their own. Old wealth-screening data is a good example. If the data is ten years old, none of your recent graduates will have a wealth score. This might not be a problem if you’re building a Major Gift model or a Planned Giving model and excluding your younger alumni anyway, but for Annual Giving likelihood you should employ some of the techniques I discussed in previous posts on dealing with missing data. In those posts I was talking about survey data, but the idea is exactly the same. (See Surveys and missing data and More on surveys and missing data.) Essentially, the simpler techniques for imputing missing data involve substituting average values when we don’t know how an alum would have scored (or answered) had he or she had the opportunity to be included.
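As a rough illustration of the simplest approach, the sketch below fills in the average of the known wealth scores for alumni who were never screened. The scores and IDs are invented. The extra `was_missing` flag is my own addition rather than something from the earlier posts; it lets the model see which values were filled in.

```python
from statistics import mean

# Hypothetical wealth scores; None marks alumni who graduated after the
# screening was done and so never had a chance to receive a score.
wealth_score = {"A1": 4.0, "A2": None, "A3": 2.0, "A4": None, "A5": 3.0}

# Average of the scores we actually have.
known = [s for s in wealth_score.values() if s is not None]
average = mean(known)

# Substitute the average wherever the score is missing.
imputed = {
    alum: (score if score is not None else average)
    for alum, score in wealth_score.items()
}
# Optional indicator variable: 1 where the value was imputed, else 0.
was_missing = {alum: int(score is None) for alum, score in wealth_score.items()}

print(imputed)
print(was_missing)
```

Mean substitution keeps unscreened alumni from being penalized with a zero, at the cost of understating the variation in the variable. That trade-off is usually acceptable when the alternative is dropping the variable entirely.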
Search far and wide across the database for your predictors, but go deep as well — backwards in time!