People stumbling on CoolData might assume that I think I’ve gathered unto myself some great corpus of data mining knowledge and that now I presume to dispense it via this blog, nugget by nugget.
Uh, well – not quite.
The reality is that I spend a lot of my time at work and at home surrounded by my books, struggling to get my arms around the concepts, and doing a good deal of head-scratching. Progress is slow, as only about ten percent of my work hours are actually spent on data mining. Questions from CoolData readers are cause for anxiety more than anything else. (Questions are welcome, of course, but sometimes advice would be better.)
As a consequence, I proceed with caution when it comes to building models for my institution. I don’t have a great deal of time for testing and tweaking, and I steer clear of creating predictive score sets that cannot be deployed with a high level of confidence.
None of that caution, however, has prevented me from having some doubts about the model I created last year for our Planned Giving program.
This model sorted all of our alumni over a certain age into percentile ranks according to their propensity to enter into a planned giving agreement with our institution. Our Planned Giving Officer is currently focused on the individuals in the 97th percentile and up. Naturally, whenever a new commitment (verbal or written) comes across the transom (unsolicited, as I think PG gifts often are), the first thing I do is check the individual’s percentile score.
A majority of the new expectancies are in the 90s, which is good, and most of those are 97 and up, which is better. When I look at the Annual Giving model scores for these same individuals, however, I see that the AG scores do a better job of predicting the Planned Giving donors than the PG scores do. That strikes me as a bit odd.
Planned Giving being a slowly-evolving process, there aren’t enough examples of new commitments to properly evaluate the model, to my satisfaction at least. But when model-building time comes around again in July and August, I’ll be making some changes.
The central issue I faced was that current commitments numbered only a little over 100. That’s not a lot of historical data to model on. I asked around for advice. One key piece of advice was to cut down on the size of the prospect pool by excluding all alumni younger than our youngest current commitment. Done.
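To make that concrete, here is a minimal sketch of the filtering step in Python, assuming the alumni data sits in a pandas DataFrame with hypothetical `birth_year` and `has_commitment` columns (those names are mine, for illustration only):

```python
import pandas as pd

# Hypothetical alumni table, one row per person.
alumni = pd.read_csv("alumni.csv")  # columns: id, birth_year, has_commitment, ...

# Find the birth year of the youngest person with a current PG commitment,
# then keep only alumni who are at least that old.
youngest_birth_year = alumni.loc[alumni["has_commitment"] == 1, "birth_year"].max()
pool = alumni[alumni["birth_year"] <= youngest_birth_year].copy()
```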
My primary interest, though, was to somehow legitimately boost the number of examples of PG donors, in order to beef up the dependent variable in a regression analysis.
Some institutions, I learned, have tried to do this by digging into data on deceased planned giving donors, going back five or ten years. (I hope I do not strain decorum with the verb I’ve selected.) Normally we model only on living individuals, but having access to more examples of this type of donor has proven helpful for some. Unfortunately, on investigation I found that the technical issues made it prohibitively time-consuming: for various reasons, I would have had to run many separate queries against the database to extract this data and merge it with that of the living population.
As luck would have it, though, around this time we received all the data from a huge, wide-ranging survey of alumni engagement we had conducted that March. One of the scale statements was specifically focused on attitudes towards leaving a bequest to our institution. The survey was non-anonymous, and a lot of positive responders to this statement were in our target age range. Bingo – I had a whole new group of “PG-oriented” individuals to add to my dependent variable. The PG model would be trained not only on current commitments, but on alumni who claimed to be receptive to the idea of planned giving.
In addition, I had the identities of a number of alumni who had attended information sessions on estate planning organized by our Planned Giving Officer.
I think all was well up to that point. What I did after that may have led to trouble.
I thought to myself, these PG-oriented people are not all of the same “value”. Surely a written gift commitment is “worth more” than a mere online survey response clicked on in haste. So I structured my dependent variable to look like this, using completely subjective ideas of what “value” ought to be assigned to each type of person:
- Answered “agree” to the PG statement in survey: 1 point
- Answered “strongly agree” to the PG statement in survey: 2 points
- Attended an estate planning session: 3 points
- Has made a verbal PG commitment: 6 points
- Has a written commitment in place: 8 points
Everyone else in the database was assigned a zero. And then I used multiple regression to create the model.
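For the record, the weighted-DV approach looked roughly like this in code form. This is only a sketch of the idea, not my actual process; the indicator columns (`survey_agree` and so on) and the predictor list are placeholders I’ve invented for illustration:

```python
import statsmodels.api as sm

# Assign the subjective point values described above. Where someone falls
# into more than one category, the later (higher-value) assignment wins.
pool["dv"] = 0
pool.loc[pool["survey_agree"] == 1, "dv"] = 1
pool.loc[pool["survey_strongly_agree"] == 1, "dv"] = 2
pool.loc[pool["attended_estate_session"] == 1, "dv"] = 3
pool.loc[pool["verbal_commitment"] == 1, "dv"] = 6
pool.loc[pool["written_commitment"] == 1, "dv"] = 8

# Multiple (OLS) regression of the weighted DV on whatever predictors
# are on hand -- class year, event attendance, contact info, etc.
predictors = ["years_since_grad", "event_count", "has_email"]  # placeholders
X = sm.add_constant(pool[predictors])
ols_model = sm.OLS(pool["dv"], X).fit()
print(ols_model.summary())
```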
This summer, I think I will tone down the cleverness with my DV.
First of all, everyone with a pro-PG orientation (if I can put it that way) will be coded “1”. Everyone else will be coded “0”, and I will try logistic regression instead of multiple regression, since it is more appropriate for a binary DV.
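Again only as a sketch, and re-using the placeholder columns from the snippet above, the revised plan looks like this:

```python
import statsmodels.api as sm

# Collapse all of the PG-oriented categories into a single 0/1 DV.
pg_flags = ["survey_agree", "survey_strongly_agree",
            "attended_estate_session", "verbal_commitment",
            "written_commitment"]
pool["dv_binary"] = (pool[pg_flags].sum(axis=1) > 0).astype(int)

# Logistic regression models the probability of the binary outcome
# directly, instead of treating a made-up point scale as a measurement.
X = sm.add_constant(pool[["years_since_grad", "event_count", "has_email"]])
logit_model = sm.Logit(pool["dv_binary"], X).fit()

# The fitted probabilities can then be binned into the percentile ranks
# the Planned Giving Officer works from.
pool["pg_score"] = logit_model.predict(X)
pool["pg_percentile"] = pool["pg_score"].rank(pct=True).mul(100)
```

One caveat worth noting: Logit will fail to converge if any predictor perfectly separates the ones from the zeros, which is easy to do by accident if a flag used to build the DV sneaks into the predictor list.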
Going back to the original model, it occurs to me that my method was based on a general misconception of what I was up to. In creating these “levels of desirability,” I ignored the role of the Planned Giving Officer. My job, as I see it now, is to deliver up the segment of alumni with the highest probability of receptivity to planned giving. It’s the PGO’s task to engage with the merely interested and elevate them to verbal, then written, agreements. In that sense, the survey responder and the donor with a written commitment in place could very well be equivalent in “value”.
The point is, it’s not in my power to make that evaluation. Therefore, this year, everyone with the earmarks of planned giving about them will get the same value: 1. I hope that results in a more statistically defensible method.
(I should add here that although I recognize my model could be improved, I remain convinced that even a flawed predictive model is superior to any assumption-based segmentation strategy. I’ve flogged that dead horse elsewhere.)
In the meantime, your advice is always appreciated.