Validation against a holdout sample allows us to pick the best model for predicting a behaviour of interest. (See Thoughts on model validation.) But I also like to do what I call “validation after the fact.” At the end of a fundraising period, I want to see how people who expressed that behaviour broke down by the score they’d been given.
This isn’t really validation, but if you create some charts from the results, it’s the best way to make the case to others that predictive modeling works. More importantly, doing so may provide insights into your data that will lead to improvements in your models in their next iterations.
This may be most applicable in Annual Fund, where the prospects solicited for a gift may come from a wide range of scores, allowing room for comparison. But my general rule is to compare each score level by ratios, not counts. For example, if I wanted to compare Phonathon prospects by propensity score, I would compare the percentage (ratio) of each score group contacted who made a pledge or gift, not the number of prospects who did so. Why? Because if I actually used the scores in solicitation, higher-scoring prospects will have received more solicitation attempts on average. I want results to show differences among scores, not among levels of intensity of solicitation.
So when the calling season ended recently, I evaluated my Phonathon model’s performance, but I didn’t study the one model in isolation: I compared it with a model that I initially rejected last year.
It sounds like I’m second-guessing myself. Didn’t I pick the very best model at the time? Yes, but … I would expect my chosen model to do the best job overall, but perhaps not for certain subgroups — donor types, degree types, or new grads. Each of these strata might have been better described by an alternative model. A year of actual results of fundraising gives me what I didn’t have last year: the largest validation sample possible.
My after-the-fact comparison was between a binary logistic regression model which I had rejected, and the multiple linear regression model which I actually used in Phonathon segmentation. As it turned out, the multiple linear regression model did prove the winner in most scenarios, which was reassuring. I will spare you numerous comparison charts, but I will show you one comparison where the rejected model emerged as superior.
Eight percent of never-donor new grads who were contacted made a pledge. (My definition of a new grad was any alum who graduated in 2008, 2009, or 2010.) The two charts below show how these first-time donors broke down by how they were scored in each model. Due to sparse data, I have left out score level 10.
Have a look, and then read what I’ve got to say.
Neither model did a fantastic job, but I think you’d agree that predicting participation for new grads who have never given before is not the easiest thing to do. In general, I am pleased to see that the higher end of the score spectrum delivered slightly higher rates of participation. I might not have been able to ask for more.
The charts appear similar at first glance, but look at the scale of the Y-axis: In the multiple linear regression model, the highest-scoring group (9, in this case) had a participation rate of only 12%, and strangely, the 6th decile had about the same rate. In the binary logistic regression model, however, the top scoring group reached above 16% participation, and no one else could touch them. The number of contacted new grads who scored 9 is roughly equal between the models, so it’s not a result based on relatively sparse data. The BLR model just did a better job.
There is something significantly different about either new grads, or about never-donors whom we wish to acquire as donors, or both. In fact, I think it’s both. Recall that I left the 10s out of the charts due to sparse data — very few young alumni can aspire to rank up there with older alumni using common measures of affinity. As well, when the dependent variable is Lifetime Giving, as opposed to a binary donor/nondonor state, young alumni are nearly invisible to the model, as they are almost by definition non-donors or at most fledgling donors.
My next logical step is a model dedicated solely to predicting acquisition among younger alumni. But my general point here is that digging up old alternative models and slicing up the pool of solicited prospects for patterns “after the fact” can lead to new insights and improvements.