Guest post by Kelly Heinrich, Assistant Director of Prospect Management and Analytics, Stanford University
Last August, about two months into a data analyst position with a university’s development division, I had the task to build a predictive model for the Office of Gift Planning (OGP). The OGP wanted a tool to help them focus on the constituents who are most likely to make a planned gift. I wanted to identify a few hundred of the best planned giving prospects who I could prioritize by the probability of donating. After a bit of preliminary research, I chose: 1) 50 years of age and older and 2) inclusion in a recent wealth screening as the criteria for the study population. This generated a file of 133,000 records; 582 of them were planned gift donors. I’ve worked with files larger than this and did not expect a problem. However, that turned out to be a mistake because the planned gift donors, who exhibited the target behavior, comprised 0.4% of the population, a proportion so small it can be considered rare. I’ll explain more about that later; first I want to describe the project as it developed.
I decided to use logistic regression with the dependent variable being either “made a planned gift” or “has not made a planned gift”. I cleaned the data and identified some strong relationships between the variables. After trying several combinations for the regression model, I had one with a Nagelkerke of .24, which is relatively good. (Nagelkerke is like a pseudo R squared; it can be loosely interpreted as the variability of the dependent variable that is accounted for by the model’s independent variables.) However, when I applied the algorithm to the study population, only 31 constituents without a planned gift and only 11 planned giving donors were identified as having a probability of giving of .5 or greater. I lowered the probability threshold of giving to .2 or greater and 105 non-planned givers and 52 planned gift donors fell into this range. This was still disappointing.
Desperate to identify more new potential prospects, I explored more criteria to narrow the study population and built three successive models. For the purpose of the follow-up exploratory research and this article, I re-built all four models using the same independent variables to easily compare their outcomes. Here’s a summary of the four models:
Models B, C, and D are all subsets of the original data set. Each model has advantages and disadvantages to it and I was uncertain how to evaluate them against one another. For example, each additional filtering criterion resulted in losing part of the target population, meaning that I systematically eliminated constituents with characteristics that are in fact associated with making a planned gift. I scored everyone who was identified with a probability of .2 or greater in any of the models by the number of models in which they were identified. I’m not unhappy with that solution, but since then I’ve been learning about better methods for targeting rare behavior.
If the OGP was interested only in prioritizing the prospects already in their pool of potential planned giving donors, model D would serve their need. However, we wanted to identify the best potential planned giving prospects within the database. If we want to uncover untapped potential in an ever-growing database, we need to explore methods on how to target rare behavior. This seems especially important in our field where 1) donating, in general, is somewhat rare and 2) donating really generous gifts is rarer. Better methods of targeting rare behavior will also be useful for modeling for special initiatives and unique kinds of gifts.
As I’ve been learning, logistic regression suffers from small sample bias when the target behavior is rare, relative to the study population. This helps explain why applying the algorithm to the original population resulted in very few new prospects–even though the model had a decent Nagelkerke of .24. Some analysts suggest using alternative sampling methods when the target behavior comprises less than 5% of the study. (See endnote.) Knowing that the planned gift donors in my original project comprised only 0.4% of the population, I decided to experiment with two new approaches.
In both of the exploratory models, I created the study population size so planned gift donors would comprise 5 percent. First, I generated a study population by including all 582 of the planned gift donors and a random selection of 11,060 non-planned-gift constituents (model E). Then, I applied the algorithm from that population to the entire non-planned-gift population of 132,418. In the second approach (model F), the planned gift population was randomly split into two equal size groups of 291. I also randomly selected 5,530 non-planned-gift constituents. To build the regression model, I combined one of the planned gift donor groups (of 291) with 5,530 non-planned-gift constituents. I then tested the algorithm on the holdout sample (the other planned giving group of 291 with 5,530 non-planned-gift constituents). Finally, I applied the algorithm to the entire original population of 133,000. Here are the results:
Using the same independent variables as in models A through D, model E had a Nagelkerke of .39 and model F .38, which helps substantiate that the independent variables are useful predictors for planned giving. Models E and F were more effective at predicting the planned givers (129 and 123 respectively with a probability of giving greater than or equal to .5) compared to model A (11), i.e. more than ten times as many. The sampling techniques have some advantages and disadvantages. The disadvantage is that by reducing the non-planned-gift population, it loses some of its variability and complexity. However, the advantage, in both models E and F, is that 1) the target population maintains its complexity, 2) new prospects are not limited by characteristic selection (the additional criteria that I used to reduce the population in models B, C, and D), which increases the likelihood of identifying constituents who were previously not on the OGP’s radar, and 3) the effects of the sample bias seem to be reduced.
It’s important to note that I displayed the measures (Nagelkerke and estimated probabilities) from the exploratory models and populations purely for comparison purposes. Because the study population is manipulated in the exploratory methods, the probability of giving should not be directly interpreted as actual probabilities. However, they can be used to prioritize those with the highest probabilities and that will serve our need.
To explore another comparison between models A and F, I ranked all 133,000 records in each. I then sorted all the records in model F in descending order. I took the top 1,000 records from model F and then ran correlation between the rank of model A and the rank of model F; they have a correlation of .282, meaning there is a substantial difference between the ranked records.
Over the last several months, Peter Wylie, Higher Education Consultant and Contractor, and I have been exchanging ideas on this topic. I thank him for his insight, suggestions, and encouragement to share my findings with our colleagues.
It would be helpful to learn about the methods you’ve used to target rare behavior. We could feel more confident about using alternative methods if repeat efforts produced similar outcomes. Furthermore, I did not have a chance to evaluate the prospecting performance of these models, so if you have used a method for targeting rare behavior and have had an opportunity to assess its effectiveness, I am very interested in learning about that. I welcome ideas, feedback, examples from your research, and questions in regard to this work. Please feel free to contact me at email@example.com.
The ideas for these alternative approaches are adapted from the following articles:
- Chawla, Nitesh, Aleksandar Lazarevic, Lawrence Hall, and Kevin Bowyer. 2003. “SMOTEBoost: Improving Prediction of the Minority Class in Boosting.” http://www3.nd.edu/~dial/papers/ECML03.pdf
- King, Gary and Lanche Zeng. 2001. “Logistic Regression in Rare Events Data.” http://gking.harvard.edu/files/abs/0s-abs.shtml
Kelly Heinrich has been conducting quantitative research and analysis in higher education development for two and a half years. She has recently accepted a position as Assistant Director of Prospect Management and Analytics with Stanford University that will begin in June 2013.