# CoolData blog

## 28 May 2013

### Targeting rare behavior

Filed under: Planned Giving, regression — kevinmacdonell @ 5:21 am

## Guest post by Kelly Heinrich, Assistant Director of Prospect Management and Analytics, Stanford University

Last August, about two months into a data analyst position with a university’s development division, I was tasked with building a predictive model for the Office of Gift Planning (OGP). The OGP wanted a tool to help them focus on the constituents who are most likely to make a planned gift. I wanted to identify a few hundred of the best planned giving prospects whom I could prioritize by their probability of donating. After a bit of preliminary research, I chose two criteria for the study population: 1) 50 years of age or older, and 2) inclusion in a recent wealth screening. This generated a file of 133,000 records; 582 of them were planned gift donors. I’ve worked with files larger than this and did not expect a problem. That turned out to be a mistake: the planned gift donors, who exhibited the target behavior, comprised only 0.4% of the population, a proportion so small it can be considered rare. I’ll explain more about that later; first I want to describe the project as it developed.

I decided to use logistic regression with the dependent variable being either “made a planned gift” or “has not made a planned gift”. I cleaned the data and identified some strong relationships among the variables. After trying several combinations for the regression model, I had one with a Nagelkerke of .24, which is relatively good. (Nagelkerke’s R² is a pseudo R-squared; it can be loosely interpreted as the share of the variability in the dependent variable that is accounted for by the model’s independent variables.) However, when I applied the algorithm to the study population, only 31 constituents without a planned gift and only 11 planned gift donors were identified as having a probability of giving of .5 or greater. When I lowered the probability threshold to .2 or greater, 105 non-planned givers and 52 planned gift donors fell into this range. This was still disappointing.
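For readers who want to reproduce the Nagelkerke statistic, here is a minimal sketch in Python using statsmodels. The file name and predictor columns are hypothetical stand-ins for whatever variables you have on hand; the formula rescales the Cox and Snell pseudo R-squared so that its maximum attainable value is 1.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical file: one row per constituent, a 0/1 target, and a few predictors
df = pd.read_csv("constituents.csv")
X = sm.add_constant(df[["age", "total_giving", "event_attendance"]])
result = sm.Logit(df["planned_gift"], X).fit()

n = len(df)
# Cox & Snell pseudo R-squared, from the fitted and null (intercept-only) log-likelihoods
cox_snell = 1 - np.exp((2.0 / n) * (result.llnull - result.llf))
# Nagelkerke divides by the maximum value Cox & Snell can reach, so it tops out at 1
nagelkerke = cox_snell / (1 - np.exp((2.0 / n) * result.llnull))
print(f"Nagelkerke R-squared: {nagelkerke:.2f}")
```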

Desperate to identify more potential prospects, I explored additional criteria to narrow the study population and built three successive models. For the purpose of the follow-up exploratory research and this article, I re-built all four models using the same independent variables so that their outcomes could be compared easily. Here’s a summary of the four models:

Models B, C, and D are all subsets of the original data set. Each model has advantages and disadvantages, and I was uncertain how to evaluate them against one another. For example, each additional filtering criterion meant losing part of the target population: I was systematically eliminating constituents with characteristics that are in fact associated with making a planned gift. In the end, I scored everyone who was identified with a probability of .2 or greater in any of the models by the number of models in which they were identified, as sketched below. I’m not unhappy with that solution, but since then I’ve been learning about better methods for targeting rare behavior.
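That counting scheme is simple enough to sketch in a few lines of pandas. The file and column names here are hypothetical; the idea is just to credit each constituent once per model in which their estimated probability clears .2.

```python
import pandas as pd

# Hypothetical file: each model's estimated probability, one row per constituent
probs = pd.read_csv("model_probabilities.csv", index_col="constituent_id")

# Score = number of models (A through D) in which the constituent reaches .2 or greater
hit_count = (probs[["model_a", "model_b", "model_c", "model_d"]] >= 0.2).sum(axis=1)
print(hit_count.sort_values(ascending=False).head(20))
```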

If the OGP were interested only in prioritizing the prospects already in their pool of potential planned giving donors, model D would serve their need. However, we wanted to identify the best potential planned giving prospects in the entire database. If we want to uncover untapped potential in an ever-growing database, we need to explore methods for targeting rare behavior. This seems especially important in our field, where 1) donating at all is somewhat rare and 2) making really generous gifts is rarer still. Better methods of targeting rare behavior will also be useful when modeling for special initiatives and unique kinds of gifts.

As I’ve been learning, logistic regression suffers from small sample bias when the target behavior is rare relative to the study population. This helps explain why applying the algorithm to the original population resulted in very few new prospects, even though the model had a decent Nagelkerke of .24. Some analysts suggest using alternative sampling methods when the target behavior comprises less than 5% of the study. (See endnote.) Knowing that the planned gift donors in my original project comprised only 0.4% of the population, I decided to experiment with two new approaches.

In both of the exploratory models, I sized the study population so that planned gift donors would comprise 5 percent of it. First, I generated a study population consisting of all 582 planned gift donors and a random selection of 11,060 non-planned-gift constituents (model E). Then I applied the algorithm from that population to the entire non-planned-gift population of 132,418. In the second approach (model F), the planned gift population was randomly split into two equal-sized groups of 291, and I randomly selected 5,530 non-planned-gift constituents. To build the regression model, I combined one of the planned gift donor groups of 291 with the 5,530 non-planned-gift constituents. I then tested the algorithm on the holdout sample (the other planned gift group of 291 with 5,530 non-planned-gift constituents). Finally, I applied the algorithm to the entire original population of 133,000. Here are the results:
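Constructing these two study populations is mostly a matter of sampling. Here is a sketch in pandas, assuming a frame `df` with a 0/1 `planned_gift` flag for all 133,000 records; I also assume the holdout’s 5,530 non-donors are drawn separately from the training non-donors, a detail the description leaves open.

```python
import pandas as pd

donors = df[df["planned_gift"] == 1]       # 582 rows
non_donors = df[df["planned_gift"] == 0]   # 132,418 rows

# Model E: keep every donor, and sample non-donors so donors make up ~5% of the file
n_non = int(len(donors) * 0.95 / 0.05)     # 582 donors -> ~11,058 non-donors
sample_e = pd.concat([donors, non_donors.sample(n=n_non, random_state=42)])

# Model F: split the donors into two equal halves, one for training and one for holdout
train_donors = donors.sample(frac=0.5, random_state=42)   # 291 rows
holdout_donors = donors.drop(train_donors.index)          # the other 291

# Draw 11,060 non-donors once, then split, so training and holdout do not overlap
non_split = non_donors.sample(n=11060, random_state=7)
train_f = pd.concat([train_donors, non_split.iloc[:5530]])
holdout_f = pd.concat([holdout_donors, non_split.iloc[5530:]])
```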

Using the same independent variables as in models A through D, model E had a Nagelkerke of .39 and model F .38, which helps substantiate that the independent variables are useful predictors of planned giving. Models E and F were far more effective at predicting the planned givers (129 and 123 respectively with a probability of giving of .5 or greater) than model A (11), i.e., more than ten times as many. The sampling techniques have some advantages and disadvantages. The disadvantage is that reducing the non-planned-gift population sacrifices some of its variability and complexity. The advantages, in both models E and F, are that 1) the target population maintains its complexity, 2) new prospects are not limited by characteristic selection (the additional criteria I used to reduce the population in models B, C, and D), which increases the likelihood of identifying constituents who were previously not on the OGP’s radar, and 3) the effects of the small sample bias seem to be reduced.
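Continuing the sketch above: once a model is fit on the undersampled file, the same algorithm can be applied back to the full 133,000 records to count how many known donors it recovers at the .5 threshold. The predictor names are still hypothetical.

```python
import statsmodels.api as sm

predictors = ["age", "total_giving", "event_attendance"]  # hypothetical
model_e = sm.Logit(sample_e["planned_gift"],
                   sm.add_constant(sample_e[predictors])).fit()

# Score the entire original population with the sampled model's algorithm
df["p_gift"] = model_e.predict(sm.add_constant(df[predictors]))

# How many known planned gift donors receive a probability of .5 or greater?
print(((df["planned_gift"] == 1) & (df["p_gift"] >= 0.5)).sum())
```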

It’s important to note that I displayed the measures (Nagelkerke and estimated probabilities) from the exploratory models and populations purely for comparison purposes. Because the study population is manipulated in the exploratory methods, the estimated probabilities are inflated relative to the true base rate (oversampling the target behavior mostly shifts the model’s intercept), so they should not be interpreted as actual probabilities. However, they can still be used to prioritize those with the highest probabilities, and that will serve our need.

To explore another comparison between models A and F, I ranked all 133,000 records under each model. I then sorted the records by their model F rank, took the top 1,000, and ran a correlation between their model A ranks and their model F ranks. The correlation was only .282, meaning the two models rank these records quite differently.
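The text doesn’t say which correlation statistic was used; a Pearson correlation computed on the two sets of ranks is one straightforward reading, sketched here with hypothetical column names for the two models’ estimated probabilities.

```python
from scipy.stats import pearsonr

# Rank every record under each model (rank 1 = highest estimated probability)
df["rank_a"] = df["p_model_a"].rank(ascending=False)
df["rank_f"] = df["p_model_f"].rank(ascending=False)

# Take model F's top 1,000 records and correlate the two sets of ranks
top_f = df.nsmallest(1000, "rank_f")
r, _ = pearsonr(top_f["rank_a"], top_f["rank_f"])
print(f"correlation of ranks: {r:.3f}")
```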

Over the last several months, Peter Wylie, Higher Education Consultant and Contractor, and I have been exchanging ideas on this topic. I thank him for his insight, suggestions, and encouragement to share my findings with our colleagues.

It would be helpful to learn about the methods you’ve used to target rare behavior. We could feel more confident about using alternative methods if repeat efforts produced similar outcomes. Furthermore, I did not have a chance to evaluate the prospecting performance of these models, so if you have used a method for targeting rare behavior and have had an opportunity to assess its effectiveness, I am very interested in learning about that. I welcome ideas, feedback, examples from your research, and questions in regard to this work. Please feel free to contact me at heinrichkellyl@gmail.com.

Endnotes

The ideas for these alternative approaches are adapted from the following articles:

Kelly Heinrich has been conducting quantitative research and analysis in higher education development for two and a half years. She has recently accepted a position as Assistant Director of Prospect Management and Analytics with Stanford University that will begin in June 2013.

## Guest post by John Sammis and Peter B. Wylie

Thanks to all of you who read and commented on our recent paper comparing logistic regression with multiple regression. We were not sure how popular this topic would be, but Kevin told us that interest was high, and there were a number of comments and questions. There were several general themes in the comments; Kevin has done an excellent job responding, but we thought we should throw in our two cents.

#### Why not just use logistic?

The point of our paper was not to suggest that logistic regression should not be used — our point was that multiple regression can achieve prediction results quite similar to those of logistic regression. In our experience working with and training fundraising professionals who are new to analytics, logistic regression can be intimidating. Our goal is always to get these folks to use analytics to help with their fundraising initiatives, and we find that many of them catch on with multiple regression, much less so with logistic regression.

#### Predicted values vs. probabilities

We understand that the predicted values generated by multiple regression are different from the probabilities generated by logistic regression. Regardless of the statistical modeling technique we use, we always bin the raw prediction or probability values into equal-sized score levels. We have found that score levels are easier to work with than raw values, and using equal-sized levels makes it easier to evaluate the scoring model.
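Here is one way to build equal-sized score levels in pandas; it works the same whether the raw values are multiple regression predictions or logistic regression probabilities. The column names are hypothetical, and ranking before binning is simply a trick to break ties so every bin gets the same number of records.

```python
import pandas as pd

# Ten equal-sized score levels, 1 (lowest raw value) through 10 (highest)
df["score_level"] = pd.qcut(df["raw_output"].rank(method="first"),
                            q=10, labels=range(1, 11)).astype(int)

# Equal-sized bins make evaluation easy: each level should hold ~10% of records
print(df["score_level"].value_counts().sort_index())
```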

#### “I cannot agree”

Some commenters, knowledgeable about statistics, said they would not use multiple regression when the inputs called for logistic. According to the rules, if the target variable is binary, then linear modelling doesn’t make sense — and the rules must be obeyed. In our view, this rigid approach to method selection is inappropriate for predictive modelling. The use of multiple linear regression in place of logistic regression may not always make theoretical sense, but predictive modellers are concerned with whether a model produces output that is useful in practical terms. The worth of a model is testable against new, real-world data; therefore a model has only one criterion for determining “appropriate” use: whether it really predicts what the modeler claims it will predict. The truth is revealed during evaluation.

#### A modest proposal

No one reading this should simply take our word that these two dissimilar methods yield similar results. Neither should anyone dismiss the claim out of hand without providing a critique based on real data. We would encourage anyone to try both techniques on their own data and show us what they find. In particular, graduate students looking for a thesis or dissertation topic might consider producing something under this title: “Comparing Logistic Regression and Multiple Regression as Techniques for Predicting Major Giving.”

Heck! Peter says that if anyone were interested in doing a study like this for a thesis or dissertation, he would be willing to offer advice on how to:

1. Do a thorough literature review
2. Formulate specific research questions
3. Come up with a study design
4. Prepare a proposal that would satisfy a thesis or dissertation committee

That’s quite an offer. How about it?

## 29 June 2010

### Choosing the right flavour of regression

Filed under: Model building, regression, Statistics, Terminology, Uncategorized — kevinmacdonell @ 5:50 am


I use two types of regression analysis to build predictive models: multiple linear regression and binary logistic regression. Both are called “regression”, but they are very different animals. You can use either one to build a model, but which one is best for fundraising models?

The answer is that there is no best option that applies across the board. It depends on what you’re trying to predict, certainly, but even more so it depends on the data itself. The best option will not be obvious; it will be revealed only in testing. I don’t mean to sound careless about proper statistical practice, but we work in the real world: It’s not so much a question of “Which tool is most appropriate?” as “Which tool WORKS?”

One of the primary differences between the two types of regression is the definition of the dependent variable. In logistic regression, this outcome variable is either 1 or 0. (There are other forms of logistic regression with multiple nominal outcomes, but I’ll stick to binary outcomes for now.) An example might be “Is a donor / Is not a donor,” or “Is a Planned Giving expectancy / Is not a Planned Giving expectancy.”

In multiple regression, the dependent variable is typically a continuous value, like giving expressed in real dollars (or log-transformed dollars). But the DV can also be a 0/1 value, just as in logistic regression. Technically, using a binary DV violates one of the assumptions underlying multiple regression (normally distributed errors), but that doesn’t necessarily invalidate the model as a powerful predictive tool. Again: what works?
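To make the comparison concrete, here is a minimal sketch fitting both flavours on the same 0/1 outcome with statsmodels; the DataFrame and predictor names are hypothetical.

```python
import statsmodels.api as sm

# df: hypothetical frame with a 0/1 flag and a few predictors per constituent
X = sm.add_constant(df[["lifetime_giving", "event_count", "years_since_degree"]])
y = df["is_pg_expectancy"]

ols_pred = sm.OLS(y, X).fit().predict(X)      # predicted values; can stray outside 0-1
logit_pred = sm.Logit(y, X).fit().predict(X)  # probabilities between 0 and 1
```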

Another difference, less important to my mind, is that the output of a multiple regression analysis is a predicted value that reflects the units (say, dollars) of the DV and may be interpretable as such (predicted lifetime giving, predicted gift amount, etc.), while the output of a logistic regression is a probability value. My practice is to transform both sorts of outputs into scores (deciles and percentiles) for all individuals under study; this allows me to refer to both model outputs simply as “likelihood” and compare them directly.
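Continuing that sketch, converting both outputs to percentile scores takes only a rank transformation, after which the two models can be compared score for score.

```python
import numpy as np
import pandas as pd

# Percentile score from 1 to 100 for each output; ties share a percentile
df["ols_pctile"] = np.ceil(pd.Series(ols_pred, index=df.index).rank(pct=True) * 100).astype(int)
df["logit_pctile"] = np.ceil(pd.Series(logit_pred, index=df.index).rank(pct=True) * 100).astype(int)

# A quick look at how often the two scores agree
print((df["ols_pctile"] - df["logit_pctile"]).abs().describe())
```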

So which to use? I say, use both! If you want some extra confidence in the worth of your model, it isn’t much trouble to prepare both score sets and see how they compare. The key is having a set of holdout cases that represent the behaviour of interest. If your model predicts likelihood to become a Planned Giving expectancy, you first set aside some portion of existing PG expectancies, build the model without them, and then see how well the model performs at assigning scores to that holdout set.
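A sketch of that holdout test, again with hypothetical names: set aside a share of known PG expectancies, rebuild the model without them, then check which score levels the held-out cases land in. A model that works should pile them into the top levels.

```python
import pandas as pd
import statsmodels.api as sm

# Hold out a portion (here 25%, an arbitrary choice) of known expectancies
holdout = df[df["is_pg_expectancy"] == 1].sample(frac=0.25, random_state=0)
train = df.drop(holdout.index)

predictors = ["lifetime_giving", "event_count", "years_since_degree"]  # hypothetical
fit = sm.Logit(train["is_pg_expectancy"],
               sm.add_constant(train[predictors])).fit()

# Score everyone, bin into deciles, then see where the held-out expectancies fall
df["p"] = fit.predict(sm.add_constant(df[predictors]))
df["decile"] = pd.qcut(df["p"].rank(method="first"), 10, labels=range(1, 11))
print(df.loc[holdout.index, "decile"].value_counts().sort_index())
```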

You can use this method of validation when you create only one model, too. But where do you set the bar for confidence in the model if you test only one? Having a rival model to compare with is very useful.

In my next post I will show you a real-world example, and explain how I decided which model worked best.