When should you include giving-related variables in your predictive model, and when should you leave them out? This can seem a bit slippery, because there isn’t one answer that applies to all models. In fact, some things I’ve written lately have confused people, so it’s time to clarify the issue (as I understand it, at least).
A giving-related variable is any type of data that would not exist if the individual had not given a gift. Some of the confusion springs from an earlier blog post, my Big List of 100 predictor variables. Near the end of that giant list of suggested predictor variables, I tacked on another list of variables that are all giving-related. Without explanation, I remarked that these variables were reserved for “donor-only models for which the outcome variable is not ‘giving’.” A fellow data miner asked me in response: “Does that mean you normally don’t use giving stats for your major giving models? I’m confused. I would tend to throw all this stuff in my model.”
The answer to that question is: “It depends.” Primarily, it depends on the nature of the question I am trying to answer, which is to say, how I define the dependent variable (a.k.a. outcome variable). Secondarily, but related to the first, it depends on whether I want to score non-donors as well as donors, or donors only.
Let’s say we want to predict who will give to the Annual Fund. It’s a donor acquisition model, so of course our sample to score will include both donors and non-donors. The dependent variable could be defined as either ‘Lifetime giving’ or the binary variable ‘Is a donor’ (0/1). Either way, we should avoid using predictors that are proxies for our dependent variable, that is, predictors that mean the same thing as our dependent variable. For example, ‘Number of gifts’ or even ‘Made a gift with an American Express card’ are both proxies for ‘Is a donor’: by definition, no non-donor in your data set can have either attribute.
If I have non-donors in my sample and I want to give them scores, then I have to stay away from ANY giving-related variables. This would be especially true of an acquisition model: I want the non-donors to have a chance at getting a high score, but that won’t happen if I introduce giving-related variables into the model.
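One rough way to screen for this kind of variable is to look for predictors that no non-donor ever has. The sketch below is only an illustration under my own assumptions: the records are plain dictionaries, and the field names (is_donor, num_gifts, amex_gift, has_email) are made up for the example rather than taken from any real database.

```python
def find_donor_only_predictors(records, outcome, candidates):
    """Return candidate predictors that are never 'on' for any non-donor."""
    non_donors = [r for r in records if not r[outcome]]
    flagged = []
    for name in candidates:
        # If no non-donor ever has a non-zero value for this predictor,
        # it cannot help a non-donor earn a high score.
        if not any(r[name] for r in non_donors):
            flagged.append(name)
    return flagged


# Hypothetical sample: two donors, two non-donors.
records = [
    {"is_donor": 1, "num_gifts": 3, "amex_gift": 1, "has_email": 1},
    {"is_donor": 1, "num_gifts": 1, "amex_gift": 0, "has_email": 0},
    {"is_donor": 0, "num_gifts": 0, "amex_gift": 0, "has_email": 1},
    {"is_donor": 0, "num_gifts": 0, "amex_gift": 0, "has_email": 0},
]

flagged = find_donor_only_predictors(
    records, "is_donor", ["num_gifts", "amex_gift", "has_email"]
)
print(flagged)
```

Treat the result as a list of suspects, not a verdict: in a small sample, a legitimate predictor can be absent among non-donors purely by chance, and if the sample contains no non-donors at all, every candidate gets flagged.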
“Throwing everything into the model” might give you an excellent value for R squared, but that’s only because you’ve allowed “Y” to leak into the “X” side of the equation. In other words, the model is predicting “donors will be donors, and non-donors won’t be donors.” “Y=Y” is not a very interesting insight. That’s the very opposite of what predictive modeling is all about.
Now suppose that instead of simple participation, we want to predict leadership-level giving in the Annual Fund, or maybe even propensity to give a major gift. The dependent variable (DV) might be defined differently. Instead of ‘Lifetime giving’, we might define the DV as “Gave $10,000 or more lifetime” (0/1). Suddenly, that frees us up to introduce a few giving-related variables into the model, as long as we don’t also want to score non-donors. So I might use variables such as “Has made an anonymous gift” or “Has made a gift using American Express.” Frequency of giving, recency of giving, gift type: these are all OK, because none of them categorically rules out participation by lower-end donors. HOWEVER, because I’m trying to predict which individuals will give at elevated levels, I CAN’T use variables that are stand-ins for ‘gave lots of money’, such as ‘attended a donor-recognition gala’ or ‘member of the President’s Circle’. All the existing big donors would get great scores and everyone else would be at the bottom. Again, that’s not providing insight.
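Here is a minimal sketch of how that donors-only setup might look in code. The $10,000 cutoff comes from the example above; everything else (the dictionary layout, the field names, the EXCLUDED list) is hypothetical and would need to match your own data.

```python
LEADERSHIP_THRESHOLD = 10_000  # dollar cutoff from the example above

# Hypothetical field names for variables that stand in for "gave lots of money".
EXCLUDED = ["attended_recognition_gala", "presidents_circle_member"]


def build_donor_only_sample(records, threshold=LEADERSHIP_THRESHOLD):
    """Keep donors only, add the 0/1 outcome, and drop stand-ins for it."""
    sample = []
    for r in records:
        if r["lifetime_giving"] <= 0:
            continue  # non-donors are not scored by this model
        row = {k: v for k, v in r.items() if k not in EXCLUDED}
        row["gave_10k_plus"] = 1 if r["lifetime_giving"] >= threshold else 0
        # Lifetime giving defines the outcome, so it cannot also be a predictor.
        row.pop("lifetime_giving")
        sample.append(row)
    return sample


# Hypothetical records: a large donor, a small donor, and a non-donor.
records = [
    {"id": 1, "lifetime_giving": 25000, "amex_gift": 1, "presidents_circle_member": 1},
    {"id": 2, "lifetime_giving": 150, "amex_gift": 0, "presidents_circle_member": 0},
    {"id": 3, "lifetime_giving": 0, "amex_gift": 0, "presidents_circle_member": 0},
]

sample = build_donor_only_sample(records)
for row in sample:
    print(row)
```

Giving-related fields such as amex_gift survive because a small donor can have them too; the President's Circle flag does not, because it restates the outcome.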
Ultimately, it’s not about whether giving-related variables should be used as predictors. It’s about keeping ANY proxy for your outcome variable (the Y side of your equation) out of the predictors (the X side of the equation), as much as possible. For example, if you are modeling for likelihood to attend an event, and your DV is defined as “Has attended an event”, or maybe “Number of events attended,” then you wouldn’t use as a predictor the fact that someone filled out a survey to give feedback about the event they just attended — because we know that variable equates to the behaviour of having attended an event.
I say avoid this “as much as possible” because often the lack of independence is not so obvious. Some good predictors are not truly independent of Y, a condition we can detect if we are watchful while evaluating our predictors. But predictors must be at least partially independent, or the resulting scores will not be very interesting. Not invalid, exactly, just sort of useless.
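One simple way to be watchful, offered as my own illustration and not as the only test, is to check how strongly each candidate predictor correlates with Y and take a hard look at anything suspiciously close to perfect. In this sketch the 0.9 cutoff is an arbitrary assumption, and the variable names are invented for the event-attendance example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information either way
    return cov / sqrt(var_x * var_y)


SUSPICIOUS = 0.9  # arbitrary cutoff, chosen only for illustration

# Hypothetical data: Y is "Has attended an event" (0/1).
attended_event = [1, 1, 1, 0, 0, 0]
predictors = {
    "filled_event_survey": [1, 1, 1, 0, 0, 0],  # only attendees get the survey
    "lives_nearby": [1, 0, 1, 0, 1, 0],
}

correlations = {name: pearson(xs, attended_event) for name, xs in predictors.items()}
suspects = [name for name, r in correlations.items() if abs(r) >= SUSPICIOUS]
print(correlations)
print(suspects)
```

A check like this will not catch everything. If only some attendees fill out the survey, the correlation drops well below the cutoff even though the variable is still a proxy, so a high correlation is a prompt to think about where the variable comes from, not a substitute for thinking.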
So your choice comes down to the nature of your question, and the composition of your model. Once you’ve defined the question, how do you decide which model to build? Here’s what I do. I build at least two models, sometimes several. I use different outcome variables. I use different data sets, one with donors only, another that includes non-donors. I include or exclude giving-related predictors. And finally, I use both multiple linear regression and binary logistic regression. Then I check how each model scores a holdout sample, and I go with whichever model does the best job.
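The holdout comparison might be sketched like this. I haven't said above what “the best job” means, so the yardstick here is an assumption: the share of donor/non-donor pairs in the holdout that a model ranks in the right order (the area under the ROC curve). The model names and scores are invented, and the sketch assumes every candidate has scored the same holdout records against the same 0/1 outcome.

```python
def auc(outcomes, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(outcomes, scores) if y == 1]
    neg = [s for y, s in zip(outcomes, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("holdout sample needs both outcomes present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical holdout outcomes and the scores each candidate model gave them.
holdout_outcomes = [1, 0, 1, 0, 0, 1]
candidate_scores = {
    "model_a_linear": [0.9, 0.4, 0.6, 0.5, 0.2, 0.3],
    "model_b_logistic": [0.8, 0.3, 0.7, 0.2, 0.1, 0.6],
}

results = {name: auc(holdout_outcomes, s) for name, s in candidate_scores.items()}
best = max(results, key=results.get)
print(results)
print(best)
```

Because this measure depends only on how the scores rank people, it can compare linear and logistic models even though their raw scores sit on different scales.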