CoolData blog

4 March 2011

Giving-related variables: Keep or leave out?

Filed under: Best practices, Model building, Predictor variables, regression — Tags: — kevinmacdonell @ 1:37 pm

When should you include giving-related variables in your predictive model, and when should you leave them out? This can seem a bit slippery, because there isn’t one answer that applies to all models. In fact, some things I’ve written lately have confused people, so it’s time to clarify the issue (as I understand it, at least).

A giving-related variable is any type of data that would not exist if the individual had not given a gift. Some of the confusion springs from an earlier blog post, my Big List of 100 predictor variables. Near the end of that giant list of suggested predictor variables, I tacked on another list of variables that are all giving-related. Without explanation, I remarked that these variables were reserved for “donor-only models for which the outcome variable is not ‘giving’.” A fellow data miner asked me in response: “Does that mean you normally don’t use giving stats for your major giving models? I’m confused. I would tend to throw all this stuff in my model.”

The answer to that question is: “It depends.” Primarily, it depends on the nature of the question I am trying to answer, which is to say, how I define the dependent variable (a.k.a. outcome variable). Secondarily, but related to the first, it depends on whether I want to score non-donors as well as donors, or donors only.

Let’s say we want to predict who will give to the Annual Fund. It’s a donor acquisition model, so of course our sample to score will include both donors and non-donors. The dependent variable could be defined as either ‘Lifetime giving’ or the binary variable ‘Is a donor’ (0/1). Either way, we should avoid using predictors that are proxies for our dependent variable — that is, predictors that mean the same thing as our dependent variable. For example, ‘Number of gifts’ or even ‘Made a gift with an American Express card’ are both proxies for ‘Is a donor’ — all the non-donors in your data set will by definition be excluded from having this attribute.

If I have non-donors in my sample and I want to give them scores, then I have to stay away from ANY giving-related variables. This would be especially true of an acquisition model: I want the non-donors to have a chance at getting a high score, but that won’t happen if I introduce giving-related variables into the model.

“Throwing everything into the model” might give you an excellent value for R squared, but that’s only because you’ve allowed “Y” to leak into the “X” side of the equation. In other words, the model is predicting “donors will be donors, and non-donors won’t be donors.” “Y=Y” is not a very interesting insight. That’s the very opposite of what predictive modeling is all about.
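One rough way to screen for this kind of leakage is to look for predictors that are never "switched on" for anyone whose outcome is 0. The sketch below is only an illustration: the records, the field names (`is_donor`, `num_gifts`, `amex_gift`, `has_email`, `attended_reunion`) and the `find_proxies` helper are all hypothetical, and in a small sample a perfectly legitimate predictor could be flagged by chance.

```python
# Hypothetical records; the field names are illustrative, not from a real database.
records = [
    {"is_donor": 1, "num_gifts": 3, "amex_gift": 1, "has_email": 1, "attended_reunion": 0},
    {"is_donor": 1, "num_gifts": 1, "amex_gift": 0, "has_email": 0, "attended_reunion": 1},
    {"is_donor": 0, "num_gifts": 0, "amex_gift": 0, "has_email": 1, "attended_reunion": 1},
    {"is_donor": 0, "num_gifts": 0, "amex_gift": 0, "has_email": 0, "attended_reunion": 0},
]

def find_proxies(rows, outcome, predictors):
    """Return predictors that are zero for every row whose outcome is 0."""
    negatives = [r for r in rows if r[outcome] == 0]
    return [p for p in predictors if all(r[p] == 0 for r in negatives)]

suspects = find_proxies(
    records, "is_donor",
    ["num_gifts", "amex_gift", "has_email", "attended_reunion"],
)
print(suspects)  # ['num_gifts', 'amex_gift']
```

Anything this flags deserves a second look before it goes into a model that scores non-donors; it is a prompt for judgment, not a verdict.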

Now suppose that instead of simple participation, we want to predict leadership-level giving in Annual Fund, or maybe even propensity to give a major gift. The dependent variable might be defined differently. Instead of ‘Lifetime giving’, we might define the DV as “Gave $10,000 or more lifetime” (0/1). Suddenly, that frees us up to introduce a few giving-related variables into the model, as long as we do not also want to score non-donors. So I might use variables such as “Has made an anonymous gift” or “Has made a gift using American Express.” Frequency of giving, Recency of giving, Gift type: these are all OK, because none of them categorically rules out participation by lower-end donors. HOWEVER, because I’m trying to predict which individuals will give at elevated levels, I CAN’T use variables that are stand-ins for ‘gave lots of money’, such as ‘attended a donor-recognition gala’ or ‘member of the President’s Circle’. All the existing big donors would get great scores and everyone else would be at the bottom. Again, that’s not providing insight.
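To make the donors-only setup concrete, here is a minimal sketch under stated assumptions: the constituent records, the field names (`lifetime_giving`, `amex_gift`, `presidents_circle`) and the `crosstab` helper are invented for illustration. It restricts the sample to donors, defines the 0/1 DV at the $10,000 threshold, and then counts how each flag splits across the two DV groups.

```python
THRESHOLD = 10_000  # leadership-level cutoff used in the text

# Hypothetical constituents; values are made up for illustration.
constituents = [
    {"id": 1, "lifetime_giving": 0,      "amex_gift": 0, "presidents_circle": 0},
    {"id": 2, "lifetime_giving": 250,    "amex_gift": 1, "presidents_circle": 0},
    {"id": 3, "lifetime_giving": 1_200,  "amex_gift": 0, "presidents_circle": 0},
    {"id": 4, "lifetime_giving": 15_000, "amex_gift": 1, "presidents_circle": 1},
    {"id": 5, "lifetime_giving": 40_000, "amex_gift": 0, "presidents_circle": 1},
]

# Donors-only sample: non-donors are dropped, so giving-related flags can be considered.
donors = [c for c in constituents if c["lifetime_giving"] > 0]

# Binary DV: 1 if lifetime giving reaches the threshold, else 0.
dv = {c["id"]: int(c["lifetime_giving"] >= THRESHOLD) for c in donors}

def crosstab(rows, flag):
    """Count donors with the flag set, split by DV value: {0: below, 1: at or above}."""
    counts = {0: 0, 1: 0}
    for c in rows:
        if c[flag]:
            counts[dv[c["id"]]] += 1
    return counts

amex = crosstab(donors, "amex_gift")            # {0: 1, 1: 1}
circle = crosstab(donors, "presidents_circle")  # {0: 0, 1: 2}
print(amex, circle)
```

In this toy data the American Express flag shows up on both sides of the threshold, so it can still discriminate. The President's Circle flag appears only among the big donors, which is the signature of a stand-in for the DV.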

Ultimately, it’s not about whether giving-related variables should be used as predictors. It’s about keeping ANY proxy for your outcome variable (the Y side of your equation) out of the predictors (the X side of the equation), as much as possible. For example, if you are modeling for likelihood to attend an event, and your DV is defined as “Has attended an event”, or maybe “Number of events attended,” then you wouldn’t use as a predictor the fact that someone filled out a survey to give feedback about the event they just attended — because we know that variable equates to the behaviour of having attended an event.

I say avoid this “as much as possible,” because often the lack of independence is not so obvious. Some good predictors are not truly independent of Y, a condition we can detect if we are watchful while evaluating our predictors. But predictors must be at least partially independent, or the resulting scores will not be very interesting. Not invalid, exactly; just sort of useless.

So your choice comes down to the nature of your question, and the composition of your model. Once you’ve defined the question, how do you decide which model to build? Here’s what I do. I build at least two models, sometimes several: I use different outcome variables; I use different data sets, one with donors only, another that includes non-donors; I include or exclude giving-related predictors, and finally I use both multiple linear regression and binary logistic regression. Then I check how each model scores a holdout sample, and I go with whichever model does the best job.
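The final comparison step might look something like the sketch below. The post does not say how "the best job" is measured, so the metric here is an assumption: a simple concordance (AUC-style) statistic, the share of donor/non-donor pairs in the holdout where the donor received the higher score. The model names, scores and outcomes are hypothetical, and the comparison is only meaningful when the candidates are scored against the same holdout records and the same actual outcome.

```python
def auc(scores, outcomes):
    """Share of (positive, negative) pairs where the positive scored higher; ties count half."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical holdout: actual outcomes, plus the scores each candidate model assigned.
holdout_outcomes = [1, 0, 1, 0, 0, 1]
candidate_scores = {
    "logistic_with_nondonors": [0.9, 0.2, 0.6, 0.65, 0.1, 0.8],
    "linear_with_nondonors":   [0.6, 0.5, 0.3, 0.4, 0.2, 0.7],
}

results = {name: auc(s, holdout_outcomes) for name, s in candidate_scores.items()}
best = max(results, key=results.get)
print(best, round(results[best], 3))  # logistic_with_nondonors 0.889
```

A rank-based measure is convenient here because it lets linear-regression scores and logistic probabilities be compared on the same footing; any other holdout measure you trust would slot into the same loop.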

5 Comments »

  1. […] in a model that is trying to predict giving itself. (My answer was “it depends”. See Giving-related variables: Keep or leave out?) Today I zero in on a specific example: gifts of securities as a predictor of major […]

    Pingback by Gifts of stock as a predictor of Major Gift potential « CoolData blog — 10 March 2011 @ 6:09 am

  2. Suppose you are 3/4 of the way through the year and the Annual Gift team isn’t on track to meet their goals. They decide to make a special push to get 1,000 more donors and they ask you for a list of the 1,000 people who have not yet given a gift this year who are the most likely to do so. Would this be a situation where you would use giving data from previous years?

    Comment by Jim — 25 April 2011 @ 11:37 am

  3. Jim, I think that depends on what you mean by a “special push.” The whole predictive modeling exercise is already aimed at focusing on the people “who have not yet given this year who are most likely to do so.” If one had a high confidence level in the initial model, there would be little point in building a new model. The situation might call for a change in strategy, not models. In the scenario you describe, you are implying that not every past donor has been solicited yet – the strategy would seem to call for halting acquisition efforts and refocusing exclusively on renewal (and reactivation) for people with some past giving history. People with past giving (especially recent past giving) are more likely to give than people with no giving – I would never dispute that. We know this without needing to model it. (I presume your “1,000 more donors” would not include converting current non-donors, because, of course, they would have no giving data from previous years.) Let’s leave aside the fact that suspending donor acquisition to meet a short-term, and probably arbitrary, goal is not a great strategy for the health of the program; we just have to suck it up and try to meet the goal. Maybe the special push isn’t about “more donors,” but about identifying which donors are most likely to be successfully upgraded to a leadership level of giving. In that case, a new model might indeed be useful. One way or the other, YES, the situation you outline would seem to call for using past giving data: whether in simply focusing on people with past giving, or in going all the way and building a new model that predicts propensity to give at higher levels. BUT, I say it again: Your choice of variables doesn’t depend on the situation, it depends on the question you’re trying to answer and the way you define the dependent variable.

    Comment by kevinmacdonell — 25 April 2011 @ 12:23 pm

  4. I am just getting started in predictive modeling and plan on using giving history as a significant variable in my model; this is how. In modeling, I plan to look back at the last fiscal year’s donations as the discrete dependent variable for my model. A few major explanatory variables will likely be related to giving in years prior to last. For example, I think that with FY10 (Y or N) as the dependent var, I will likely use FY09 (Y or N), and FY04-08 (Y or N) as two major explanatory vars. The only problem that I see with doing this is age or grad year related, but to deal with this I think I will likely be splitting my data up into multiple models for different prospect types. By doing an alumni-specific model, I can then include an interaction variable between grad year and the donation history vars, which in my opinion will serve me best in developing an accurate predictive model. I’m just curious if you still feel that giving history is a dangerous inclusion, and if so, why that is.

    Comment by Big Dog — 10 June 2011 @ 4:04 pm

    • Without knowing the specific goal of your modeling project, I’m not sure what the right answer is. But in general I am not in favour of using giving to predict giving except under very controlled circumstances. If you’re trying to identify those non-donors who are most likely to convert to donors, then completely stay away from giving history as a predictor. If you’re trying to determine which past donors have the highest lifetime value to your fundraising program, then maybe RFM and not predictive modeling is the tool you want to use. I see what you’re trying to do: You’re trying to put a hard wall between time periods. Giving in one time period is predictive of giving in some other time period. The problem is, your time-defined wall must also be true of your predictor variables, so that a future condition is not present in your past variables. Your predictor variables have to represent a snapshot of your database from some point in the past. That sounds like more work than it’s worth, to be honest. It’s just my opinion, but I think you’d be better off to maximize the independence of your predictor variables from what you’re trying to predict. Especially if you’re just starting out in modeling, you’ll run fewer risks if you keep your dependent variable strictly out of your predictors. Having a low R-square is not an issue. Having an insipid model, which tells you nothing more than “last year’s donors are most likely to give this year”, is an issue in that it’s kind of a waste of time. Does that make sense?

      Comment by kevinmacdonell — 15 June 2011 @ 1:24 pm
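The "snapshot" requirement in the reply above can be sketched roughly as follows. Everything here is assumed for illustration: the activity records, the field names, the `snapshot_counts` helper, and in particular the 30 June 2009 cutoff, which supposes a fiscal year ending in June (the thread does not say when FY09 ends). The idea is simply that every predictor is built only from records dated on or before the cutoff, so nothing from the outcome period can leak in.

```python
from datetime import date

# Assumed end of FY09; the DV in the thread is giving in FY10.
CUTOFF = date(2009, 6, 30)

# Hypothetical activity records, one row per dated interaction.
events = [
    {"id": 1, "kind": "event_attended",  "on": date(2008, 10, 4)},
    {"id": 1, "kind": "address_update",  "on": date(2010, 2, 1)},   # after cutoff, ignored
    {"id": 2, "kind": "event_attended",  "on": date(2009, 9, 15)},  # after cutoff, ignored
    {"id": 2, "kind": "survey_response", "on": date(2009, 3, 12)},
]

def snapshot_counts(rows, cutoff):
    """Count activity per (constituent, kind) using only records dated on or before cutoff."""
    counts = {}
    for r in rows:
        if r["on"] <= cutoff:
            key = (r["id"], r["kind"])
            counts[key] = counts.get(key, 0) + 1
    return counts

predictors = snapshot_counts(events, CUTOFF)
print(predictors)  # {(1, 'event_attended'): 1, (2, 'survey_response'): 1}
```

This only works for data that carries a date. Fields that are overwritten in place, such as a current address or a "has email" flag, cannot be rolled back this way, which is part of why the reply calls the approach more work than it may be worth.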

