CoolData blog

20 September 2012

When less data is more, in predictive modelling

When I started doing predictive modelling, I was keenly interested in picking the best and coolest predictor variables. As my understanding deepened, I turned my attention to how to define the dependent variable in order to really get at what I was trying to predict. More recently, however, I’ve been thinking about refining or limiting the population of constituents to be scored, and how that can help the model.

What difference does it make who gets a propensity score? Up until maybe a year ago, I wasn’t too concerned. Sure, probably no 22-year-old graduate had ever entered a planned giving agreement, but I didn’t see any harm in applying a score to all our alumni, even our youngest.

Lately, I’m not so sure. Using the example of a planned gift propensity model, the problem is this: Young alumni don’t just get a score; they also influence how the model is trained. If all your current expectancies were at least 50 before they decided to make a bequest, and half your alumni are under 30 years old, then one of the major distinctions your model will make is based on age. ANY alum over 50 is going to score well, regardless of whether he or she has any affinity to the institution, simply because 100% of your target is in that age group.

The model is doing the right thing by giving higher scores to older alumni. If ages in the sample range from 21 to 100+, then age as a variable will undoubtedly contribute to a large chunk of the model’s ability to “explain” the target. But this hardly tells us anything we didn’t already know. We KNOW that alumni don’t make bequest arrangements at age 22, so why include them in the model?

It’s not just that their scores are irrelevant. I’m concerned about allowing good predictor variables to interact with ‘Age’ in a way that compromises their effectiveness. Those variables end up being moderated by ‘Age’ without the model getting any better at giving us what we want from it.

Note that we don’t have to explicitly enter ‘Age’ as a variable in the model for young alumni to influence the outcome in undesirable ways. Here’s an example, using event attendance as a predictor:

Let’s say a lot of very young alumni and some very elderly constituents attend their class reunions. The older alumni who attend reunions are probably more likely than their non-attending classmates to enter into planned giving agreements; for my institution, that is definitely the case. On the other hand, the young alumni who attend reunions are probably no more or less likely than their non-attending peers to consider planned giving, because no one that age is a serious prospect. What happens to ‘event attendance’ as a predictor in a model in which the dependent variable is ‘Current planned giving expectancy’? … Because a lot of young alumni who are not expectancies attended events, the attribute of being an event attendee will be associated with NOT being a planned giving expectancy. Or at the very least, it will considerably dilute the positive association between predictor and target found among older alumni.

I confirmed this recently using some partly made-up data. The data file started out as real alumni data and included age, a flag for who is a current expectancy, and a flag for ‘event attendee’. I massaged it a bit by artificially bumping up the number of alumni under the age of 50 who were coded as having attended an event, to create a scenario in which an institution’s events are equally popular with young and old alike. In a simple regression model with the entire alumni file included in the sample, ‘event attendance’ was weakly associated with being a planned giving expectancy. When I limited the sample to alumni 50 years of age and older, however, the R squared statistic doubled. (That is, event attendance was about twice as effective at explaining the target.) Conversely, when I limited the sample to under-50s, R squared was nearly zero.
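A minimal sketch of this kind of experiment in Python, for readers who want to try it. The cell counts are invented for illustration and are not the data described above; `r_squared` and `make_group` are hypothetical helper names. It relies on the fact that, for a regression with a single predictor, R squared equals the squared correlation between predictor and target.

```python
from math import fsum

def r_squared(rows):
    """R squared of a one-predictor linear regression of y on x.

    With a single predictor this equals the squared Pearson correlation.
    rows is a list of (attended_event, is_expectancy) pairs coded 0/1.
    """
    n = len(rows)
    mean_x = fsum(x for x, _ in rows) / n
    mean_y = fsum(y for _, y in rows) / n
    sxx = fsum((x - mean_x) ** 2 for x, _ in rows)
    syy = fsum((y - mean_y) ** 2 for _, y in rows)
    sxy = fsum((x - mean_x) * (y - mean_y) for x, y in rows)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation in x or y, so nothing to explain
    return sxy ** 2 / (sxx * syy)

def make_group(attend_exp, attend_not, stay_exp, stay_not):
    """Build (attended, expectancy) rows from four cell counts."""
    return ([(1, 1)] * attend_exp + [(1, 0)] * attend_not
            + [(0, 1)] * stay_exp + [(0, 0)] * stay_not)

# Invented counts (assumption): events are equally popular in both age
# groups (30% attend), but attendance only relates to expectancy status
# among alumni aged 50 and over.
older = make_group(attend_exp=300, attend_not=2700, stay_exp=210, stay_not=6790)
younger = make_group(attend_exp=6, attend_not=2994, stay_exp=14, stay_not=6986)

r2_all = r_squared(older + younger)
r2_older = r_squared(older)
r2_younger = r_squared(younger)

print(f"All alumni: {r2_all:.4f}")
print(f"50 and over: {r2_older:.4f}")
print(f"Under 50: {r2_younger:.4f}")
```

With these made-up counts, the 50-and-over R squared comes out at roughly twice the whole-file figure, and the under-50 figure is effectively zero. Real data will behave differently in the details; the point is only that the direction of the effect is easy to check on your own file.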

True, I had to tamper with the data in order to get this result. But even had I not, there would still have been many under-50 event attendees, and their presence in the file would still have reduced the observed correlation between event attendance and planned giving propensity, to no useful end.

You probably already know that it’s best not to lump deceased constituents in with living ones, or non-alumni along with alumni, or corporations and foundations along with persons. They are completely distinct entities. But depending on what you’re trying to predict, your population can fruitfully be split along other, more subtle distinctions. Here are a few:

  • For donor acquisition models, in which the target value is “newly-acquired donor”, exclude all renewed donors. You strictly want to have only newly-acquired donors and never-donors in your model. Your good prospects for conversion are the never-donors who most resemble the newly-acquired donors. Renewed donors don’t serve any purpose in such a model and will muddy the waters considerably.
  • Conversely, remove never-donors from models that predict major giving and leadership-level annual giving. Those higher-level donors tend not to emerge out of thin air: They have giving histories.
  • Looking at ‘Age’ again … making distinctions based on age applies to major-gift propensity models just as it does to planned giving propensity: Very young people do not make large gifts. Look at your data to find out what age donors were when they first gave $1,000, say. This will help inform what your cutoff should be.
  • When building models specifically for Phonathon, whether donor-acquisition or contact likelihood, remove constituents who are coded Do Not Call or who do not have a valid phone number in the database, or who are unlikely to be called (international alumni, perhaps).
  • Exclude international alumni from event attendance or volunteering likelihood models, if you never offer involvement opportunities outside your own country or continent.
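The age-cutoff idea in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed method: the ages and constituent records are invented, `age_cutoff` is a hypothetical helper, and the choice of the 10th percentile as the cutoff is arbitrary.

```python
def age_cutoff(first_gift_ages, pct_below=10):
    """Return an age below which roughly pct_below percent of donors
    made their first large gift. pct_below is an arbitrary choice."""
    ages = sorted(first_gift_ages)
    # Integer arithmetic avoids floating-point surprises in the index.
    index = min(len(ages) * pct_below // 100, len(ages) - 1)
    return ages[index]

# Invented example: age of each donor at the time of a first $1,000 gift.
first_gift_ages = [34, 41, 45, 47, 48, 50, 52, 52, 55, 57,
                   58, 60, 61, 63, 64, 66, 68, 70, 73, 79]

cutoff = age_cutoff(first_gift_ages)

# Invented constituent records; only those at or above the cutoff
# stay in the population used to train and score the model.
constituents = [
    {"id": "A", "age": 24},
    {"id": "B", "age": 45},
    {"id": "C", "age": 67},
    {"id": "D", "age": 38},
]
population = [c for c in constituents if c["age"] >= cutoff]

print(cutoff)
print([c["id"] for c in population])
```

The same filtering pattern applies to the other exclusions: drop renewed donors from an acquisition model, or drop Do Not Call records from a Phonathon model, before any training happens.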

Those are just examples. As for general principles, I think both of the following conditions must be met in order for you to gain from excluding a group of constituents from your model. By a “group” I mean any collection of individuals who share a certain trait. Choose to exclude IF:

  1. Nearly 100% of constituents with the trait fall outside the target behaviour (that is, the behaviour you are trying to predict); AND,
  2. Having a score for people with that trait is irrelevant (that is, their scores will not result in any action being taken with them, even if a score is very low or very high).

You would apply the “rules” like this … You’re building a model to predict who is most likely to answer the phone, for use by Phonathon, and you’re wondering what to do with a bunch of alumni who are coded Do Not Call. Well, it stands to reason that 1) people with this trait will have little or no phone contact history in the database (the target behaviour), and 2) people with this trait won’t be called, even if they have a very high contact-likelihood score. The verdict is “exclude.”
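The two conditions can also be written as a small check. This is a sketch only: `should_exclude` is a hypothetical helper, the records are invented, and the 98% threshold is an assumed stand-in for “nearly 100%” that you would set yourself.

```python
def should_exclude(group, in_target, score_is_actionable, min_outside_share=0.98):
    """Apply both conditions to one group of constituents sharing a trait.

    group: records for constituents who have the trait
    in_target: function returning True if a record shows the target behaviour
    score_is_actionable: True if a score would change how the group is treated
    min_outside_share: assumed threshold standing in for "nearly 100%"
    """
    if not group:
        return False  # empty group, nothing to exclude
    outside = sum(1 for record in group if not in_target(record))
    nearly_all_outside = outside / len(group) >= min_outside_share
    # Exclude only when BOTH conditions hold.
    return nearly_all_outside and not score_is_actionable

def answered(record):
    return record["answered_phone"]

# Invented data: 200 Do Not Call alumni, one with phone contact history.
do_not_call = [{"answered_phone": False}] * 199 + [{"answered_phone": True}]
exclude_dnc = should_exclude(do_not_call, answered, score_is_actionable=False)

# Invented contrast: a group that is called and often answers.
recent_grads = [{"answered_phone": False}] * 120 + [{"answered_phone": True}] * 80
exclude_grads = should_exclude(recent_grads, answered, score_is_actionable=True)

print(exclude_dnc, exclude_grads)
```

Requiring both conditions is the important design choice: a group that almost never shows the target behaviour should still be kept if a high score would lead someone to act on it.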

It’s not often you’ll hear me say that less (data) is more. Fewer cases in your data file will in fact tend to depress your model’s R squared. But your ultimate goal is not to maximize R squared — it’s to produce a model that does what you want. Fitting the data is a good thing, but only when you have the right data.


  1. I’ve been building models to guide strategy for our Annual Giving program for the last 5 years and came to much the same conclusion–though in much more of an intuitive/trial & error based method than the test sampling scenario you outlined.

    My first aha was that Renewing & Rejoining are very different behaviors. I now use 2 models rather than 1 to predict likelihood to give. One model is a Renewal likelihood model for Last Year donors and the other is a Rejoin likelihood model for 2 – 3 year lapsed donors.

    I’ve also identified a few groups that I consider as outlier groups for these 2 models. They are:
    * Our Own Employees – we have an Employee Giving campaign that occurs internally…as part of this campaign, we have made a pledge not to send our own employees solicitations outside of this campaign. This is an instance where not only did the presence of the group skew the results of the model for non-employee donors, we also were not going to change our solicitation plan based on any results from the model.
    * United Way Designation Only Donors – We have donors whose only gift in the year being modeled is actually made to a United Way agency and designated to us.
    * Donors with Outstanding Pledge Balances – The thinking here is that these donors are more or less guaranteed to renew. They’ve already pledged and are scheduled for a payment so including them in the model does nothing but skew the results for the other donors.

    There is one other interesting donor group I’ve identified, although they do not come into play with either of these donor models. These are donors whose entire giving history has been made through Events and Fundraisers (they aren’t included in the 2 Annual Models because we don’t consider event giving to be part of our Annual Giving program). I’ve shown that these constituents are extremely difficult to convert to Annual Fund donors (on average, we only convert 2 – 5% of last year’s donors whose entire giving history is through events/fundraisers to making a gift to the Annual Fund).

    I think that identifying these outlier groups is every bit as important as carefully defining your dependent variable.

    Comment by David Logan — 21 September 2012 @ 3:31 pm

  2. Do universities have regular giving programs? Thanks.

    Comment by cnukus — 30 September 2012 @ 7:49 am

  3. David,

    “I’ve shown that these constituents are extremely difficult to convert to Annual Fund donors”: is that because you have tried to convert them but the conversion rate is low, or because they are currently not exposed to the Annual Fund?

    Comment by cnukus — 30 September 2012 @ 7:53 am

    • That is an excellent question. The short answer is yes, they have been exposed to the Annual Fund…but not with any particular strategy geared toward this unique group of donors. For a number of years, our department was concentrated on expanding our Major Gift team and there was no real thought given to how to best use our Annual Program. The result was that we sent a few specific appeals around the same 2 time periods every year and all direct mail was sent to every recent donor. Over the past 5 years, we have worked hard to diversify the appeals produced for the Annual Fund (both in terms of content and timing) and to better target the mailing lists for each one (we have one appeal that saw its mailing list reduced from around 18,000 to 2,500 w/o any significant change to $ raised–needless to say, the ROI is much improved for this one). My feeling about this group of event only donors is that timing will be key to improving our conversion of them from supporting events to donating directly to the Annual Fund. They need an appeal as soon as possible after the event they attend. Starting this year, we have been checking for donors who have made their first gift this year since our last appeal that meet this definition of event only donors and making sure to include them on the list for our next appeal. This really is just a start and there is much more to learn about this group. On my list of future needs/projects:

      * For now, I mostly have not made an attempt to understand the specific event/fundraisers these donors are participating in. There are 2 each year that really are ours and that we have a lot of control in the messaging for, but there are many others that are 3rd party events (they are events hosted by someone else for our benefit). It could be beneficial to align events/fundraisers to specific appeals according to their focus (events supporting our Cancer Center would receive an appeal focused on Cancer for instance).
      * I’ve only looked at conversion into the Annual Giving program. I need to look at conversion to Major Gift donors. For the 2 events that we host, I suspect we are more successful at converting event only participants into Major Gift donors than we are at converting them into Annual Giving donors simply by the nature of these events.
      * I’ve given some thought to developing a new donor packet that is specific for new donors whose first gift is made through an event/fundraiser.

      Thanks for the great question!

      Comment by David Logan — 1 October 2012 @ 10:13 am
