CoolData blog

7 January 2015

New finds in old models


When you build a predictive model, you can never be sure it’s any good until it’s too late. Deploying a mediocre model isn’t the worst mistake you can make, though. The worst mistake would be to build a second mediocre model because you haven’t learned anything from the failure of the first.


Performance against a holdout data set for validation is not a reliable indicator of actual performance after deployment. Validation may help you decide which of two or more competing models to use, or it may provide reassurance that your one model isn’t total junk. It’s not proof of anything, though. Those lovely predictors, highly correlated with the outcome, could be fooling you. There are no guarantees they’re predictive of results over the year to come.


In the end, the only real evidence of a model’s worth is how it performs on real results. The problem is, those results happen in the future. So what is one to do?


I’ve long been fascinated with Planned Giving likelihood. Making a bequest seems like the ultimate gesture of institutional affinity (ultimate in every sense). On the plus side, that kind of affinity ought to be clearly evidenced in behaviours such as event attendance, giving, volunteering and so on. On the negative side, Planned Giving interest is uncommon enough that comparing expectancies with non-expectancies will sometimes lead to false predictors based on sparse data. For this reason, my goal of building a reliable model for predicting Planned Giving likelihood has been elusive.


Given that a validation data set taken from the same time period as the training data can produce misleading correlations, I wondered whether I could go one better: draw my holdout sample not from the same time period as the data used to build the model, but from the future.


As it turned out, yes, I could.


Every year I save my regression analyses as Data Desk files. Although I assess the performance of the output scores, I don’t often go back to the model files themselves. However, they’re there as a document of how I approached modelling problems in the past. As a side benefit, each file is also a snapshot of the alumni population at that point in time. These data sets may consist of a hundred or more candidate predictor variables — a well-rounded picture.


My thinking went like this: Every old model file represents data from the past. If I pretend that this snapshot is really the present, then in order to have access to knowledge of the future, all I have to do is look at today’s data stored in the database.


For example, for this blog post, I reached back two years to a model I created in Data Desk for predicting likelihood to upgrade to the Leadership level in Annual Giving. I wasn’t interested in the model itself. Rather, I wanted to examine the underlying variables I had to work with at the time. This model had been an ambitious undertaking, with some 170 variables prepared for analysis. Many of course were transformations of variables or combinations of interacting variables. Among all those variables was one indicating whether a case was a current Planned Giving expectancy or not, at that point in time.


In this snapshot of the database from two years ago, some of the cases that were not expectancies would have become so since then. In other words, I now had the best of both worlds. I had a comprehensive set of potential predictors as they existed two years ago, AND access to the hitherto unknowable future: The identities of the people who had become expectancies after the predictors had been frozen in time.
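The idea of flagging "new expectancies" can be sketched in a few lines of Python. This is only an illustration, not the Data Desk workflow used for this post: the function name, the field names (`id`, `is_expectancy`, `events_attended`) and the toy records are all hypothetical.

```python
def flag_new_expectancies(snapshot, current_expectancy_ids):
    """Add a 0/1 'new_expectancy' flag to each row of an old snapshot.

    snapshot: list of dicts, one per person, holding 'id' and the
              'is_expectancy' value recorded when the old file was saved.
    current_expectancy_ids: set of ids that are expectancies in today's data.
    """
    flagged = []
    for row in snapshot:
        # "New" means: not an expectancy back then, but one today.
        became_expectancy = (
            row["is_expectancy"] == 0 and row["id"] in current_expectancy_ids
        )
        flagged.append({**row, "new_expectancy": int(became_expectancy)})
    return flagged


# Hypothetical toy data: predictors as frozen two years ago.
old_snapshot = [
    {"id": "A1", "is_expectancy": 1, "events_attended": 6},
    {"id": "A2", "is_expectancy": 0, "events_attended": 4},
    {"id": "A3", "is_expectancy": 0, "events_attended": 0},
]
# Hypothetical result of querying today's database for expectancies.
expectancies_today = {"A1", "A2"}

rows = flag_new_expectancies(old_snapshot, expectancies_today)
new_cases = [r["id"] for r in rows if r["new_expectancy"] == 1]
print(new_cases)
```

Keeping the old `is_expectancy` flag alongside the new one means the model can be trained on the first and tested against the second.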


As I said, my old model was not intended to predict Planned Giving inclination. So I built a new model, using “Is an Expectancy” (0/1) as the target variable. I trained the regression model on the two-year-old expectancy data — I didn’t even look at the new expectancies while building the model. No: I used those new expectancies as my validation data set.


“Validation” might be too strong a word, given that there were only 80 or so new cases. That’s a lot of bequest intentions, for sure, but in terms of data it’s a drop in the bucket compared with the number of cases being scored. Let’s call it a test data set. I used this test set to help me analyze the model, in a couple of ways.


First I looked at how new expectancies were scored by the model I had just built. The chart below shows their distribution by score decile. Slightly more than 50% of new expectancies were in the top decile. This looks pretty good — keeping in mind that this is what actual performance would have looked like had I really built this model two years ago (which I could have):




(Even better, looking at percentiles, most of the expectancies in that top 10% are concentrated nicely in the top few percentiles.)
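For anyone who wants to reproduce the decile check outside Data Desk, here is a minimal Python sketch. It assumes deciles are assigned by rank (ties are broken by input order, which may differ from your software) and that there is at least one new expectancy. The function names and toy scores are hypothetical.

```python
def score_deciles(scores):
    """Assign each score a decile from 1 (lowest) to 10 (highest) by rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    deciles = [0] * len(scores)
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // len(scores) + 1
    return deciles


def share_by_decile(scores, is_new_expectancy):
    """Return {decile: share of all new expectancies found in that decile}."""
    deciles = score_deciles(scores)
    total = sum(is_new_expectancy)  # assumed to be greater than zero
    counts = {d: 0 for d in range(1, 11)}
    for decile, flag in zip(deciles, is_new_expectancy):
        counts[decile] += flag
    return {d: counts[d] / total for d in counts}


# Hypothetical toy data: 20 scored cases, 4 of which became expectancies.
scores = [i / 20 for i in range(20)]
flags = [1 if i in (3, 10, 18, 19) else 0 for i in range(20)]

shares = share_by_decile(scores, flags)
print(shares[10])  # share of new expectancies in the top decile
```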


But I didn’t stop there. It is also evident that almost half of new expectancies fell outside the top 10 percent of scores, so clearly there was room for improvement. My next step was to examine the individual predictors I had used in the model. These were of course the predictors most highly correlated with being an expectancy. They were roughly the following:
  • Year person’s personal information in the database was last updated
  • Number of events attended
  • Age
  • Year of first gift
  • Number of alumni activities
  • Indicated “likely to donate” on 2009 alumni survey
  • Total giving in last five years (log transformed)
  • Combined length of name Prefix + Suffix
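Two of the predictors above are derived rather than stored. A rough Python sketch of how they might be computed follows. The post does not say which log transform was used, so `log1p` (which keeps a giving total of zero at zero) is an assumption, as are the function names.

```python
import math


def log_giving(total_giving):
    """Log-transform a dollar total; log1p maps 0 to 0 (an assumed choice)."""
    return math.log1p(total_giving)


def prefix_suffix_length(prefix, suffix):
    """Combined character count of name prefix and suffix; missing counts as 0."""
    return len((prefix or "").strip()) + len((suffix or "").strip())


print(log_giving(0))
print(prefix_suffix_length("Dr.", None))
print(prefix_suffix_length("Hon.", "Q.C."))
```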


I ranked the correlation of each of these with the 0/1 indicator meaning “new expectancy,” and found that most of the predictors were still fine, although their order in the ranking changed. Donor likelihood (from survey) and recent giving were more important, and alumni activities and how recently a person’s record was updated were less important.


This was interesting and useful, but what was even more useful was looking at the correlations between ALL potential predictors and the state of being a new expectancy. A number of predictors that would have been too far down the ranked list to consider using two years ago were suddenly looking much better. In particular, many variables related to participation in alumni surveys bubbled closer to the top as potentially significant.
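Ranking every candidate predictor against the "new expectancy" flag is simple to sketch in Python. The version below assumes Pearson correlation and sorts by absolute value, which may not match exactly what your statistics package reports. The variable names and toy values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_predictors(columns, target):
    """Return [(name, r)] sorted by absolute correlation, strongest first."""
    ranked = [(name, pearson(values, target)) for name, values in columns.items()]
    return sorted(ranked, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical toy data: six cases, two of which became expectancies.
new_expectancy = [0, 0, 0, 1, 1, 0]
candidates = {
    "alumni_activities": [3, 1, 2, 2, 1, 3],
    "survey_participant": [0, 0, 0, 1, 1, 0],
    "has_record": [1, 1, 1, 1, 1, 1],  # constant, so it carries no signal
}

ranked = rank_predictors(candidates, new_expectancy)
for name, r in ranked:
    print(f"{name}: {r:+.3f}")
```

Running the same ranking twice, once against the old expectancy flag and once against the new one, shows which variables moved up or down the list.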


This exercise suggests a way to proceed with iterative, yearly improvements to some of your standard models:
  • Dig up an old model from a year or more ago.
  • Query the database for new cases that represent the target variable, and merge them with the old datafile.
  • Assess how your model performed or, if you created more than one model, see which model would have performed best. (You should be doing this anyway.)
  • Go a layer deeper, by studying the variables that went into those models — the data “as it was” — to see which variables had correlations that tricked you into believing they were predictive, and which variables truly held predictive power but may have been overlooked.
  • Apply what you learn to the next iteration of the model. Leave out the variables with spurious correlations, and give special consideration to variables that may have been underestimated before.

4 November 2013

Census Zip Code data versus internal data as predictors of alumni giving

Guest post by Peter Wylie and John Sammis

Thanks to data available via the 2010 US Census, for any educational institution that provides us zip codes for the alums in its advancement database, we can compute such things as the median income and the median house value of the zip code in which the alum lives.
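As a rough illustration of the lookup (not the add-ons to our statistical software), the Python sketch below attaches ZIP-level Census figures to alumni records. Everything in it is hypothetical: the function names, the field names, and especially the dollar values, which are invented placeholders rather than real Census figures. The one practical point it makes is that ZIP codes should be handled as zero-padded text, since a ZIP stored as a number loses its leading zero.

```python
def normalize_zip(raw):
    """Return a 5-character ZIP string, or None if it cannot be parsed."""
    digits = str(raw).strip().split("-")[0]  # drop any ZIP+4 extension
    if not digits.isdigit() or len(digits) > 5:
        return None
    return digits.zfill(5)  # restore leading zeros lost in numeric storage


def attach_census(alumni, census_by_zip):
    """Return alumni records with ZIP-level medians added (None if no match)."""
    enriched = []
    for alum in alumni:
        census = census_by_zip.get(normalize_zip(alum.get("zip")), {})
        enriched.append({
            **alum,
            "median_income": census.get("median_income"),
            "median_house_value": census.get("median_house_value"),
        })
    return enriched


# Invented placeholder values, NOT real Census figures.
census_by_zip = {"01234": {"median_income": 75000, "median_house_value": 500000}}
alumni = [
    {"id": "A1", "zip": 1234},          # leading zero lost
    {"id": "A2", "zip": "01234-5678"},  # ZIP+4 format
    {"id": "A3", "zip": "B3H 4R2"},     # non-US postal code
]

enriched = attach_census(alumni, census_by_zip)
print([row["median_income"] for row in enriched])
```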

Now, we tend to focus on internal data rather than external data. For a very long time the two of us have been harping on something that may be getting a bit tiresome: the overemphasis on finding outside wealth data in major giving, and the underemphasis on looking at internal data. Our problem has been that we’ve never had a solid way to systematically compare these two sources of data as they relate to the prediction of giving in higher education.

John Sammis has done a yeoman’s job of finding a very reasonably priced source for this Census data as well as building some add-ons to our statistical software package – add-ons that allow us to manipulate the data in interesting ways. All this has happened within the last six months or so, and I’ve been having a ball playing around with this data, getting John’s opinions on what I’ve done, and then playing with the data some more.

The data for this piece come from four private, small to medium sized higher education institutions in the eastern half of the United States. We’ll show you a smidgeon of some of the things we’ve uncovered. We hope you’ll find it interesting, and we hope you’ll decide to do some playing of your own.

Download the full, printer-friendly PDF of our study here (free, no registration required): Census ZIP data Wylie & Sammis.

10 March 2011

Gifts of stock as a predictor of Major Gift potential

Filed under: Major Giving, Model building, predictive modeling, Predictor variables, regression | kevinmacdonell @ 6:09 am


In an earlier post, I wrote about giving-related variables and whether or not they’re okay to use in a model that is trying to predict giving itself. (My answer was “it depends”. See Giving-related variables: Keep or leave out?) Today I zero in on a specific example: gifts of securities as a predictor of major giving.

Following the logic of my earlier post, if the sample of people whom you intend to score includes non-donors, and you want non-donors to have a chance of making it onto the radar, then you must rule out ‘Gift of Stock’ as a predictor. Why? Because you want to keep any proxy for your outcome variable (the Y side of your equation) out of the predictors (the X side of the equation), as much as possible. A ‘yes’ for ‘Has made a gift of stock’ is possible ONLY for the donors in your sample, and will provide no insight into a non-donor’s potential for major giving.

But giving-related variables are frequently used to predict major gift potential. Gift count, first gift, recency, and stock gifts are all enticing predictors. You have a decision to make: Do you exclude non-donors, or leave non-donors in and forgo the potential predictive power of these variables?

For some the answer might be easy. If the vast majority of major donors to your institution had some prior giving before making their biggest gifts, and a major gift from a non-donor is extremely unlikely, then it makes sense to exclude non-donors. This makes most sense for alumni models: Alumni who are solicited every year and don’t give are rather unlikely to turn around and give a million dollars. (Although it happens!)

You can avoid having to make the decision, however, if you build two models: One including non-donors (and using no giving-related variables), and one excluding them (freeing your hand to use giving-related variables). That’s what I do. I test the output scores against a holdout sample of major donors, and whichever model outperforms in scoring the major donors will be my choice for that year.

Let’s say that at least one of your models is a donor-only model, and you’re itching to use ‘Stock gifts’ as a predictor. Hold on! You’re not done yet. You need to evaluate the degree to which ‘Stock gifts’ is independent of your dependent variable (DV). If the variable equates to major giving itself, it is not at all independent and should be excluded. It is merely a proxy for being a major donor.

It’s clear that stock givers are different from other donors. In the data set I have before me, alumni who have made at least one gift of stock have median lifetime giving of about $40,000, compared with all other donors’ median giving of about $170. More than 66% of stock donors have lifetime giving over $25,000, and more than 90% of them have made at least one gift of $1,000 or greater.

The fact of having given a gift of securities cannot seriously be considered “independent” of the DV, but the degree of non-independence varies with how the DV is defined. If I define it as “LT Giving over $25K”, I’m probably in the clear, because a considerable portion of stock donors (34%, in my data set) fall outside the definition of my DV. If my DV is “One or more gifts of $1K or greater,” however, I should steer clear of the stock-gifts predictor. True, not all stock donors are in the DV, but almost all of them are.
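One way to put a number on that overlap is to compute, for each candidate definition of the DV, the share of stock donors who already meet it. The Python sketch below is only an illustration: the function name, field names and toy donors are hypothetical and do not reproduce the figures from my data set.

```python
def share_of_flagged_in_dv(donors, predictor, in_dv):
    """Share of donors with predictor == 1 who also meet the DV definition."""
    flagged = [d for d in donors if d[predictor] == 1]
    if not flagged:
        return None  # no one has the predictor, so there is nothing to measure
    return sum(1 for d in flagged if in_dv(d)) / len(flagged)


# Hypothetical toy donors.
donors = [
    {"lifetime_giving": 40000, "largest_gift": 10000, "gave_stock": 1},
    {"lifetime_giving": 30000, "largest_gift": 5000, "gave_stock": 1},
    {"lifetime_giving": 5000, "largest_gift": 2000, "gave_stock": 1},
    {"lifetime_giving": 800, "largest_gift": 500, "gave_stock": 1},
    {"lifetime_giving": 170, "largest_gift": 50, "gave_stock": 0},
]

# Two candidate DV definitions, matching the two discussed above.
over_25k = share_of_flagged_in_dv(
    donors, "gave_stock", lambda d: d["lifetime_giving"] > 25000
)
gift_1k = share_of_flagged_in_dv(
    donors, "gave_stock", lambda d: d["largest_gift"] >= 1000
)
print(over_25k, gift_1k)
```

The closer the share gets to 1, the more the predictor is simply a stand-in for the DV under that definition.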

Stock donors probably represent a very small percentage of all your donors, so the variable may have little influence either way: Not a high-value predictor, but not a damaging one, either. (Given the limited number in your sample, the correlation coefficient is going to be pretty low.) Maybe if 85% of the stock donors were in my DV, instead of 90%, I might go ahead and use it. So in the end, it’s a judgment call based on what seems to make sense for your data and what you hope to get out of it.

23 April 2010

The big list: 85 predictor variables for alumni models

Filed under: Model building, Predictor variables, regression | kevinmacdonell @ 10:06 am

Here is my attempt at compiling an exhaustive list of every predictor variable I have ever tested in the environment of my data – 85 of them! Not every variable is listed separately – some are grouped together by type or source. In some cases I’ve indicated whether the variable is an indicator variable (0/1) or a continuous variable, as necessary. A few variables are peculiar to the institution where I created my models. Variables that came from external sources are marked with an asterisk.

Some of these predictors were never used in a model because they were eclipsed by other, related variables that had stronger correlation with the dependent variable. Others (such as gender) proved problematic and were left out of my models for specific reasons. And some were tested and found not to be predictive at all. (A final model may contain only 15 to 20 good predictor variables.) Still, I include them all here, because any one of them might add value to models you build for your own data.

Also note: A number of predictors, listed at the end, are based on giving history. These are NOT to be used when your predicted value is ‘giving’. These variables were used in other models, such as Planned Giving potential and likelihood to attend events.

  • Class year
  • Earned a degree / Did not earn a degree
  • Number of degrees earned
  • Faculty is Education
  • Faculty is Business
  • Faculty is Arts
  • Faculty is Science
  • Spouse name present
  • Spouse is an alum
  • Spouse has giving (0/1)
  • Spouse lifetime giving (continuous)
  • Student activities present (0/1), e.g. athletics
  • Number of student activities (continuous)
  • Religion present (0/1)
  • Religion is Roman Catholic (0/1)
  • Number of refusals to pledge
  • Refusal reason ‘will handle donation ourselves’
  • Requested to be excluded from affinity programs
  • Requested to be excluded from phone solicitation
  • Preferred address type is ‘Business’
  • Seasonal address present
  • Number of address updates
  • Address is in U.S.A.
  • Address is international
  • Province is Nova Scotia [also tested variables for other provinces]
  • Postal code is rural
  • Postal code is urban
  • Variables based on specific PSYTE cluster codes*
  • Has ‘Found’ code (i.e. a records researcher has had to locate an alum marked as lost)
  • Prefers to read alumni magazine online (‘Green’ option)
  • Home phone number present
  • Business phone number present
  • Mobile phone number present
  • Seasonal phone number present
  • Number of phone updates
  • Home phone number is on Canada’s National Do Not Call Registry*
  • Email present
  • Number of email updates
  • Gender
  • Female-widowed
  • Female-married
  • Marital status ‘married’
  • Marital status ‘single’
  • Marital status ‘widow’
  • Marital status ‘divorced’
  • Marital status – other
  • Name prefix is “Dr.”
  • Name prefix is “Rev.” (or other religious)
  • Name prefix is Hon., Justice, or similar
  • Length of entire name
  • Nickname present
  • First name is single initial
  • Middle name is single initial
  • Suffix present
  • Cross-references present (0/1)
  • Number of cross-references (continuous)
  • Has attended Homecoming (0/1)
  • Number of Homecomings attended (continuous)
  • Number of President’s Receptions attended
  • Position (i.e. job title) present
  • Employer present
  • Number of employment updates
  • Employment status present
  • Employment status is ‘retired’
  • ID number begins with ‘F’ (faculty)
  • Registered as a member of the alumni online community
  • Participated in Alumni Engagement Benchmarking Survey* (0/1)
  • Engagement Survey score (continuous)*
  • [Numerous variables created from specific Engagement survey questions, including the following specific ones]
  • Lived primarily in residence while a student [survey]
  • Received a scholarship or bursary [survey]
  • Number of children under 18 [survey]
  • Enjoys speaking with student callers for Phonathon [survey]
  • Likely to attend Homecoming [survey]
  • Likely to attend an event in their area [survey]
  • Holds degrees from other universities [survey]
  • Number of close family members who are also alumni [survey]
  • Span of giving (last year of giving minus first year of giving)
  • Frequency of giving (gifts per year during span of giving)
  • Number of years in which gifts were made
  • Lifetime giving
  • Number of gifts
  • Recency: Gave in past year
  • Recency: Gave at least once in past two years
  • Recency: Gave at least once in past three years
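The giving-history variables at the end of the list can all be derived from a simple list of gift years. The Python sketch below shows one way to do it, under assumptions of my own: frequency divides by a span of at least one year so that single-year donors do not cause a division by zero, and "gave in the past N years" means the most recent gift year is no earlier than the current year minus N. The function and key names are hypothetical.

```python
def giving_features(gift_years, current_year):
    """Derive giving-history predictors from a list of gift years (one per gift)."""
    if not gift_years:
        return {"span": 0, "frequency": 0.0, "years_with_gifts": 0,
                "num_gifts": 0, "gave_past_1": 0, "gave_past_2": 0,
                "gave_past_3": 0}
    first, last = min(gift_years), max(gift_years)
    span = last - first  # last year of giving minus first year of giving
    features = {
        "span": span,
        # Assumption: treat a zero span as one year to avoid dividing by zero.
        "frequency": len(gift_years) / max(span, 1),
        "years_with_gifts": len(set(gift_years)),
        "num_gifts": len(gift_years),
    }
    for window in (1, 2, 3):
        # Assumption: "past N years" is measured in whole calendar years.
        features[f"gave_past_{window}"] = int(last >= current_year - window)
    return features


regular = giving_features([2005, 2005, 2007, 2009], current_year=2010)
one_time = giving_features([2006], current_year=2010)
print(regular)
print(one_time)
```

As noted above, these variables belong only in models whose predicted value is something other than giving.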

Every year I discover new data points hiding in our database. Many other variables are out there, but often the data exists only for our youngest alumni. Someday, I’m sure, this additional data will yield cool new predictors. For ideas on other variables to look for in your data (including non-university data), refer to the list that begins on page 138 of Joshua Birkholz’s book, “Fundraising Analytics.”
