CoolData blog

17 February 2010

Is ‘overfitting’ really a problem?

Filed under: Model building, Pitfalls, Planned Giving, Predictor variables – kevinmacdonell @ 8:06 am


Overfitting describes a condition where your model fits your data “too well”. Your model describes your sample nearly perfectly, but is too rigid to fit any other sample. It isn’t loose enough to serve your predictive needs.

Is this something you ought to worry about? My response is a qualified ‘no’.

First, if your sample is very large, in the many thousands of records, and you’re modeling for a behaviour which is not historically rare (giving to the Annual Fund, for example), then overfit just isn’t an issue. Overfit is something to watch for when you’ve got small sample sizes or your data is limited in some way: building a Planned Giving or Major Giving model based on only a handful of existing cases of the desired behaviour, for example.

Overfit has always sounded like a theoretical problem to me, something that bothers analysts working at some rarefied higher level of modeling refinement. My goal has always been to improve on existing segmenting practices; if the bar is set at “throwing darts at the board,” one is going to be happy with the results of a predictive model, even if it’s wearing a too-restrictive corset.

And yet … doubts crept in.

While creating a model for Planned Giving potential I discovered a characteristic prevalent among our existing expectancies which gave me pause. Many of our existing commitments are from clergy, a number of whom live in retirement on campus. This results from our institution’s history and its traditional association with the Roman Catholic Church. Not surprisingly, a name prefix identifying clergy turned out to be a highly predictive variable. Using the variable in the model would have boosted the fit – but at what cost?

Here’s the problem. Elderly clergy members may be the model for past and current expectancies, but I was not confident that the Planned Giving donors of the future would resemble them. The growing distance between church and university, a result of societal change, was one of the reasons leading me to think that using this variable would be a mistake – this model needed more leeway than that. It took a while for me to make the connection between this gut feeling and the rather abstract concept of ‘overfit’.

This, then, is my advice: Forget about the theory and use common sense – are any of your predictor variables likely to do a much better job describing the reality of the past than that of the future? Don’t overthink it: If your gut’s mostly okay with it, then don’t worry about it. Otherwise, consider sacrificing a little R-squared to get a better model.
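If you would like a number to set beside the gut check, here is a minimal Python sketch. It assumes you can split your alumni into an older and a younger band of classes and compare how strongly a flag (here a hypothetical `clergy_prefix`) goes with the outcome in each. The field names, the split and all the counts are invented for illustration; this is not the method used for the model described above.

```python
def lift(records, flag):
    """Outcome rate among flagged records divided by the overall outcome rate.

    Assumes the group is non-empty, has at least one flagged record,
    and has at least one positive outcome.
    """
    flagged = [r for r in records if r[flag]]
    overall_rate = sum(r["committed"] for r in records) / len(records)
    flagged_rate = sum(r["committed"] for r in flagged) / len(flagged)
    return flagged_rate / overall_rate


def make_cohort(flag_yes, flag_no, plain_yes, plain_no):
    """Build made-up records from counts of (clergy, committed) combinations."""
    return (
        [{"clergy_prefix": 1, "committed": 1}] * flag_yes
        + [{"clergy_prefix": 1, "committed": 0}] * flag_no
        + [{"clergy_prefix": 0, "committed": 1}] * plain_yes
        + [{"clergy_prefix": 0, "committed": 0}] * plain_no
    )


# Invented numbers, for illustration only.
older_classes = make_cohort(flag_yes=3, flag_no=1, plain_yes=1, plain_no=5)
younger_classes = make_cohort(flag_yes=1, flag_no=1, plain_yes=3, plain_no=5)

older_lift = lift(older_classes, "clergy_prefix")
younger_lift = lift(younger_classes, "clergy_prefix")
print(older_lift, younger_lift)
```

A flag whose lift is much weaker among the younger classes is a candidate for the kind of variable that describes the past better than the future.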


5 February 2010

Rare-event modeling: Terrorists and planned giving

Filed under: Model building, Planned Giving – kevinmacdonell @ 3:11 pm

In January, the White House released a review of the incident in which a would-be bomber nearly destroyed a passenger jet in flight on Christmas Day. Why did anti-terrorism officials fail to identify and counter this threat? According to the report, part of the problem was in the databases, and in the data-mining software: “Information technology within the CT [counterterrorism] community did not sufficiently enable the correlation of data that would have enabled analysts to highlight the relevant threat information.”

I’ve just finished reading Stephen Baker’s book, The Numerati, published in 2008. In a chapter called simply “Terrorist”, he observes that it’s nearly impossible to build a predictive model of “rare or unprecedented events,” citing the few cataclysmic examples that we all know about. “This is because math-based predictions rely on patterns of past behaviour,” he writes.

Known and suspected terrorists are presumably the needle in a huge haystack that includes you, me, and everyone else in the world. Terrorists are practically invisible in such a sea of identities, they work hard at avoiding detection, and they trigger events that may never have happened before.

Not to trivialize the subject, but while reading this it struck me that some of the models we build in the more prosaic world of fundraising are in the related business of modeling for rare events. I’m thinking primarily of Major Gifts and Planned Giving. As tricky as this sort of prediction is, we can be thankful for three things: The events we are trying to predict are rare but not unprecedented, the data set has precise limits, and the stakes are not nearly as high.

Here is a basic tip for improving the power of a Planned Giving model. My first attempt at a PG model included the full data set of alumni, from the oldest alum right up to the Class of 2009. We had a limited number of people in the database identified as existing PG commitments, and they were swimming in that ocean of data. I took a number of steps to improve the model, but the most obvious was to exclude all the younger alumni. They would not normally be considered PG prospects, and eliminating them boosted the ratio of PG commitments to the general population.

Look at your existing commitments, identify who the youngest is (by class year, probably), and exclude all the alumni who are younger than that. (Use a selector variable in Data Desk right in your regression table, if that’s what you’re doing.) If your youngest is an outlier, then pick the second-youngest as your cutoff – but don’t eliminate the outlier individual, because you need all the historical data you can get!
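For anyone working outside Data Desk, the cutoff rule can be sketched in a few lines of Python. The field names `class_year` and `pg_commitment` and the sample records are hypothetical, and the function is an illustration under those assumptions, not the selector used in the original model.

```python
def pg_modeling_population(alumni, skip_outlier=False):
    """Return (cutoff_year, kept_records) for a Planned Giving model.

    alumni: list of dicts with 'class_year' (int) and 'pg_commitment' (0/1).
    The cutoff is the class year of the youngest existing commitment, or of
    the second-youngest when skip_outlier is True.
    """
    commitment_years = sorted(
        (a["class_year"] for a in alumni if a["pg_commitment"]), reverse=True
    )
    if not commitment_years:
        raise ValueError("no existing commitments to set a cutoff from")
    use_second = skip_outlier and len(commitment_years) > 1
    cutoff = commitment_years[1] if use_second else commitment_years[0]
    # Existing commitments are always kept, even a young outlier.
    kept = [a for a in alumni if a["class_year"] <= cutoff or a["pg_commitment"]]
    return cutoff, kept


# Invented sample records.
alumni = [
    {"id": 1, "class_year": 1960, "pg_commitment": 1},
    {"id": 2, "class_year": 1975, "pg_commitment": 0},
    {"id": 3, "class_year": 1988, "pg_commitment": 1},
    {"id": 4, "class_year": 1995, "pg_commitment": 0},
    {"id": 5, "class_year": 2004, "pg_commitment": 1},  # young outlier
    {"id": 6, "class_year": 2009, "pg_commitment": 0},
]

cutoff, kept = pg_modeling_population(alumni)
cutoff_2, kept_2 = pg_modeling_population(alumni, skip_outlier=True)
print(cutoff, [a["id"] for a in kept])
print(cutoff_2, [a["id"] for a in kept_2])
```

With `skip_outlier=True` the cutoff moves back to the second-youngest commitment, but the outlier's own record stays in the data set.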

19 January 2010

Have a dues-based alumni association? Read this paper.

Filed under: Alumni, Model building, Predictor variables – kevinmacdonell @ 4:42 pm

Advancement consultant Peter B. Wylie and predictive modeling expert John Sammis have recently published a new paper, Data Mining and Alumni Association Membership. Like all of their work, it’s written in a way anyone can understand. And as some of my recent posts have pointed out, it shows how data mining can be a powerful tool when used to predict all sorts of behaviours besides giving.

This time they’re showing you how you can use certain key pieces of information in your database to predict who will be most likely to want to join your dues-based alumni association. Their paper identifies the key variables that tend to be strongly related to active alumni association membership, and demonstrates how to create a predictive score. Their data came from four public universities with graduate and undergraduate enrollments that ranged from 4,500 to 27,000.

They believe schools should be using this information to save money on membership appeals, and boost membership.

And I do, too.

Addendum (20 Jan 2010): FYI, Peter Wylie is interviewed in the current issue of CASE Currents magazine. Will post link if it becomes available.

17 January 2010

Proving ‘event attendance likelihood’ actually works

Filed under: Event attendance, Model building, Predictive scores, skeptics – kevinmacdonell @ 6:56 pm

In an earlier post I talked about what you need to get started to build an ‘event attendance likelihood’ model. Today I want to provide some evidence to back up my claim that yes, you can identify which segment of your alumni population is most likely to attend your future event.

To recap: Every living, addressable alumnus/na in our database is scored according to how likely he or she is to attend an event, whether it be a President’s Reception or Homecoming, whether they’ve ever attended an event or not.

The scores can be used to answer these types of questions:

  • What’s the top 30% of alumni living in Toronto who should be mailed a paper invite to the President’s Reception?
  • Who are the 50 members of the Class of 2005 who are most likely to come to Homecoming for their 5th-year reunion?
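Both questions come down to filtering and then ranking by score. Here is a small Python sketch, assuming each alumni record carries hypothetical `city`, `class_year` and `raw_score` fields; the sample data is invented.

```python
import math


def rank_by_score(alumni):
    """Sort records from highest raw score to lowest."""
    return sorted(alumni, key=lambda a: a["raw_score"], reverse=True)


def top_percent(alumni, city, percent):
    """Top `percent` of alumni living in `city`, by raw score (rounded up)."""
    local = rank_by_score([a for a in alumni if a["city"] == city])
    keep = math.ceil(len(local) * percent / 100)
    return local[:keep]


def top_n(alumni, class_year, n):
    """The `n` highest-scoring members of one class."""
    cohort = rank_by_score([a for a in alumni if a["class_year"] == class_year])
    return cohort[:n]


# Invented records: 20 alumni split between two cities and two classes.
alumni = [
    {
        "id": i,
        "city": "Toronto" if i % 2 == 0 else "Halifax",
        "class_year": 2005 if i < 6 else 1998,
        "raw_score": i / 20,
    }
    for i in range(20)
]

toronto_mailing = top_percent(alumni, "Toronto", 30)
reunion_prospects = top_n(alumni, 2005, 2)
print([a["id"] for a in toronto_mailing])
print([a["id"] for a in reunion_prospects])
```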

I built our first event-attendance model last summer. As I always do, I divided all our alumni into deciles by the predicted values that are produced by the regression analysis (the ‘raw score’). The result is that all alumni were ranked from a high score of 10 (most likely to attend an event) to 1 (least likely).
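Here is one way the decile step could look in Python, as a sketch only: it ranks the raw scores and cuts the ranking into ten equal slices. The function name is made up, and ties are broken by position in the list, which may differ from how your stats package handles them.

```python
def decile_scores(raw_scores):
    """Map each raw score to a decile: 10 for the top tenth, 1 for the bottom.

    Ties are broken by original position, because sorted() is stable.
    """
    n = len(raw_scores)
    order = sorted(range(n), key=lambda i: raw_scores[i])  # lowest first
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // n + 1
    return deciles


# Ten invented raw scores, one alum per decile.
example = decile_scores([0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0])
print(example)
```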

At that time, alumni were sending in their RSVPs for that fall’s Homecoming event. Because I use only actual-attendance data in my models, these RSVPs were not used as a source of data. … That made Homecoming 2009 an excellent test of the predictive strength of the new model.

Have a look at this chart, which shows how much each decile score contributed to total attendance for Homecoming 2009. The horizontal axis is Decile Score, and the vertical axis is Percentage of Attendees. Almost 45% of all alumni attendees had a score of 10 (the bar highlighted in red).

(A little over 4% of alumni attendees had no score. Most of these would have been classified as ‘lost’ when the model was created, and therefore were excluded at that time. In the chart, they are given a score of zero.)

To put it another way, almost three-quarters of all alumni attendees have a score of 8 or higher. But those 10 scores are the ones who really stand out.
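The tally behind a chart like this is simple to reproduce. The sketch below assumes you have a list of attendee IDs and a lookup from ID to decile score. Attendees missing from the lookup get a score of zero, as described above. The IDs and counts are invented and are not the Homecoming 2009 figures.

```python
from collections import Counter


def attendee_share_by_decile(attendee_ids, decile_by_id):
    """Percentage of attendees at each score, 0 (unscored) through 10."""
    counts = Counter(decile_by_id.get(alum_id, 0) for alum_id in attendee_ids)
    total = len(attendee_ids)
    return {score: 100 * counts[score] / total for score in range(0, 11)}


# Invented data: ten attendees, one of whom was never scored.
decile_by_id = {"a": 10, "b": 10, "c": 10, "d": 10, "e": 9, "f": 9,
                "g": 8, "h": 5, "i": 5}
attendees = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "unscored"]

share = attendee_share_by_decile(attendees, decile_by_id)
share_8_and_up = share[8] + share[9] + share[10]
print(share[10], share_8_and_up, share[0])
```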

Let me anticipate an objection you might have: Those high-scoring alumni are just the folks who have shown up for events in the past. You might say that the model is just predicting that past attendees are going to attend again.

Not quite. In fact, a sizable percentage of the 10-scores who attended Homecoming had never attended an event before: 23.1%.

The chart below shows the number of events previously attended by the 10-scored alumni who were at Homecoming in 2009. The newbies are highlighted in red.

The majority of high-scoring attendees had indeed attended previous events (a handful had attended 10 or more!). But almost a quarter hadn’t – and were still identified as extremely likely to attend in future.

That’s what predictive modeling excels at: Zeroing in on the characteristics of individuals who have exhibited a desired behaviour, and flagging other individuals from the otherwise undifferentiated masses who share those characteristics.

Think of any ‘desired behaviour’ (giving to the annual fund, giving at a higher level than before, attending events, getting involved as an alumni volunteer), then ensure you’ve got the historical behavioural data to build your model on. Then start building.

14 January 2010

Building your ‘event attendance likelihood’ model

Filed under: Event attendance, Model building – kevinmacdonell @ 12:20 pm

Photo courtesy of Alumnae Association of Mount Holyoke College (Creative Commons licence)

Your model’s predicted value doesn’t always have to be ‘giving’. Once you’ve discovered the power of predictive modeling for your fundraising efforts, you can direct that power into other Advancement functions.

How about alumni event attendance?

I’ve had great success with this new model, which scores all of our alumni according to how likely they are to attend an event. I’ll show you what we use it for, and then I’ll bounce a cool idea off you for your thoughts.

If you’ve read some earlier posts, you will already know that event attendance is highly correlated with giving (for our institution – but probably yours as well). Event attendance is an excellent predictor of giving, but it works the other way too: giving is a predictor of propensity to attend events.

We can say this because when we build our models we’re concerned only with correlation, not causation. It would be incorrect for me to say that attending events causes an alum to give, or vice-versa. I don’t know enough to make a statement either way. It could be that both behaviours spring from other influences. It’s enough for our purposes to say that they’re linked in a meaningful way.
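A correlation coefficient makes the "works the other way too" point concrete: it comes out the same whichever variable you put first, and it says nothing about which one drives the other. Here is a small Python sketch of Pearson's r on invented data (events attended versus a 0/1 donor flag).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented data for eight alumni.
events_attended = [0, 0, 1, 0, 2, 3, 0, 1]
is_donor = [0, 0, 1, 0, 1, 1, 0, 0]

r_events_vs_giving = pearson_r(events_attended, is_donor)
r_giving_vs_events = pearson_r(is_donor, events_attended)
print(r_events_vs_giving, r_giving_vs_events)
```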

To create an event attendance likelihood model you need at least a few years of actual attendance data. I was lucky – I had Homecoming data going back to 1999, as well as a few years of data for alumni receptions across the country. (Gathering this data pays off in many ways besides predictive modeling. See my earlier post, Why you should capture alumni event attendance in your database.)

I gave a lot of thought as to whether I should consider Homecoming and off-campus receptions separately. Clearly they are not the same thing, and perhaps should not have been weighted equally. However, for the sake of simplicity, I regarded all events as the same when I calculated my predicted value (‘number of events attended’). As long as an alumnus/na had to RSVP for the event AND showed up, they got a point for that event.

Another consideration is opportunity. To validly count off-campus events, ALL alumni should have at least had the option to attend an event. It is true that there are many cities where we have yet to host an event. However, I reasoned, we’ve hosted events in many of the towns and cities where the majority of our alumni live (or can reasonably travel to). Therefore I chose to include receptions along with Homecoming. Was I wrong? Not sure!

(Events I chose to leave out were of the exclusive, invite-only type. Because not all alumni were given the opportunity to attend, those events are not suitable to use in this model.)
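The counting rule (one point per event where the alum both RSVP'd and showed up, invite-only events left out) could be sketched in Python like this. The row layout, field names and event codes are assumptions made for illustration.

```python
def events_attended(rows, alumni_ids):
    """Count qualifying events per alum.

    rows: dicts with 'alum_id', 'event_id', 'rsvp', 'showed_up', 'invite_only'.
    An event counts only if it was open to all alumni and the alum both
    RSVP'd and showed up. Duplicate rows for the same event count once.
    """
    counted = set()
    for row in rows:
        if row["invite_only"]:
            continue
        if row["rsvp"] and row["showed_up"]:
            counted.add((row["alum_id"], row["event_id"]))
    totals = {alum_id: 0 for alum_id in alumni_ids}
    for alum_id, _event_id in counted:
        if alum_id in totals:
            totals[alum_id] += 1
    return totals


# Invented attendance rows.
rows = [
    {"alum_id": "A", "event_id": "HC2008", "rsvp": True, "showed_up": True, "invite_only": False},
    {"alum_id": "A", "event_id": "HC2009", "rsvp": True, "showed_up": False, "invite_only": False},
    {"alum_id": "A", "event_id": "GALA", "rsvp": True, "showed_up": True, "invite_only": True},
    {"alum_id": "B", "event_id": "HC2009", "rsvp": True, "showed_up": True, "invite_only": False},
    {"alum_id": "B", "event_id": "RECEPTION", "rsvp": True, "showed_up": True, "invite_only": False},
]

y = events_attended(rows, ["A", "B", "C"])
print(y)
```

Alumni with no qualifying rows still get a value of zero, so everyone in the population has a predicted value to model against.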

You create a new model whenever you change the predicted value. Whether you use Peter Wylie’s simple-score method or multiple regression to create your model, when you make “number of events attended” your predicted value, your resulting score set will help to rank all alumni by how likely they are to show up to your event.

Here’s how we use those scores.


Let’s say the Alumni Office wants to send out invitations for Homecoming or for a reception in a city somewhere. Email is a no-brainer. It’s cheap and fast, and alumni of all ages seem very receptive to receiving communications that way.

Naturally we still mail out paper invitations, but for various reasons (cost being supreme), we have to be more selective. Some criteria we use for selecting who will get a mailing are included in the list below. The criteria are adjusted to be more or less restrictive, depending on what our target for mail pieces is.

  • Lifetime household and business giving $x and up
  • Member of donor recognition group in a recent year
  • Has a Planned Giving commitment
  • Identified as an ‘involved’ young alumnus/na
  • Attended Homecoming once in past ‘x’ years
  • Attended a previous event in region

The problem with these criteria is that so many alumni (particularly young alumni) might attend our event but aren’t donors and have never attended an event before. If the goal is attracting new faces to your event, you need some way to segment the ‘willing’ from the uninterested masses, and give them the extra attention they deserve.

This is where predictive modeling shines. I’ll have more to say about building this model later.

Now I want to bounce a cool idea off you. Let’s say you’ve created your model, scored all your alumni, and have since then put on several large events. Those events have generated actual attendance data. Let’s say you use this attendance data to work out the ‘percentage attended’ for each score level. Would that not provide you with a rough estimate of projected attendance for any given invitation list in the future? With incremental adjustments over time, and perhaps for different event types, would this be a valid tool your event planners could use?

I want to know!

An example. Let’s say you have an event coming up in Los Angeles, and your invitation list for that city includes 200 alumni who have a score of 10 in the Event Likelihood Model. You know from past events that 20% of alumni with that score will show up. Therefore you expect to see about 40 of them in Los Angeles. You add in 12% for the next level, 8% for the next level, and so on, and sum it all up to get your total projected attendance.
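The arithmetic in that example can be sketched in Python. The 20%, 12% and 8% rates and the 200 ten-scores come from the example above; the invitation counts for scores 9 and 8 are invented, and both function names are hypothetical.

```python
def attendance_rates(history):
    """Share of past invitees who attended, per score level.

    history: iterable of (score, attended) pairs, one per past invitee.
    """
    invited = {}
    attended = {}
    for score, showed_up in history:
        invited[score] = invited.get(score, 0) + 1
        attended[score] = attended.get(score, 0) + (1 if showed_up else 0)
    return {score: attended[score] / invited[score] for score in invited}


def projected_attendance(invite_counts, rate_by_score):
    """Sum of (number invited at each score) x (past attendance rate)."""
    return sum(
        count * rate_by_score.get(score, 0.0)
        for score, count in invite_counts.items()
    )


# Rates as in the example; counts for scores 9 and 8 are made up.
rates = {10: 0.20, 9: 0.12, 8: 0.08}
los_angeles_list = {10: 200, 9: 150, 8: 100}

expected = projected_attendance(los_angeles_list, rates)
print(round(expected))
```

In practice the rates would come from `attendance_rates` run over your own past events, perhaps kept separately for each event type.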

Valid? Not valid?

11 January 2010

The 15 top predictors for Planned Giving – Part 3

Okay, time to deliver on my promise to divulge the top 15 predictor variables for propensity to enter a Planned Giving commitment.

Recall the caveat about predictors that I gave for Annual Giving: These variables are specific to the model I created for our institution. Your most powerful predictors will differ. Try to extract these variables from your database for testing, by all means, but don’t limit yourself to what you see here.

In Part 2, I talked about a couple of variables based on patterns of giving. The field of potential variables available in giving history is rich. Keep in mind, however, that these variables will be strongly correlated with each other. If you’re using a simple-score method (adding 1 to an individual’s score for each positively-correlated predictor variable), be careful about using too many of them and exaggerating the importance of past giving. On the other hand, if you use a multiple regression analysis, these related variables will interact with each other – this is fine, but be aware that some of your hard-won variables may be reduced to complete insignificance.

Just another reason to look beyond giving history!
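To see how giving variables can pile up under the simple-score method, here is a minimal Python sketch. The flag names are hypothetical stand-ins, not the actual variables in the model.

```python
def simple_score(record, positive_flags):
    """Add 1 for each positively-correlated flag that is set on the record."""
    return sum(1 for flag in positive_flags if record.get(flag))


# Hypothetical flags: three of the five are drawn from giving history.
flags = [
    "gave_in_past_3_years",
    "gave_in_5_or_more_years",
    "lifetime_giving_above_median",
    "attended_homecoming",
    "business_phone_present",
]

steady_donor = {
    "gave_in_past_3_years": 1,
    "gave_in_5_or_more_years": 1,
    "lifetime_giving_above_median": 1,
}
engaged_non_donor = {"attended_homecoming": 1, "business_phone_present": 1}

donor_score = simple_score(steady_donor, flags)
non_donor_score = simple_score(engaged_non_donor, flags)
print(donor_score, non_donor_score)
```

The donor outscores the engaged non-donor on giving history alone, which is the exaggeration to watch for when several correlated giving variables are in the mix.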

For this year’s Planned Giving propensity model, the predicted value (‘Y’) was a 0/1 binary value: “1” for our existing commitments, “0” for everyone else. (Actually, it was more complicated than that, but I will explain why some other time.)

The population was composed of all living alumni Class of 1990 and older.

The list

The most predictive variables (roughly in order of influence) are listed below. Variables that have a negative correlation are noted N. Note that very few of these variables can be considered continuous (e.g. Class Year) or ordinal (survey scale responses). Most are binary (0/1). But ALL are numeric, as required for regression.

  1. Total lifetime giving
  2. Number of Homecomings attended
  3. Response to alumni survey scale question, regarding event attendance
  4. Number of President’s Receptions attended
  5. Class Year (N)
  6. Recency: Gave in the past 3 years
  7. Holds another degree from another university (from survey)
  8. Marital status ‘married’
  9. Prefix is Religious (Rev., etc.) or Justice
  10. Alumni Survey Engagement score
  11. Business phone present
  12. Number of children under 18 (from survey) (N)

Like my list of Annual Giving predictors, this isn’t a full list (and it isn’t 15 either!). These are the most significant predictors which don’t require a lot of explanation.

Note how few of these variables are based on giving – ‘Years of giving’ and ‘Frequency of giving’ don’t even rate. (‘Lifetime giving’ seems to take care of most of the correlation between giving and Planned Giving commitment.) And note how many variables don’t even come from our database: They come from our participation in a national survey for benchmarking of alumni engagement (conducted in March 2009).
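Since every variable has to be numeric, most of the work is turning text fields into 0/1 flags. Here is a small Python sketch for a few of the variables in the list, assuming hypothetical raw field names. The prefix set holds only the two examples named above, and you would extend it to match your own data.

```python
# "Rev." and "Justice" come from the list above; add your own prefixes.
RELIGIOUS_OR_JUSTICE_PREFIXES = {"Rev.", "Justice"}


def encode(record):
    """Turn a raw alumni record into numeric predictor variables."""
    return {
        "prefix_religious_or_justice": int(
            record.get("prefix") in RELIGIOUS_OR_JUSTICE_PREFIXES
        ),
        "married": int(record.get("marital_status") == "married"),
        "business_phone_present": int(bool(record.get("business_phone"))),
        "class_year": int(record["class_year"]),
    }


# Invented records.
clergy = encode(
    {"prefix": "Rev.", "marital_status": "single", "business_phone": "", "class_year": 1962}
)
other = encode(
    {"prefix": "Ms.", "marital_status": "married", "business_phone": "555-0100", "class_year": 1985}
)
print(clergy)
print(other)
```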
