CoolData blog

17 January 2010

Proving ‘event attendance likelihood’ actually works

Filed under: Event attendance, Model building, Predictive scores, skeptics — kevinmacdonell @ 6:56 pm

In an earlier post I talked about what you need to get started to build an ‘event attendance likelihood’ model. Today I want to provide some evidence to back up my claim that yes, you can identify which segment of your alumni population is most likely to attend your future event.

To recap: Every living, addressable alumnus/na in our database is scored according to how likely he or she is to attend an event, whether it be a President’s Reception or Homecoming, and whether or not he or she has ever attended one before.

The scores can be used to answer these types of questions:

• Which alumni living in Toronto fall into the top 30% and should be mailed a paper invite to the President’s Reception?
• Who are the 50 members of the Class of 2005 who are most likely to come to Homecoming for their 5th-year reunion?

I built our first event-attendance model last summer. As I always do, I divided all our alumni into deciles by the predicted values that are produced by the regression analysis (the ‘raw score’). The result is that all alumni were ranked from a high score of 10 (most likely to attend an event) to 1 (least likely).

At that time, alumni were sending in their RSVPs for that fall’s Homecoming event. Because I use only actual-attendance data in my models, these RSVPs were not used as a source of data. … That made Homecoming 2009 an excellent test of the predictive strength of the new model.

Have a look at this chart, which shows how much each decile score contributed to total attendance for Homecoming 2009. The horizontal axis is Decile Score, and the vertical axis is Percentage of Attendees. Almost 45% of all alumni attendees had a score of 10 (the bar highlighted in red).

(A little over 4% of alumni attendees had no score. Most of these would have been classified as ‘lost’ when the model was created, and therefore were excluded at that time. In the chart, they are given a score of zero.)

To put it another way, almost three-quarters of all alumni attendees had a score of 8 or higher. But the alumni scoring 10 are the ones who really stand out.

Let me anticipate an objection you might have: Those high-scoring alumni are just the folks who have shown up for events in the past. You might say that the model is just predicting that past attendees are going to attend again.

Not quite. In fact, a sizable percentage of the 10-scores who attended Homecoming had never attended an event before: 23.1%.

The chart below shows the number of events previously attended by the 10-scored alumni who were at Homecoming in 2009. The newbies are highlighted in red.

The majority of high-scoring attendees had indeed attended previous events (a handful had attended 10 or more!). But nearly a quarter of them hadn’t – and they were still identified as extremely likely to attend in future.

That’s what predictive modeling excels at: Zeroing in on the characteristics of individuals who have exhibited a desired behaviour, and flagging other individuals from the otherwise undifferentiated masses who share those characteristics.

Think of any ‘desired behaviour’ (giving to the annual fund, giving at a higher level than before, attending events, getting involved as an alumni volunteer), then ensure you’ve got the historical behavioural data to build your model on. Then start building.

15 December 2009

Why you should use deciles and percentiles for scores

Filed under: Annual Giving, Planned Giving, Predictive scores — kevinmacdonell @ 11:13 am

The predictive modeling method I use (multiple regression) results in a “raw score” that is great for very fine ranking, because it will probably produce almost as many score levels as there are individuals in your sample. But it doesn’t work at all for other purposes.

For example, you can’t use ‘raw score’ to observe how a person’s or a group’s propensity to give changes from year to year. Your model changes over time, and so will the output. What does it mean if Joe’s raw score goes from 6349 to 9032? Not much. The value of the score itself has no practical meaning.

Because the values are not easy to explain to end-users, and because they change so much from year to year, you need to provide a more intuitive scoring system.

If you take everyone in the sample and divide them up into groups of roughly equal numbers, by their raw score, you produce a much more useful ranking.

Equal quarters are quartiles, equal fifths are quintiles, and so on. For our needs, equal tenths (deciles) and equal hundredths (percentiles) are the most useful.
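The equal-sized grouping can be sketched with pandas (my assumption; any stats package offers an equivalent). `qcut` splits a sample into bins holding roughly equal counts, which is exactly the decile/percentile construction described above. The data here is randomly generated for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical raw propensity scores for 1,000 constituents
rng = np.random.default_rng(0)
raw = pd.Series(rng.random(1000))

# Equal tenths and equal hundredths, labelled 1..10 and 1..100
decile = pd.qcut(raw, 10, labels=range(1, 11)).astype(int)
percentile = pd.qcut(raw, 100, labels=range(1, 101)).astype(int)

# Each decile holds ~100 people; each percentile ~10.
print(decile.value_counts().sort_index())
```

Because the bins are defined by rank rather than by raw value, the labels stay comparable from year to year even as the model (and its raw output) changes.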

For example, if Joe goes from the 60th percentile last year to the 93rd percentile this year, that’s a meaningful change.

But usually we’re not as interested in the score of a single individual as we are in getting a handle on a whole segment of a population. If your annual giving coordinator knows that the 9th and 10th deciles are always where the money is, regardless of how your model changes in any given year, you’ve taken a big step towards clarity. If your results aren’t clear, no one will embrace them.

Which type of predictive score you would use, deciles or percentiles, depends on how selective you need to be:

• An Annual Giving manager trying to prioritize groups for the Telethon campaign might focus on the top one, two or three deciles. That represents thousands of alumni whose raw propensity-to-give scores place them in the top 10% to 30% of the population.
• A Planned Giving Officer trying to zero in on the best prospects might focus on no more than the top 1-5% of the population. For that person, percentiles will provide a much sharper knife.

When I produce a set of scores, I usually provide all three types, because one can’t always anticipate needs. The screen in Banner that holds the scores (APAEXRS) is able to accommodate three ‘flavours’ of scores, so I usually upload raw, decile, and percentile scores for each model.

Unfortunately, the output scores of a regression model are messy! They have to be worked on a bit in order to whip them into shape before you upload them to the database. Here’s what they often look like in their ‘unprocessed’ state:

```
0.14878468  0.14879054  0.1488943   0.14901018
0.14901177  0.14901665  0.1491403   0.14914408
```

The Banner field I upload scores to is able to contain four digits. So for ‘raw score’, I create a derived variable in Data Desk that multiplies these values by a thousand and rounds to the nearest whole number.
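That transformation is simple enough to sketch (illustrative Python, not the Data Desk derived-variable syntax), using the raw values shown above:

```python
raw_scores = [0.14878468, 0.14879054, 0.1488943, 0.14901018]

# Multiply by 1,000 and round to the nearest whole number so the value
# fits a four-digit field.
uploadable = [round(r * 1000) for r in raw_scores]
print(uploadable)  # [149, 149, 149, 149]
```

Note the side effect: closely spaced raw scores collapse to the same rounded value, so the uploaded ‘raw score’ keeps only as much fine ranking as four digits allow.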

Deciles and percentiles are not exactly available at the push of a button, either. In future posts I will describe the methods I’ve been taught to produce a nice, clean set of scores for upload.