CoolData blog

20 September 2012

When less data is more, in predictive modelling

When I started doing predictive modelling, I was keenly interested in picking the best and coolest predictor variables. As my understanding deepened, I turned my attention to how to define the dependent variable in order to really get at what I was trying to predict. More recently, however, I’ve been thinking about refining or limiting the population of constituents to be scored, and how that can help the model.

What difference does it make who gets a propensity score? Up until maybe a year ago, I wasn’t too concerned. Sure, probably no 22-year-old graduate had ever entered a planned giving agreement, but I didn’t see any harm in applying a score to all our alumni, even our youngest.

Lately, I’m not so sure. Using the example of a planned gift propensity model, the problem is this: Young alumni don’t just get a score; they also influence how the model is trained. If all your current expectancies were at least 50 before they decided to make a bequest, and half your alumni are under 30 years old, then one of the major distinctions your model will make is based on age. ANY alum over 50 is going to score well, regardless of whether he or she has any affinity to the institution, simply because 100% of your target is in that age group.

The model is doing the right thing by giving higher scores to older alumni. If ages in the sample range from 21 to 100+, then age as a variable will undoubtedly contribute to a large chunk of the model’s ability to “explain” the target. But this hardly tells us anything we didn’t already know. We KNOW that alumni don’t make bequest arrangements at age 22, so why include them in the model?

It’s not just that giving them a score is pointless. I’m concerned about allowing good predictor variables to interact with ‘Age’ in a way that compromises their effectiveness. Those variables end up being moderated by ‘Age’ without any corresponding improvement in the model’s ability to do what we actually want it to do.

Note that we don’t have to explicitly enter ‘Age’ as a variable in the model for young alumni to influence the outcome in undesirable ways. Here’s an example, using event attendance as a predictor:

Let’s say a lot of very young alumni and some very elderly constituents attend their class reunions. The older alumni who attend reunions are probably more likely than their non-attending classmates to enter into planned giving agreements — for my institution, that is definitely the case. On the other hand, the young alumni who attend reunions are probably no more or less likely than their non-attending peers to consider planned giving — no one that age is a serious prospect. So what happens to ‘event attendance’ as a predictor in a model whose dependent variable is ‘current planned giving expectancy’? Because a lot of young alumni who are not part of the target group attended events, the attribute of being an event attendee will be associated with NOT being a planned giving expectancy. Or at the very least, it will considerably dilute the positive association between predictor and target found among older alumni.

I confirmed this recently using some partly made-up data. The data file started out as real alumni data and included age, a flag for who is a current expectancy, and a flag for ‘event attendee’. I massaged it a bit by artificially bumping up the number of alumni under the age of 50 who were coded as having attended an event, to create a scenario in which an institution’s events are equally popular with young and old alike. In a simple regression model with the entire alumni file included in the sample, ‘event attendance’ was weakly associated with being a planned giving expectancy. When I limited the sample to alumni 50 years of age and older, however, the R squared statistic doubled. (That is, event attendance was about twice as effective at explaining the target.) Conversely, when I limited the sample to under-50s, R squared was nearly zero.

True, I had to tamper with the data in order to get this result. But even had I not, there would still have been many under-50 event attendees, and their presence in the file would still have reduced the observed correlation between event attendance and planned giving propensity, to no useful end.
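
If you want to run the same check on your own file, here is a rough sketch of the comparison in Python with pandas and statsmodels. This is not the tool I used, and the file name and column names (‘age’, ‘event_attendee’, ‘pg_expectancy’) are placeholders, but the idea is identical: fit the same simple regression on the full file and on age-restricted subsets, then compare R squared.

```python
import pandas as pd
import statsmodels.api as sm

# Placeholder file and column names; substitute your own.
alumni = pd.read_csv("alumni.csv")

def r_squared(df):
    """Regress the expectancy flag (0/1) on event attendance and return R squared."""
    X = sm.add_constant(df["event_attendee"])
    return sm.OLS(df["pg_expectancy"], X).fit().rsquared

print("All alumni:     ", round(r_squared(alumni), 4))
print("Age 50 and over:", round(r_squared(alumni[alumni["age"] >= 50]), 4))
print("Under 50:       ", round(r_squared(alumni[alumni["age"] < 50]), 4))
```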

You probably already know that it’s best not to lump deceased constituents in with living ones, or non-alumni along with alumni, or corporations and foundations along with persons. They are completely distinct entities. But depending on what you’re trying to predict, your population can fruitfully be split along other, more subtle distinctions. Here are a few:

  • For donor acquisition models, in which the target value is “newly-acquired donor”, exclude all renewed donors. You strictly want to have only newly-acquired donors and never-donors in your model. Your good prospects for conversion are the never-donors who most resemble the newly-acquired donors. Renewed donors don’t serve any purpose in such a model and will muddy the waters considerably.
  • Conversely, remove never-donors from models that predict major giving and leadership-level annual giving. Those higher-level donors tend not to emerge out of thin air: They have giving histories.
  • Looking at ‘Age’ again … making distinctions based on age applies to major-gift propensity models just as it does to planned giving propensity: Very young people do not make large gifts. Look at your data to find out at what age donors were when they first gave $1,000, say; there is a rough sketch of that calculation just after this list. This will help inform what your cutoff should be.
  • When building models specifically for Phonathon, whether donor-acquisition or contact likelihood, remove constituents who are coded Do Not Call or who do not have a valid phone number in the database, or who are unlikely to be called (international alumni, perhaps).
  • Exclude international alumni from event attendance or volunteering likelihood models, if you never offer involvement opportunities outside your own country or continent.
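
Here is the age-at-first-gift calculation mentioned above, sketched in Python with pandas. The file and column names are invented and your data will be structured differently, but the steps are the same: find each donor’s first gift of $1,000 or more, then work out how old the donor was at the time. The distribution of those ages is what suggests a sensible cutoff.

```python
import pandas as pd

# Invented file and column names; adjust to your own database extract.
gifts = pd.read_csv("gifts.csv", parse_dates=["gift_date"])
constituents = pd.read_csv("constituents.csv", parse_dates=["birth_date"])

# Date of each constituent's first gift of $1,000 or more
big_gifts = gifts[gifts["gift_amount"] >= 1000]
first_big = (
    big_gifts.groupby("constituent_id")["gift_date"].min()
             .rename("first_1k_date")
             .reset_index()
)

# Age at that first $1,000+ gift
merged = constituents.merge(first_big, on="constituent_id", how="inner")
merged["age_at_first_1k"] = (merged["first_1k_date"] - merged["birth_date"]).dt.days // 365

print(merged["age_at_first_1k"].describe())
```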

Those are just examples. As for general principles, I think both of the following conditions must be met in order for you to gain from excluding a group of constituents from your model. By a “group” I mean any collection of individuals who share a certain trait. Choose to exclude IF:

  1. Nearly 100% of constituents with the trait fall outside the target behaviour (that is, the behaviour you are trying to predict); AND,
  2. Having a score for people with that trait is irrelevant (that is, their scores will not result in any action being taken with them, even if a score is very low or very high).

You would apply the “rules” like this … You’re building a model to predict who is most likely to answer the phone, for use by Phonathon, and you’re wondering what to do with a bunch of alumni who are coded Do Not Call. Well, it stands to reason that 1) people with this trait will have little or no phone contact history in the database (the target behaviour), and 2) people with this trait won’t be called, even if they have a very high contact-likelihood score. The verdict is “exclude.”
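
Translated into an actual exclusion filter, the two rules look something like the sketch below (Python again, with invented flag names). The important part is that the exclusion happens before the model ever sees the data, so the Do Not Call group neither trains the model nor receives a score.

```python
import pandas as pd

# Invented flag names; adjust to your own coding scheme.
alumni = pd.read_csv("alumni.csv")

# Rule 1: Do Not Call constituents fall almost entirely outside the target
#         behaviour (phone contact history).
# Rule 2: Their scores would never be acted on, because they will not be called.
# Both conditions hold, so they are dropped before training and scoring.
modelling_sample = alumni[~alumni["do_not_call"] & alumni["has_valid_phone"]].copy()

print(f"Keeping {len(modelling_sample):,} of {len(alumni):,} constituents")
```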

It’s not often you’ll hear me say that less (data) is more. Fewer cases in your data file will in fact tend to depress your model’s R squared. But your ultimate goal is not to maximize R squared — it’s to produce a model that does what you want. Fitting the data is a good thing, but only when you have the right data.


6 June 2012

How you measure alumni engagement is up to you

Filed under: Alumni, Best practices, Vendors — kevinmacdonell @ 8:02 am

There’s been some back-and-forth on one of the listservs about the “correct” way to measure and score alumni engagement. One vendor, which claims to specialize in rigor, has been pressing for more scientific rigor. That emphasis is misplaced.

No doubt there are sophisticated ways of measuring engagement that I know nothing about, but the question I can’t get beyond is, how do you define “engagement”? How do you make it measurable so that one method applies everywhere? I think that’s a challenging proposition, one that limits any claim to “correctness” of method. This is the main reason that I avoid writing about measuring engagement — it sounds analytical, but inevitably it rests on some messy, intuitive assumptions.

The closest I’ve ever seen anyone come is Engagement Analysis Inc., a firm based here in Canada. They have a carefully chosen set of engagement-related survey questions which are held constant from school to school. The questions are grouped in various categories or “drivers” of engagement according to how closely related (statistically) the responses tend to be to each other. Although I have issues with alumni surveys and the dangers involved in interpreting the results, I found EA’s approach fascinating in terms of gathering and comparing data on alumni attitudes.

(Disclaimer: My former employer was once a client of this firm’s but I have no other association with them. Other vendors do similar and very fine work, of course. I can think of a few, but haven’t actually worked with them, so I will not offer an opinion.)

Some vendors may claim to be scientific or analytically correct, but the only requirement for quantifying engagement is that it be reasonable and (if you are benchmarking against other schools) consistent from school to school. In general, if you want to benchmark against other institutions and do it right, engage a vendor, because it’s not easily done.

But if you want to benchmark against yourself (that is, over time), don’t be intimidated by anyone telling you your method isn’t good enough. Just do your own thing. Survey if you like, but call first upon the real, measurable activities that your alumni participate in. There is no single right way, so find out what others have done. One institution will give more weight to reunion attendance than to showing up for a pub night, while another will weigh all event attendance equally. Another will ditch event attendance altogether in favour of volunteer activity, or some other indicator.

Can anyone say definitively that any of these approaches are wrong? I don’t think so — they may be just right for the school doing the measuring. Many schools (mine included) assign fairly arbitrary weights to engagement indicators based on intuition and experience. I can’t find fault with that, simply because “engagement” is not a quantity. It’s not directly measurable, so we have to use proxies which ARE measurable. Other schools measure the degree of association (correlation) between certain activities and alumni giving, and base their weights on that, which is smart. But it’s all the same to me in the end, because ‘giving’ is just another proxy for the freely interpretable quality of “engagement.”
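
For what it’s worth, here is a rough sketch of those two weighting approaches side by side, in Python with pandas. The indicator names and the judgment weights are invented, so treat it as an illustration of the mechanics rather than a recommendation.

```python
import pandas as pd

# Invented columns, one row per alum: 'reunion', 'pub_night', 'volunteer',
# 'email_open' (each 0/1), plus 'lifetime_giving' in dollars.
alumni = pd.read_csv("alumni.csv")
indicators = ["reunion", "pub_night", "volunteer", "email_open"]

# Option 1: weights assigned by intuition and experience (numbers are made up)
judgment_weights = {"reunion": 3, "pub_night": 1, "volunteer": 4, "email_open": 1}
alumni["score_judgment"] = sum(w * alumni[col] for col, w in judgment_weights.items())

# Option 2: weights proportional to each indicator's correlation with giving
# (in practice you might floor negative correlations at zero)
corr_weights = {col: alumni[col].corr(alumni["lifetime_giving"]) for col in indicators}
alumni["score_correlation"] = sum(w * alumni[col] for col, w in corr_weights.items())
```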

Think of devising a “love score” to rank people’s marriages in terms of the strength of the pair bond. A hundred analysts would head off in a hundred different directions at Step 1: Defining “love”. That doesn’t mean the exercise is useless or uninteresting, it just means that certain claims have to be taken with a grain of salt.

We all have plenty of leeway to choose the proxies that work for us, and I’ve seen a number of good examples from various schools. I can’t say one is better than another. If you do a good job measuring the proxies from one year to the next, you should be able to learn something from the relative rises and falls in engagement scores over time and across different groups of alumni.

Are there more rigorous approaches? Yes, probably. Should that stop you from doing your own thing? Never!

26 January 2012

More mistakes I’ve made

Filed under: Best practices, Peter Wylie, Pitfalls, Validation — kevinmacdonell @ 1:38 pm

A while back I wrote a couple of posts about mistakes I’ve made in data mining and predictive modelling. (See Four mistakes I have made and When your predictive model sucks.) Today I’m pleased to point out a brand new one.

The last days of work leading up to Christmas had me evaluating my new-donor acquisition models to see how well they’ve been working. Unfortunately, they were not working well. I had hoped — I had expected — to see newly-acquired donors clustered in the upper ranges of the decile scores I had created. Instead they were scattered all along the whole range. A solicitation conducted at random would have performed nearly as well.

Our mailing was restricted by score (roughly the top two deciles only), but our phone solicitation was broader, so donors came from the whole range of deciles.

Very disappointing. To tell the truth, I had seen this before: A model that does well predicting overall participation, but which fails to identify which non-donors are most likely to convert. I am well past the point of being impressed by a model that tells me what everyone already knows, i.e. that loyal donors are most likely to give again. I want to have confidence that acquisition mail dollars are spent wisely.

So it was back to the drawing board. I considered whether my model was suffering from overfitting, whether perhaps I had too many variables, too much random noise, or multicollinearity. I studied and rejected one possibility after another. After so much effort, I came rather close to concluding that new-donor acquisition is not just difficult — it might be darn near impossible.

Dire possibility indeed. If you can’t predict conversion, then why bother with any of this?

It was during a phone conversation with Peter Wylie that things suddenly became clear. He asked me one question: How did I define my dependent variable? I checked, and found that my DV was named “Recent Donors.” That’s all it took to find where I had gone wrong.

As the name of the DV suggested, it turned out that the model was trained on a binary variable that flagged anyone who had made a gift in the past two years. The problem was that this included everybody: long-time donors and newly-acquired donors alike. The model was highly influenced by the regular donors, and the new donors were lost in the shuffle.

It was a classic case of failing to properly define the question. If my goal was to identify the patterns and characteristics of newly-acquired donors, then I should have limited my DV strictly to non-donors who had recently converted to donors!

So I rebuilt the model, using the same data file and variables I had used to build the original model. This time, however, I pared the sample down to alumni who had never given a cent before fiscal 2009. They were the only alumni I needed to have scores for. Then I redefined my dependent variable so that non-donors who converted, i.e., who made a gift in either fiscal 2009 or 2010, were coded ‘1’, and all others were coded ‘0’. (I used two years of giving data instead of just one in order to have a little more data available for defining the DV.) Finally, I output a new set of decile scores from a binary logistic regression.
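
For anyone who wants to see the shape of that rebuild, here is a sketch in Python with pandas and statsmodels. The column names and predictors are placeholders and this is not the tool I used, but the logic is the same: restrict the sample to pre-2009 never-donors, define the DV as conversion in fiscal 2009 or 2010, fit a binary logistic regression, and cut the predicted probabilities into deciles.

```python
import pandas as pd
import statsmodels.api as sm

# Placeholder columns: 'giving_before_fy2009' (dollars given before fiscal 2009),
# 'gave_fy2009_or_fy2010' (0/1), plus whatever predictor variables you have on file.
alumni = pd.read_csv("alumni.csv")

# Only true non-donors as of the start of fiscal 2009 belong in the sample
sample = alumni[alumni["giving_before_fy2009"] == 0].copy()

# Dependent variable: converted (made a first gift) in fiscal 2009 or 2010
y = sample["gave_fy2009_or_fy2010"]
predictors = ["event_attendee", "email_on_file", "years_since_grad"]  # placeholders
X = sm.add_constant(sample[predictors])

model = sm.Logit(y, X).fit()

# Decile scores from the predicted probabilities; 10 = most likely to convert
sample["p_convert"] = model.predict(X)
sample["decile"] = pd.qcut(sample["p_convert"].rank(method="first"), 10,
                           labels=range(1, 11))
```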

A test of the new scores showed that the new model was a vast improvement over the original. How did I test this? Recall that I reused the same data file from the original model. Therefore, it contained no giving data from the current fiscal year; the model was innocent of any knowledge of the future. Compare this breakdown of new donors with the one above.

Much better. Not fan-flippin-tastic, but better.

My error was a basic one — I’ve even cautioned about it in previous posts. Maybe I’m stupid, or maybe I’m just human. But like anyone who works with data, I can figure out when I’m wrong. That’s a huge advantage.

  • Be skeptical about the quality of your work.
  • Evaluate the results of your decisions.
  • Admit your mistakes.
  • Document your mistakes and learn from them.
  • Stay humble.

3 October 2011

Data 1, Gut Instinct 0

Filed under: Annual Giving, Best practices, Phonathon — kevinmacdonell @ 8:30 am

Sometimes I employ a practice in our Phonathon program simply because my gut says it’s gotta work. Some things just seem so obvious that it doesn’t seem worthwhile testing them to prove my intuition is valid. And like a lot of people who work in Annual Giving, I like to flatter myself that I can make a non-engaged alum give just by making shrewd tweaks to the program.

It turns out that I am quite wrong. I am thinking about a practice that seems to be part of the Phonathon gospel of best practices. I firmly believed in it, and I got serious about using it this fall. As the song says, though, it ain’t necessarily so.

When possible, I am pairing up student callers with alumni whose degree is in the same faculty of study. If I have business students in the calling room, for example, I’ll assign them alumni with degrees associated with the Faculty of Management. A grad with a BSc majoring in chemistry, meanwhile, will get a call from a student majoring in one of the sciences, rather than arts or business. It’s not perfect: There are too many degree programs, current and historic, for me to get any more specific than the overall faculty, but at least it increases the chance that student and alum will have something in common to talk about.

It’s easy to see why this ought to work. When speaking with young alumni, callers are somewhat more likely to have had certain professors or classes in common, and their interests may be aligned — for example, the alum might be able to provide the student with a glimpse into the job market that awaits. With older alumni, the callers might at least know the campus and buildings that alumni of the past inhabited just as they do today. If alumni feel so inclined, the conversation might even lead to a discussion about life and career.

These would be meaningful conversations, the kind of connection we hope to achieve on the phone. Just that much, even without a gift (this year), would be a desirable result.

On the other hand … if faculty pairings really lead to longer, better-quality conversations, would we not expect that faculty-paired conversations would, on average, result in more gifts than non-paired conversations? In the long run, is that not our goal? If it makes no difference who asks whom, then why complicate things?

First let me say that I embarked on this analysis fully expecting that the data would demonstrate the effectiveness of faculty-paired conversations. I might be a data guy, but I am not unbiased! I really hoped that my intervention would actually produce results. Allow me to admit that I was quite disappointed by what I found.

Here’s what I did.

Last year, I did not employ faculty pairings. We made caller assignments based on prospects’ donor status (LYBUNT, SYBUNT, etc.), but not faculty. I don’t know how our automated software distributes prospects to callers, but I am comfortable saying that, with regards to the faculty of preferred degree, the distribution to callers was random. This more or less random assignment by faculty allowed me to compare “paired” conversations with “unpaired” conversations, to see whether one was better than the other with regards to length of call, participation rate, and average pledge.

I dug into the database of our automated calling application and I pulled a big file with every single call result for last year. The file included the caller’s ID, the length of the call in seconds, the last result (Yes Pledge, No Pledge, No Answer, Answering Machine, etc. etc.), and the pledge amount (if applicable).

Then I removed all the records that did not result in an actual conversation. If the caller didn’t get to speak to the prospect, faculty pairing is irrelevant. I kept any record that ended in Yes Pledge (specified-dollar pledge or credit card gift), Maybe (unspecified pledge), No Pledge, or a request to be put on the Do Not Call list.

I added two more columns (variables) of data: The faculty of the caller’s area of study, and the faculty of the prospect’s preferred degree. Because not all of our dozen or so faculties are represented in our calling room, I then removed all the records for which no pairing was possible. For example, because I employed no Law or Medicine students, 100% of our Law and Medicine alumni would end up on the “non-paired” side, which would skew the results.

As well, I excluded calls with call lengths of five seconds or less. It is doubtful callers would have had enough time to identify themselves in five seconds or less — therefore those calls do not qualify as conversations.

In the end, my data file for analysis contained the results of 6,500 conversations for which a pairing was possible. Each prospect record, each conversation, could have one of two states: ‘Paired’ or ‘Unpaired’. About 1,500 conversations (almost 22%) were paired, as assigned at random by the calling software.

I then compared the Paired and Unpaired groups by talk time (length of call in seconds), participation, and size of pledge.
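
For anyone curious about the mechanics, here is roughly what that preparation and comparison look like in Python with pandas. The column names and result codes are stand-ins for whatever your calling software exports, but the steps match what I described above.

```python
import pandas as pd

# Stand-in export: one row per call attempt, with 'caller_faculty',
# 'prospect_faculty', 'call_length_sec', 'result' and 'pledge_amount'.
calls = pd.read_csv("call_results.csv")

# Keep only genuine conversations, and drop calls of five seconds or less
conversation_results = {"Yes Pledge", "Maybe", "No Pledge", "Do Not Call"}
conv = calls[calls["result"].isin(conversation_results)]
conv = conv[conv["call_length_sec"] > 5]

# Drop prospects from faculties with no callers in the room, so pairing is possible
faculties_in_room = set(conv["caller_faculty"].unique())
conv = conv[conv["prospect_faculty"].isin(faculties_in_room)].copy()

# Flag paired conversations and compare the two groups
conv["paired"] = conv["caller_faculty"] == conv["prospect_faculty"]
summary = conv.groupby("paired").agg(
    conversations=("result", "size"),
    avg_talk_time=("call_length_sec", "mean"),
    participation=("result", lambda r: (r == "Yes Pledge").mean()),
    median_pledge=("pledge_amount", "median"),
)
print(summary)
```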

1. Talk time

Better rapport-building on the phone implies longer, chattier calls. According to the table, “paired” calls are indeed longer on average, but not by much. A few seconds maybe.

2. Participation rate

The donor rates you see here are affected by all the exclusions, especially that of some key faculties. However, it’s the comparison we’re interested in, and the results are counter-intuitive. Non-paired conversations resulted in a slightly higher rate of participation (number of Yes Pledges divided by total conversations).

3. Average and median pledge

This table is also affected by the exclusion of a lot of alumni who tend to make larger pledges. Again, though, the point is that there is very little difference between the groups in terms of the amount given per Yes pledge.
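
If you want to check whether a gap like the one in participation rate is anything more than noise, a simple two-proportion test does the job. Here is a sketch; the counts below are placeholders, not my actual results.

```python
from statsmodels.stats.proportion import proportions_ztest

# Placeholder counts, not my actual results: Yes pledges and total
# conversations for the paired and unpaired groups.
yes_pledges = [210, 740]       # paired, unpaired
conversations = [1500, 5000]   # paired, unpaired

z_stat, p_value = proportions_ztest(count=yes_pledges, nobs=conversations)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")  # a large p-value means the gap could easily be chance
```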

The differences between the groups are not significant. Think about the range of values your callers get for common performance metrics (pledge rate, credit card rate, talk time, and so on). There are huge differences! If you want to move the yardsticks in your program, hire mature, friendly, chatty students who love your school and want to talk about it. Train them well. Keep them happy. Reward them appropriately. Retain them year over year so they develop confidence. These are the interventions that matter. Whom they are assigned to call doesn’t matter nearly as much.

Over and above that, pay attention to what matters even more than caller skills: The varying level of engagement of individual alumni. Call alumni who will answer the phone. Call alumni who will give a gift. Stop fussing over the small stuff.

You know what, though? Even faced with this evidence, I will probably continue to pair up students and alumni by faculty. First of all, the callers love it. They say they’re having better conversations, and I’m inclined to believe them. It’s not technically difficult to match up by faculty, so why not? As well, there might be nuances that I overlooked in my study of last year’s data. Maybe the faculty pairings are too broad. (Anytime you find economics in the same faculty as physics, you have to wonder how some people define Science. A discussion for someone else’s blog, perhaps.)

But my study has cast doubt on the usefulness of going to any great length to target alumni by faculty. For example, should I try hard to recruit a student caller from Law or Medicine just to get better results from alumni in those faculties? Probably not.

Finally, I caution readers not to interpret my results as being generally applicable. I’m not saying that faculty pairing as a best practice is invalid. You need to determine for yourself whether a practice is part of your core strategy, or just a refinement, or completely useless. As I opined in my previous post (Are we too focused on trivia?), I suspect a lot of Annual Fund professionals aren’t making these distinctions.

The answers are in the data. Go find them.

28 September 2011

Are we too focused on trivia?

Filed under: Annual Giving, Best practices, Front-line fundraisers — kevinmacdonell @ 7:21 am

As Annual Fund professionals we like to think that the precise details of our approach to prospective donors make a difference in our rate of success. Some of our practices make so much sense, and are so in tune with our instincts, that it seems absurd to bother testing them. Sometimes, though, a look at the data reveals that our carefully crafted techniques aimed at engaging, convincing and converting make little or no difference.

At least, they make little difference when compared to what really matters: The emotions, opinions and feelings that would-be donors have when they think of our institution, organization or cause.

We should not be surprised that these feelings and emotions are not significantly influenced by whether we pay postage on the return envelope, or have Dr. So-and-So sign the letter, or many other, similar nuances that are the subject of the bulk of discussions on listservs that deal with Annual Fund and fundraising in general.

Yes, there are right and wrong ways to communicate with donors and would-be donors, but on the whole we have a hard time distinguishing between meaningful practices and mere refinements. We tinker with our letterhead, our brand, our scripts. We keep changing the colour of our sails in hopes the ship will go faster.

What is the non-trivial work we need to do? We need to get a whole lot better at identifying who likes us, and pay attention only to them. If they like us a lot, we need to ask them, thank them, upgrade them, stay with them on the journey — as all our fundraising experience and human instincts guide us to do. If they like us a little, perhaps we can do something to engage them. If they are indifferent, we must simply walk away.

That does not mean we should pay attention only to donors: There are all kinds of people who haven’t given, but will someday. They reveal their affinity in ways that most fundraisers don’t take into account. And among donors, these clues regarding affinity help define the donor who is ready to give much more, or remain loyal for a lifetime, or even leave a bequest.

I’m as guilty as anyone else. There are things I do in my Phonathon program only because they make strong intuitive sense and have no basis in the evidence of results. In my next post, I will give an example of a Phonathon “best practice” which seems beyond reproach but which (according to my data) has absolutely no effect on participation or pledge amount. I was surprised by what I learned from my study, and I think you’ll be too.

16 June 2011

Benchmark or bust

Filed under: Best practices — kevinmacdonell @ 8:40 am

About a week ago I did something I’ve rarely done before. I said a “No” in response to a request for information.

The request was legitimate, and not at all unusual. My counterpart at another university was gathering Phonathon performance numbers from various universities, and she wanted to see ours. Initially I agreed, and so she sent me a detailed questionnaire. In the end, though, my response was, “Thanks, but no thanks.”

Normally my bias is to be as helpful and “sharing” as possible. Why the brick wall all of a sudden? Well, I had asked the person distributing the questionnaire if the participants would be getting a summary of responses; I didn’t need to know who the participants were, I just wanted the results. The answer was sorry, but no — they hadn’t arranged permissions with other participants prior to gathering their information. She could provide her own institution’s numbers, but no one else’s.

I hate to be a party pooper, but count me out.*

I’m picking on this recent and innocent requester only as an example of what I find is becoming an issue. It’s not just that we are sometimes asked to fill out questionnaires. It’s that every day, you and I and everyone else is asked to chime in on one more “quick survey”. The listservs are full of them. A lot of inter-university surveying goes on, especially at this time of year. How much of it is of any value to the programs that offer up the information, or of lasting value to the ones who receive it? Not much, I’d say.

“What was your participation rate this year?” That’s a common one on the listservs. I never respond to those queries because immediately all I have are more questions: How do you calculate participation? What’s the composition of the population you called? And so on. It’s not worth starting the discussion, because invariably the asker wants quick and dirty answers.

In place of these informal, one-off surveys, even the carefully thought-out ones, I would prefer to see more effort put into proper benchmarking. First, everyone who takes the trouble to provide their numbers ought to receive the benefit of a set of results (anonymous if need be). And second, one-off questionnaires tend to gloss over important differences between institutions, programs, and appeals, leading to invalid observations and comparisons.

The discipline imposed by benchmarking forces everyone to agree on exact definitions of terms and the exact methods of calculating the metrics — otherwise the results are not comparable. Even apparently self-evident terms such as “alumnus”, “donor”, or “acquisition” are devilishly hard to standardize across institutions.
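
Participation rate alone is a good illustration. The same program can truthfully report two quite different numbers depending on which denominator it uses; the figures below are invented, but the divergence is the point.

```python
# Invented numbers -- the point is only that the definitions diverge.
donors = 4_200
alumni_of_record = 35_000        # every living alum in the database
solicitable_alumni = 23_000      # reachable and not coded do-not-solicit

rate_broad = donors / alumni_of_record        # about 12%
rate_narrow = donors / solicitable_alumni     # about 18%
print(f"{rate_broad:.1%} vs {rate_narrow:.1%}")
```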

The standardization of data makes benchmarking hard, and slow to get done. No wonder some prominent analytics vendors sell program benchmarking as a service. So I get it: Sometimes we just want a rough answer, often only to satisfy someone higher-up. That doesn’t make it right.

We can fill out each other’s questionnaires and respond to quickie listserv surveys all day, but at some point we need to conserve our limited time for something more useful. Share? Yes, but share smarter. From now on, your first question should be about how the results will be distributed to participants. That’s not even full-on benchmarking, just sharing back. If there’s no plan, then do yourself a favour and decline.

* P.S.: In this case, the requester went back to the participants and arranged to share the information with everyone, and I relented.

