CoolData blog

20 August 2012

Logistic regression vs. multiple regression

Filed under: John Sammis, Model building, Peter Wylie, predictive modeling, regression, Statistics — kevinmacdonell @ 5:13 am

by Peter Wylie, John Sammis and Kevin MacDonell

(Click to download printer-friendly PDF: Logistic vs MR-Wylie Sammis MacDonell)

The three of us talk about this issue a lot because we encounter a number of situations in our work where we need to choose between these two techniques. Many of our late night/early morning phone/internet discussions have been gobbled up by talking about which technique seems to be better under what circumstances. More than a few times, I’ve suggested we write something up about our experience with both techniques. In the end we’ve always decided to put off doing that because … well, because we’ve thought it might put a lot of people to sleep. Disagree as we might about lots of things, we’re of one mind on the dictum: “Don’t bore people.” They have enough tedious stuff in their lives; we don’t need to add to their burden.

On the other hand, as analytics has started to sink its teeth more and more into the world of advancement, it seems there is a group of folks out there who wrestle with the same issue. And the issue seems to be this:

“If I have a binary dependent variable (e.g., major giver/ non major giver, volunteer/non-volunteer, reunion attender/non-reunion attender, etc.), which technique should I use? Logistic regression or multiple regression?”

We considered a number of ways to try to answer this question:

  • We could simply assert an opinion based on our bank of experience with both techniques.
  • We could show you the results of a number of data sets using both techniques and then offer our opinion.
  • We could show you a way to compare both techniques using some of your own data.

We chose the third option because we think there is no better way to learn about a statistical technique than by using the technique on real data. Whenever we’ve done this sort of exploring ourselves, we’ve been humbled by how much we’ve learned.

Before we show you a way to compare the two techniques, we’ll offer some thoughts on why this question (“Should I use logistic regression or multiple regression?”) is so tough to find an answer to. If you’re anxious to move on to our comparison process, you can skip this section. But we hope you don’t.

Why This Is Not an Easy Question to Find an Answer To

We see at least two reasons why this is so:

  • Multiple regression has lived in the neighborhood a long time; logistic regression is a new kid on the block.
  • The articles and books we’ve read on comparisons of the two techniques are hard to understand.

Multiple regression is a longtime resident; logistic regression is a new kid on the block.

When World War II came along, there was a pressing need for rapid ways to assess the potential of young men (and some women) for the critical jobs that the military services were trying to fill. It was in this flurry of preparation that multiple regression began to see a great deal of practical application by behavioral scientists who had left their academic jobs and joined up for the duration. The theory behind multiple regression had been worked out much earlier in the century by geniuses like Ronald Fisher, Karl Pearson, and Harold Hotelling. But the method did not get much use until the war effort necessitated that use. The computational effort involved was just too forbidding.

Logistic regression is a different story. From the reading we’ve done, logistic regression got its early practical use in the world of medicine where biostatisticians were trying to predict binary outcomes like survived/did not survive, contracted disease/did not contract disease, had a coronary event/did not have a coronary event, and the like. It’s only been within the last fifteen or twenty years that logistic regression has found its way into the parlance of statisticians in the behavioral sciences.

These two paragraphs are a roundabout way of saying that, in our opinion, logistic regression is nowhere near as well vetted as multiple regression by people like us in advancement who are interested in predicting behavior, especially giving behavior.

The articles and books we’ve read on comparisons of the two techniques are hard to understand.

Since I (Peter) was pushing to do this piece, John and I decided it would be my responsibility to do some searching of the more recent literature on logistic regression as it relates to the substance of this project.

To start off, I reread portions of texts I have accumulated over the years that focus on multiple regression as a general data analytic technique. Each text has a section on logistic regression. As I waded back into these sections, I asked myself: “Is what I’m reading here going to enlighten more than confuse the folks we have in mind for this piece?” Without exception, my answer was, “Nope, just the reverse.” There was altogether too much focus on complicated equations and theory and nowhere near enough emphasis on the practical use of logistic regression. (This, in spite of the fact that each text had an introduction assuring us the book would go light on math and heavy on application.)

Then, using my trusty iPad, I set about seeing what I could find on the web. Not surprisingly, I found a ton of articles (and even some full length books) that had found their way into the public domain. I downloaded a bunch of them to read whenever I could find enough time to dig into them. I’m sorry to report that each time I’d give one of these things a try, I would hear my father’s voice (dad graduated third in his class in engineering school) as he paged through my own science and math texts when I was in college: “They oughta teach the clowns who wrote these things to write in plain English.” (I always tried to use such comments as excuses for bad grades. Never worked.)

Levity aside, it is hard to find clearly written articles or books on the use of logistic versus multiple regression in the behavioral sciences. I think it’s a bad situation that needs fixing, but that fixing won’t occur anytime soon. On the other hand, I think dad was right not to let me off easy for giving up on badly written material. And you shouldn’t let my pessimism dissuade you from trying out some of these same articles and books. (If enough of you are interested, perhaps Kevin and John and I can put together a list of suggested readings.)

A Way to Compare Logistic Regression with Multiple Regression

As promised, we’ll take you through a set of steps you can use with some of your own data:

  1. Pick a binary dependent variable and a set of predictors.
  2. Compute a predicted probability value for every record in your sample using both multiple regression and logistic regression.
  3. Draw three random subsamples of 20 records each from the total sample so that each subsample includes the predicted multiple regression probability value and the predicted logistic regression probability value for every record.
  4. Display each subsample of these records in a table and a graph.
  5. Do an eyeball comparison of the probability values in both the tables and the graphs.

1. Pick a binary dependent variable and a set of predictors.

For this example, we used a private four-year institution with about 13,000 solicitable alums. Here are the variables we chose:

Dependent variable. Each alum who had given $31 or more lifetime was defined as 1; all others, who had given less than that amount, were defined as 0. There were 6,293 0’s and 6,204 1’s, just about an even fifty/fifty split.

Predictor variables:

  • CLASS YEAR
  • SQUARE OF CLASS YEAR
  • EMAIL ADDRESS LISTED (YES/NO, 1=YES, 0=NO)
  • MARITAL STATUS (SINGLE =1, ALL OTHERS=0)
  • HOME PHONE LISTED (YES/NO, 1=YES, 0=NO)
  • UNIQUE ID NUMBER

Why did we use ID number as one of the predictors? Over the years we’ve found that many schools use all-numeric ID numbers. When these numbers are entered into a regression analysis, they often work as predictors. More importantly, they help to create very granular predicted scores that can easily be binned into equal size groups.

2. Compute a predicted probability value for every record in your sample using both multiple regression and logistic regression.

This is where things start to get a bit technical and where a little background reading on both multiple regression and logistic regression wouldn’t hurt. Again, most of the material you’ll find will be tough to decipher. Here we’ll keep it as simple as we can.

For both techniques the predicted value you want to generate is a probability, a number that varies between 0 and 1.  In this example, that value will represent the probability that a record has given $31 or more lifetime to the college.

Now here’s the rub: the logistic regression model will always generate a probability value that varies between 0 and 1. However, the multiple regression model will almost always generate a value that varies between something less than 0 (a negative number) and a number greater than 1. In fact, in this example the range of probability values for the logistic regression model extends from .037 to .948. The range of probability values for the multiple regression model extends from -.122 to 1.003.

(By the way, this is why so many statisticians advise the use of logistic regression over multiple regression when the dependent variable is binary. In essence they are saying, “A probability value can’t exceed 1 nor can it be less than 0. Since multiple regression often yields values less than 0 and greater than 1, use logistic regression.” To be fair, we’re exaggerating a bit, but not very much.)
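If you want to try this step on your own data, here is a minimal sketch of how the two sets of predicted probabilities might be computed. It assumes Python with pandas and statsmodels rather than the stats software we actually use, and the file and column names are purely hypothetical.

```python
# A minimal sketch of step 2, assuming Python with pandas and statsmodels.
# The file name and column names (lifetime_giving, class_year, etc.) are
# hypothetical.
import pandas as pd
import statsmodels.api as sm

alums = pd.read_csv("alums.csv")  # hypothetical extract of the ~13,000 records

# Binary dependent variable: 1 if lifetime giving is $31 or more, else 0
alums["gave_31_plus"] = (alums["lifetime_giving"] >= 31).astype(int)

# Predictors from the example: class year, its square, three yes/no flags,
# and the all-numeric ID number
alums["class_year_sq"] = alums["class_year"] ** 2
predictors = ["class_year", "class_year_sq", "email_listed",
              "single", "home_phone_listed", "id_number"]
X = sm.add_constant(alums[predictors])
y = alums["gave_31_plus"]

# Multiple (ordinary least squares) regression: predicted values can fall
# below 0 or above 1
mr_model = sm.OLS(y, X).fit()
alums["mr_prob"] = mr_model.predict(X)

# Logistic regression: predicted probabilities always stay between 0 and 1
logit_model = sm.Logit(y, X).fit()
alums["logit_prob"] = logit_model.predict(X)

print(alums[["mr_prob", "logit_prob"]].describe())
```

Both models take the same dependent variable and the same predictors and return a predicted value for every record; the ranges of those values are where the two techniques differ.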

3. Draw three random subsamples of 20 records each from the total sample so that each subsample includes the predicted multiple regression probability value and the predicted logistic regression probability value for all 20 records.

The size and number of these subsamples is, of course, arbitrary. We decided that three subsamples were better than two and that four or more would be overkill. Twenty records, as you’ll see a bit further on, is a number that allows you to see patterns in a table or graph without overcrowding the picture.

4. Display each subsample of these records in a table and a graph.

Tables 1-3 and Figures 1-3 below show how we took this step for our example. To make sure we’re being clear, let’s go through some of the details in Table 1 and Figure 1 (which we constructed for the first subsample of twenty randomly drawn records).

In Table 1 the probability values for multiple regression for each record are displayed in the left-hand column. The corresponding probability values for the same records for logistic regression are displayed in the right-hand column. For example, the multiple regression probability for the first record is .078827109. The record’s logistic regression probability is .098107437. In plain English, that means the multiple regression model for this example is saying that this particular alum has about eight chances in a hundred of giving $31 or more lifetime. The logistic regression model is saying that the same alum has about ten chances in a hundred of giving $31 or more lifetime.

Table 1: Predicted Probability Values Generated from Using Multiple Regression and Logistic Regression for the First of Three Randomly Drawn Subsamples of 20 Records

Figure 1 shows the pairs of values you see in Table 1 displayed graphically in a scatterplot. You’ll notice that the points in the scatterplot appear to fall along what roughly looks like a straight line. This means that the multiple regression model and the logistic regression model are assigning very similar probabilities to each of the 20 records in the subsample. If you study Table 1, you can see this trend, but the trend is much easier to discern in the scatter plot.
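If you’d like to produce the same kind of table and scatterplot from your own data, steps 3 and 4 might look something like this, continuing the hypothetical sketch from step 2 (same assumed data frame and column names):

```python
# Steps 3 and 4, continuing the sketch above: draw three random subsamples
# of 20 records each and plot the paired probabilities for each one.
import matplotlib.pyplot as plt

for i, seed in enumerate((1, 2, 3), start=1):
    sub = alums.sample(n=20, random_state=seed)

    # The table: multiple regression probabilities beside logistic ones
    print(f"Subsample {i}")
    print(sub[["mr_prob", "logit_prob"]].to_string(index=False))

    # The graph: a scatterplot of the paired values
    plt.figure()
    plt.scatter(sub["mr_prob"], sub["logit_prob"])
    plt.xlabel("Multiple regression probability")
    plt.ylabel("Logistic regression probability")
    plt.title(f"Subsample {i}: paired predicted probabilities")

plt.show()
```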

Table 2: Predicted Probability Values Generated from Using Multiple Regression and Logistic Regression for the Second of Three Randomly Drawn Subsamples of 20 Records

Table 3: Predicted Probability Values Generated from Using Multiple Regression and Logistic Regression for the Third of Three Randomly Drawn Subsamples of 20 Records

 

5. Do an eyeball comparison of the probability values in both the tables and the graphs.

We’ve already done such a comparison in Table 1 and Figure 1. If we do the same comparison for Tables 2 and 3 and for Figures 2 and 3, it’s pretty clear that we’ll come to the same conclusion: Multiple regression and logistic regression (for this example) are giving us very similar answers.

So Where Does This All Take Us?

We’d like to cover several topics in this closing section:

  • A frequent objection to using multiple regression versus logistic regression when the dependent variable is binary
  • Trying our approach on your own
  • The conclusion we think you’ll eventually arrive at
  • How we’ve just scratched the surface here

A frequent objection to using multiple regression versus logistic regression when the dependent variable is binary

Earlier we said that many statisticians seem to advise the use of logistic regression over multiple regression by invoking this logic: “A probability value can’t exceed 1 nor can it be less than 0. Since multiple regression often yields values less than 0 and greater than 1, use logistic regression.” We also said we were exaggerating the stance of these statisticians a bit (but not very much).

While we can understand this argument, our feeling is that, in the applied fields we toil in, that argument is not a very practical one. In fact a seasoned statistics professor we know says (in effect): “What’s the big deal? If multiple regression yields any predicted values less than 0, consider them 0. If multiple regression yields any values greater than 1, consider them 1. End of story.” We agree.
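In code, that fix amounts to a single line. A sketch, continuing with the hypothetical column names used earlier:

```python
# Clamp the multiple regression predictions to the 0-1 range: anything
# below 0 becomes 0, anything above 1 becomes 1.
alums["mr_prob_clamped"] = alums["mr_prob"].clip(lower=0, upper=1)
```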

Trying our approach on your own

In this piece we’ve shown the results of one comparison between multiple and logistic regression on one set of data. It’s clear that the results we got for the two techniques were very similar. But does that mean we’d get such similar results with other examples? Not necessarily.

So here’s what we’d recommend. Try doing your own comparisons of the two techniques with:

  • Different data sets. If you’re a higher education institution, you might pick a couple of data sets, one for alums who’ve been out for more than 25 years and one for folks who’ve been out less than 10 years. If you’re a non-profit, you can use a set of members from the west coast and one from the east coast.
  • Different variables. Try different binary dependent variables like those we mentioned earlier: major giver/non major giver, volunteer/non-volunteer, reunion attender/non-reunion attender, etc. And try different predictors. Try to mix categorical variables like marital status with quantitative variables like age. If you’re comfortable with more sophisticated stats, try throwing in cross products and exponential terms.
  • Different splits in the dependent variable. In our example, the split on the dependent variable was almost exactly 50/50. Since the underlying variable we used was quantitative (lifetime giving), we could have adjusted that split in a number of ways: 60/40, 75/25, 80/20, 95/5, and on and on the list could go. Had we tried these different kinds of splits, would we have gotten the same kinds of results for the two techniques? Since we actually did look at different splits like these, we can report that the results for both techniques were pretty much the same. But that’s for this example. That could change with a different data set and different variables.

The conclusion we think you’ll eventually arrive at

We’re very serious about having you compare multiple regression and logistic regression on a variety of data sets with a variety of variables and with different splits in the dependent variable. If you do, you’ll learn a ton. Guaranteed.

On the other hand, if we put ourselves in your shoes, it’s easy to imagine your saying, “Come on guys. I’m not gonna do that. Just tell me what you think about which technique is better when the dependent variable is binary. Pick a winner.”

Given our experience, we can’t pick a winner. In fact, if pushed, we’re inclined to opt in favor of multiple regression for a couple of reasons. It not only seems to perform about as well as logistic regression, but more importantly (with the stats software we use) multiple regression is simply faster and easier to use than logistic regression. But we still use logistic regression for models with binary dependent variables. And we continue to compare its efficacy against multiple regression when we can. And we rarely see a meaningful difference between the results.

Why do we still use both modeling techniques? Because we think taking a hard and fast stance when you’re doing applied science is not a good idea. Too easy to end up with egg on your face. Our best advice is to use whichever method is most familiar and readily available to you.

As always, we welcome your comments and reactions. Maybe even more so with this one.

8 May 2012

Emerson’s big data

Filed under: Model building, Off on a tangent — Tags: — kevinmacdonell @ 11:35 am

One day in late March I got on a plane from Toronto (where I attended Annual Fund benchmarking meetings hosted by Target Analytics) to Las Vegas (for the Sungard Higher Education Summit), and picked up the Toronto Globe & Mail. I scanned a section that offered some ephemera, including the startling news that my fellow countryman William Shatner had turned 81. Once I got over that shock, I read the Globe’s “Thought du jour,” a quote from Ralph Waldo Emerson.

Because I’m an admirer of Emerson, and because I figured I could appropriate his quote for my own selfish purposes, I scribbled it down:

“The world can never be learned by learning all its details.”

Emerson did not live in the age of big data. But in a way, the world he experienced — the world we all experience through our senses — IS big data. We don’t perceive our surroundings directly, but only through our brain’s interpretations of sense impressions. We navigate the world via mental models of our own creation. These models leave out nearly everything. They are not reality, no more than a map of a city is faithful to the reality of the city, or than our memory of an event is faithful to the details of the event (which would overwhelm us every time it came to mind).

In our work with data, we measure things (or their proxies) in order to get a handle on them and in order to gain insight. We lose most of the detail in the process, but we need to in order to learn something. We build models based on general patterns. As George E.P. Box said: “All models are wrong, but some are useful.”

18 April 2012

Stepwise, model-foolish?

Filed under: Model building, Pitfalls, regression, Software, Statistics — Tags: , — kevinmacdonell @ 8:00 am

My approach to building predictive models using multiple linear regression might seem plodding to some. I add predictor variables to the regression one by one, instead of using stepwise methods. Even though the number of predictor variables I use has greatly increased, and the time needed to build a model has lengthened, I am even less likely to use stepwise regression today than I was a few years ago.

Stepwise regression, available in most stats software, tosses all the predictor variables into the analysis at once and picks the best for you. It’s a semi-automated process that can work forwards or backwards, adding or deleting variables until it’s satisfied a statistical rule of thumb. The software should give you some control over the process, but mostly your computer is making all the big decisions.

I understand the allure. We’re all looking for ways to save time, and generally anything that automates a repetitive process is a good thing. Given a hundred variables to choose from, I wouldn’t be surprised if my software was able to get a better-fitting model than I could produce on my own.

But in this case, it’s not for me.

Building a decent model isn’t just about getting a good fit in terms of high R square. That statistic tells you how well the model fits the data that the model was built on — not data the model hasn’t yet seen, which is where the model does its work (or doesn’t). The true worth of the model is revealed only over time, but you’re more likely to succeed if you’ve applied your knowledge and judgement to variable selection. I tend to add variables one by one in order of their Pearson correlation with the target variable, but I am also aware of groups of variables that are highly correlated with each other and likely to cause issues. The process is not so repetitive that it can always be automated. Stepwise regression is more apt to select a lot of trivial variables with overlapping effects and ignore a significant predictor that I know will do the job better.
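To make the contrast with stepwise methods concrete, here is a rough sketch of what that one-at-a-time approach can look like in code. It assumes Python with pandas and statsmodels, the variable names are hypothetical, and it is an illustration rather than my exact workflow.

```python
# A rough sketch of adding predictors by hand, one at a time, in order of
# their Pearson correlation with the target. Variable names are hypothetical.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("alumni_model_data.csv")  # hypothetical extract
target = data["lifetime_giving"]
candidates = ["email_listed", "event_count", "class_year",
              "home_phone_listed", "single"]

# Rank the candidates by absolute Pearson correlation with the target
ranked = data[candidates].corrwith(target).abs().sort_values(ascending=False)
print(ranked)

# Add them one by one, refitting and inspecting the model at each step
kept = []
for var in ranked.index:
    trial = kept + [var]
    model = sm.OLS(target, sm.add_constant(data[trial])).fit()
    print(f"\nAfter adding {var}: adjusted R-squared = {model.rsquared_adj:.3f}")
    print(model.params)  # inspect coefficients before deciding to keep it
    kept = trial  # in practice, keep the variable only if it earns its place
```

The loop ranks the candidates by correlation, but the decision to keep or drop each variable still rests with the modeller looking at the output.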

Or so I suspect. My avoidance of stepwise regression has always been due to a vague antipathy rather than anything based on sound technical concerns. A collection of thoughts I came across recently lent some justification to this undefined feeling: Problems with stepwise regression. Some of the authors’ concerns are indeed technical, but the ones that resonated the most for me boiled down to this: Automated variable selection divorces the modeller from the process, so that he or she is less likely to learn things about the data. It’s just not as much fun when you’re not making the selections yourself, and you’re not getting a feel for the relationships in your data.

Stepwise regression may hold appeal for beginning modellers, especially those looking for push-button results. I can’t deny that software for predictive analysis is getting better and better at automating some of the most tedious aspects of model-building, particularly in preparing and cleaning the data. But for any modeller, especially one working with unfamiliar data, nothing beats adding and removing variables one at a time, by hand.

15 July 2011

Answering questions about “How many times to keep calling”

Filed under: Annual Giving, John Sammis, Model building, Peter Wylie, Phonathon, regression — kevinmacdonell @ 8:27 am

The recent discussion paper on Phonathon call attempts by Peter Wylie and John Sammis elicited a lot of response. There were positive responses. (“Well, that’s one of the best things I’ve seen in a while. I’m a datahead. I admit it. Thank you for sharing this.”) There were also many questions, maybe even a little skepticism. I will address some of those questions today.

Question: You discuss modeling to determine the optimum number of times to call prospects, but what about the cost of calling them?

A couple of readers wanted to know why we didn’t pay any attention to the cost of solicitation, and therefore return on investment. Wouldn’t it make sense to cut off calling a segment once “profitability” reached some unacceptably low point?

I agree that cost is important. Unfortunately, cost accounting can be complicated even within the bounds of a single program, let alone compared across institutions. In my own program, money for student wages comes from one source, money for technology and software support comes from another, while regular expenses such as phone and network charges are part of my own budget. If I cannot realize efficiencies in my spending and reallocate dollars to other areas, does it make sense to include them in my cost accounting? I’m not sure.

And is it really a matter of money? I would argue that the budget determines how many weeks of calling are possible. Therefore, the limiting factor is actually TIME. Many (most?) phone programs do little more than call as many people as possible in the time available. They call with no regard for prospects’ probability of giving (aside from favouring LYBUNTs), spreading their limited resources evenly over all prospects — that is, suboptimally.

The first step, then, is to spend more time calling prospects who are likely to answer the phone, and less time calling prospects who aren’t. ROI is important, but if you’re not segmenting properly then you’re always going to end up simultaneously giving up on high-value prospects prematurely AND hanging on to low-value prospects beyond the limit of profitability.

Wylie and Sammis’s paper provides insight into a way we might intelligently manage our programs, mainly by showing a way to focus limited resources, and more generally by encouraging us to make use of the trove of data generated by automated calling software. Savvy annual fund folks who really have a handle on costs and want to delve into ROI as well should step up and do so — we’d love to see that study. (Although, I have to say, I’m not holding my breath.)

Question: Which automated calling software did these schools use?

The data samples were obtained from three schools who use the software of a single vendor, and participants were invited via the vendor’s client listserv. The product is called CampusCall, by RuffaloCODY. Therefore the primary audience of this paper could assume that Wylie and Sammis were addressing auto dialers and not predictive dialers or manual programs. This is not an endorsement of the product — any automated calling software should provide the ability to export data suitable for analysis.

By the way, manual calling programs can also benefit from data mining. There may be less call-result data to feed back into the modeling process than there would be in an automated system, but there is no reason why modeling cannot be used to segment intelligently in a manual program.

If you have a manual program and you’re calling tens of thousands of alumni — consider automating. Seriously.

Question: What do some of these “call result” categories mean?

At the beginning of the study, all the various codes for call results were divided into two categories, ‘contact made’ and ‘contact not made’. Some readers were curious about what some of the codes meant. Here are some of the codes that have meanings which are not obvious. None of these are contacts.

  • Re-assigned: The phone number has been reassigned to a new person. The residents at this phone number do not know the prospect you are attempting to reach.
  • FAX2: The call went to a fax, modem or data line for the second time — this code removes the number from more calling.
  • Hung up: Technically this is a contact, but often the caller doesn’t know whether it was the prospect who answered or someone else in the household, and often the phone is hung up before the caller can even introduce him- or herself. In those cases the encounter doesn’t meet the definition of a contact, which is an actual conversation with the prospect, so we didn’t count these as contacts.
  • Call back2: The prospect or someone else in the household asks to be called back some other time, but if this was the last result code, no future attempt was made. Not a contact.
  • NAO: Not Available One Hour. The prospect can’t come to the phone, call back in an hour — but obviously the callback didn’t happen, because NAO is still the last result.

Question: Why did you include disconnects and wrong numbers in your analysis? Wouldn’t you stop calling them (presumably after the first attempt), regardless of what their model score was? A controlled experiment would seem to call for leaving them out, and your results might be less impressive if you did so.

Good point. When a phone number proves invalid (as opposed to simply going to an answering machine or ringing without an answer), there’s no judgement possible about whether to place one more call to that number. Regardless of the affinity score, you’re done with that alum.

If we conducted a new study, perhaps we would exclude bad phone numbers. It’s my opinion that rerunning the analysis would be more of a refinement on what we’ve learned here, rather than uncovering something new. I think it’s up to the people who use this data in their programs to take this new idea and mine their own data in the light of it — and yes, refine it as well.

This was not a controlled experiment, by the way. This was a data-mining exploration that revealed a useful insight which, the authors hope, will lead to others digging into their own call centre data. True controlled experiments are hard to do — but wouldn’t it be great if fundraisers would collaborate with the experts in statistics and experimental design who teach on their own campuses?

Question: What modeling methods did you use? Did you compare models?

The paper made reference to multiple linear regression, which implies that the dependent variable is continuous. The reader wanted to know if the modeling method was actually logistic regression, or if two or more models were created and compared against a holdout sample.

The outcome variable was in fact a binary variable, “contact made”. Every prospect could have only two states (contacted / not contacted), because each person can be contacted only once. The result of a contact might be a pledge, no pledge, maybe, or “do not call” — but in any case, the result is binary.

(Only one model was created and there was no validation set, because this was more of an exploration to discover whether doing so could yield a model with practical uses, rather than a model built to be employed in a program.)

Although the DV was binary, the authors used multiple regression. A comparison of the two methods would be interesting, but Wylie and Sammis have found that when the splits for the dependent variable get close to 50/50 (as was the case here), multiple linear regression and logistic regression yield pretty much the same results. In the software package they use, multiple regression happens to be far more flexible than logistic, changes in the fit of the model as predictors are swapped in and out are more evident, and the output variable is easier to interpret.

Where the authors find logistic regression is superior to multiple regression is in building acquisition or planned giving models where the 0/1 splits are very asymmetric.

Question: Why did you choose to train the model on contacts made instead of pledges made?

Modeling on “contact made” instead of on “pledge made” is a bit novel. But that’s the point. The sticking point for Phonathon programs these days is simply getting someone to pick up the phone. If that’s the business problem to be solved, then (as the truism in data mining goes), that’s how the model should be focused. We see the act of answering the phone as a behaviour distinct from actually making a pledge. Obviously, they are related. But someone who picks up the phone this year and says “no” is still a better prospect in the long run than someone who never answers the call. A truly full-bodied segmentation for Phonathon would score prospects on both propensity to answer the phone and propensity to give — perhaps in a matrix, or using a multiplied score composed of both components.
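In code, a multiplied score of that kind is trivial to compute once both probabilities exist; a sketch with hypothetical file and column names:

```python
# A hypothetical combined Phonathon score: propensity to answer the phone
# multiplied by propensity to give, assuming both have already been modeled.
import pandas as pd

prospects = pd.read_csv("phonathon_scores.csv")  # hypothetical file of scores
prospects["combined_score"] = prospects["answer_prob"] * prospects["pledge_prob"]
print(prospects.sort_values("combined_score", ascending=False).head())
```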

Question: I don’t understand how you decided which years to include in the class year deciles. Was it only dividing into equal portions? That doesn’t seem right.

Yes, all the alumni in the sample were divided into ten roughly equal groups (deciles) in order by class year. There was no need to make a decision about whether to include a particular year in one decile or the other: The stats software made that determination simply by making the ten groups as equal as possible.

The point of that exercise was to see whether there was any general (linear) trend related to the age of alumni. In the study, the trend was not a straight line, but it was close enough to work well in the model — in general, the likelihood of answering the phone increases with age. Dividing the class years into deciles is not strictly necessary — it was done simply to make the relationship easier to find and explain. In practice, class year (or age) would be more likely to be placed into the regression analysis as-is, not as deciles.

BUT, Peter Wylie notes that the questioner has a point. Chopping ‘class year’ into deciles might not be the best option. For example, he says, take the first decile (the oldest alums) and the tenth decile (the youngest alums): “The range for the former can easily be from 1930-1968, while the range for the latter is more likely to be 2006-2011. The old group is very heterogeneous and the young group is very homogeneous. From the standpoint of clearly seeing non-linearity in the relationship between how long people have been out of school and giving, it would be better to divide the entire group up into five-year intervals.” The numbers of alumni in the intervals will vary hugely, but it also might become more apparent that the variable will need to be transformed (by squaring or cubing perhaps) before placing it into the regression.

Another question about class year came from a reader at an institution that is only 20 years old. He wanted to know if he could even use Class Year as a predictor. Yes, he can, even if it has a restricted range — it might still yield a roughly linear trend. There is no requirement to chop it into deciles.
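For readers who want to try both groupings on their own data, here is a small sketch of each, assuming Python with pandas and hypothetical column names:

```python
# Two ways to bin class year, as discussed above: ten roughly equal-size
# deciles, or fixed five-year intervals. Column names are hypothetical.
import pandas as pd

alums = pd.read_csv("alumni.csv")  # hypothetical extract

# Deciles: ten groups of roughly equal size, ordered by class year
# (duplicates="drop" guards against repeated bin edges from tied years)
alums["class_year_decile"] = pd.qcut(alums["class_year"], 10,
                                     labels=False, duplicates="drop") + 1

# Five-year intervals: equal-width bins with very unequal counts
start = int(alums["class_year"].min())
stop = int(alums["class_year"].max()) + 5
alums["class_year_5yr"] = pd.cut(alums["class_year"],
                                 bins=range(start, stop + 1, 5), right=False)

# Compare contact rates across the two groupings
print(alums.groupby("class_year_decile")["contact_made"].mean())
print(alums.groupby("class_year_5yr")["contact_made"].mean())
```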

A final word

The authors had hoped to hear from folks who write about the annual fund all the time (but never mention data driven decision making), or from the vendors of automated calling software themselves. Both seem especially qualified to speak on this topic. But so far, nothing.

21 June 2011

How many times to keep calling?

Guest post by Peter Wylie and John Sammis

(Click to download a printer-friendly .PDF version here: NUMBER OF ATTEMPTS 050411)

Since Kevin MacDonell took over the phonathon at Dalhousie University, he and I have had a number of discussions about the call center and how it works. I’ve learned a lot from these discussions, especially because Kevin often raises intriguing questions about how data analysis can make for a more efficient and productive calling process.

One of the questions he’s concerned with is the number of call attempts it’s worth making to a given alum. That is, he’s asking, “How many attempts should my callers make before they ‘make contact’ with an alum and either get a pledge or some other voice-to-voice response – or they give up and stop calling?”

Last January Kevin was able to gather some calling data from several schools that may, among other things, offer the beginnings of a methodology for answering this question. What we’d like to do in this piece is walk you through a technique we’ve tried, and we’d like to ask you to send us some reactions to what we’ve done.

Here’s what we’ll cover:

  1. How we decided whether contact was made (or not) with 41,801 alums who were recently called by the school we used for this exercise.
  2. Our comments on the percentage of contacts made and the pledge money raised for each of eight categories of attempts: 1, 2, 3, 4, 5, 6, 7, and 8 or more.
  3. How we built an experimental predictive model for the likelihood of making contact with a given alum.
  4. How we used that model to see when it might (and might not) make sense to keep calling an alum.

Deciding Whether Contact Was Made

John Sammis and I do tons of analyses on alumni databases, but we’re nowhere near as familiar with call center data as Kevin is. So I asked him to take a look at the table you see below that shows the result of the last call made to almost 42,000 alums. Then I asked, “Kevin, which of these results would you classify as contact made?”

Table 1: Frequency Percentage Distribution for Results of Last Call Made to 41,801 Alums

He said he’d go with these categories:

  • ALREADY PLEDGED
  • NO PLEDGE
  • NO SOLICIT
  • REMOVE LIST
  • SPEC PLDG (i.e., Specified Pledge)
  • UNSP PLDG  (i.e., Unspecified Pledge)
  • DO NOT CALL

Kevin’s reasoning was that, with each of these categories, there was a final “voice to voice” discussion between the caller and the alum. Sometimes this discussion had a pretty negative conclusion. If the alum says “do not call” or “remove from list” (1.13% and 0.10% respectively), that’s not great. “No pledge” (29.72%) and “unspecified pledge” (4.15%) are not so hot either, but at least they leave the door open for the future. “Already pledged” (1.06%)? What can you say to that one? “And which decade was that, sir?”

Lame humor aside, the point is that Kevin feels (and I agree), that, for this school, these categories meet the criterion of “contact made.” The others do not.

Our Comments on Percentage Contact Made and Pledge Money Raised for Each of Eight Categories of Attempts

Let’s go back to the title of this piece: “How Many Times to Keep Calling?” Maybe the simplest way to decide this question is to look at the contact rate as well as the pledge rate by attempt. Why not? So that’s what we did. You can see the results in Table 2 and Figure 1 and Table 3 and Figure 2.

Table 2: Number of Contacts Made and Percentage Contact Made For Each of Eight Categories of Attempts

Table 3: Total pledge dollars and mean pledge dollars received for each of eight categories of attempts

 We’ve taken a hard look at both these tables and figures, and we’ve concluded that they don’t really offer helpful guidelines for deciding when to stop calling at this school. Why? We don’t see a definitive number of attempts where it would make sense to stop.  To get specific, let’s go over the attempts:

  • 1st attempt: This attempt clearly yielded the most alums contacted (6,023) and the most dollars pledged ($79,316). However, stopping here would make little sense if only for the fact that the attempt yielded only a third of the $230,526 that would eventually be raised.
  • 2nd attempt: Should we stop here? Well, $49,385 was raised, and the contact rate has now jumped from about 50% to over 60%. We’d say keep going.
  • 3rd attempt: How about here? Over $30,000 raised and the contact rate has jumped even a bit higher. We’re not stopping.
  • 4th attempt: Here things start to go downhill a bit. The contact rate has fallen to about 43% and the total pledges raised have fallen below $20,000. However, if we stop here, we’ll be leaving more money on the table.
  • 5th attempt through 8 or more attempts: What can we say? Clearly the contact rates are not great for these attempts; they never get above the 40% level. Still, money for pledges continues to come in – over $50,000.

Even before we looked at the attempts data, we were convinced that the right question was not: “How many call attempts should be made before callers stop?” The right question was: “How many call attempts should be made with what alums?” In other words, with some alums it makes sense to keep calling until you reach them and have a chance to ask for a pledge. With others, that’s not a good strategy. In fact, it’s a waste of time and energy and money.

So, how do you identify those alums who should be called a lot and those who shouldn’t?

How We Built an Experimental Predictive Model for the Likelihood of Making Contact with a Given Alum

This was Kevin’s idea. Being a strong believer in data-driven decision making, he firmly believed it would be possible to build a predictive model for making contact with alums. The trick would be finding the right predictors.

Now we’re at a point in the paper where, if we’re not careful, we risk confusing you more than enlightening you. The concept of model building is simple. The problem is that constructing a model can get very technical; that’s where the confusing stuff creeps in.

So we’ll stay away from the technical side of the process and just try to cover the highpoints. For each of the 41,801 alumni included in this study we amassed data on the following variables:

  • Email (whether or not the alum had an email address listed in the database)
  • Lifetime hard credit dollars given to the school
  • Preferred class year
  • Year of last gift made over the phone (if one was ever made)
  • Marital status missing (whether or not the marital status field was blank for the alum)
  • Event Attendance (whether or not the alum had ever attended an event since graduation)

With these variables we used a technique called multiple regression to combine the variables into a score that could be used to predict an alum’s likelihood of being contacted by a caller. Because multiple regression is hard to get one’s arms around, we won’t try to explain that part of what we did. We’ll just ask you to trust us that it worked pretty well.
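To give a rough sense of the mechanics without the technical detail, here is a sketch of what combining those variables into a contact-likelihood score might look like. It assumes Python with pandas and statsmodels, not the authors’ actual software, and the column names are hypothetical.

```python
# A sketch of combining the six variables into a contact-likelihood score
# with multiple regression. Column names are hypothetical; this is not the
# authors' actual code or software.
import pandas as pd
import statsmodels.api as sm

calls = pd.read_csv("call_center_extract.csv")  # hypothetical 41,801-row extract

# Crude placeholder: treat "never gave by phone" as year 0, just for the sketch
calls["year_last_phone_gift"] = calls["year_last_phone_gift"].fillna(0)

predictors = ["email_listed", "lifetime_hard_credit", "preferred_class_year",
              "year_last_phone_gift", "marital_status_missing", "event_attended"]
X = sm.add_constant(calls[predictors])
y = calls["contact_made"]  # 1 = contact made, 0 = contact not made

model = sm.OLS(y, X).fit()
calls["contact_score"] = model.predict(X)  # a granular score for every alum
```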

What we will do is show you the relationship between three of the above variables and whether or not contact was made with an alum. This will give you a sense of why we included them as predictors in the model.

We’ll start with lifetime giving. Table 4 and Figure 3 show that as lifetime giving goes up, the likelihood of making contact with an alum also goes up. Notice that callers are more than twice as likely to make contact with alums who have given $120 or more lifetime (75.4%) than they are to make contact with alums whose lifetime giving is zero (34.9%).

Table 4: Number of Contacts Made and Percentage Contact Made for Three Levels of Lifetime Giving

How about Preferred Class Year? The relationship between this variable and contact rate is a bit complicated. You’ll see in Table 5 that we’ve divided class year into ten roughly equal size groups called “deciles.” The first decile includes alums whose preferred class year goes from 1964 to 1978. The second decile includes alums whose preferred class year goes from 1979 to 1985. The tenth decile includes alums whose preferred class year goes from 2008 to 2010.

A look at Figure 4 shows that contact rate is highest with the older alums and then gradually falls off as the class years get more recent. However, the rate rises a bit with the most recent alums. Without going into boring and confusing detail, we can tell you that we’re able to use this less than straight line relationship in building our model.

 

Table 5: Percentage Contact Made by Class Year Decile

The third variable we’ll look at is Event Attendance. Table 6 and Figure 5 show that, although relatively few alums (2,211) attended an event versus those who did not (35,590), the contact rate was considerably higher for the event attenders than the non-attenders: 58.3% versus 41.4%.

Table 6: Percentage Contact Made by Event Attendance

The predictive model we built generated a very granular score for each of the 41,801 alums in the study. To make it easier to see how these scores looked and worked, we collapsed the alums into ten roughly equal size groups (called deciles) based on the scores. The higher the decile the better the scores. (These deciles are, of course, different from the deciles we talked about for Preferred Class Year.)

Shortly we’ll talk about how we used these decile scores as a possible method for deciding when to stop calling. But first, let’s look at how these scores are related to both contact rate and pledging. Table 7 and Figure 6 deal with contact rate.

Table 7: Number of Contacts Made and Percentage Contact Made, by Contact Score Decile

Clearly, there is a strong relationship between the scores and whether contact was made. Maybe the most striking aspect of these displays is the contrast between contact rate for alums in the 10th decile and that for those in the first decile: 79.9% versus 19.2%. In practical terms, this means that, over time in this school, your callers are going to make contact with only one in every five alums in the first decile. But in the 10th decile? They should make contact with four in every five alums.

How about pledge rates?  We didn’t build this model to predict pledge rates. However, look at Table 8 and Figure 7. Notice the striking differences between the lower and upper deciles in terms of total dollars pledged. For example, we can compare the total pledge dollars received for the bottom 20% of alums called (deciles 1 and 2) and the top 20% of alums called (deciles 9 and 10): about $2,700 versus almost $200,000.

Table 8: Total Pledge Dollars and Mean Pledge Dollars Received by Contact Score Decile

How We Used the Model to See When It Might (And Might Not) Make Sense to Keep Calling an Alum

In this section we have a lot of tables and figures for you to look at. Specifically, you’ll see:

  • Both the number of contacts made and the contact rate by decile score level for each of the first six attempts. (We decided to cut things off at the sixth attempt for reasons we think you’ll find obvious.)
  • A table that shows the total pledge dollars raised for each attempt by decile score level.

Looked at from one perspective, there is a huge amount of information to absorb in all this. Looked at from another perspective, we believe there are a few obvious facts that emerge.

Go ahead and browse through the tables and figures for each of the six attempts. After you finish doing that, we’ll tell you what we see.

The First Attempt

Table 9: Number of Contacts Made and Percentage Contact Made, by Contact Score Decile for the First Attempt

The Second Attempt

Table 10: Number of Contacts Made and Percentage Contact Made, by Contact Score Decile for the Second Attempt

The Third Attempt

Table 11: Number of Contacts Made and Percentage Contact Made by Contact Score Decile for the Third Attempt

The Fourth Attempt

Table 12: Number of Contacts Made and Percentage Contact Made by Contact Score Decile for the Fourth Attempt

The Fifth Attempt

Table 13: Number of Contacts Made and Percentage Contact Made by Contact Score Decile for the Fifth Attempt

The Sixth Attempt

Table 14: Number of Contacts Made and Percentage Contact Made by Contact Score Decile for the Sixth Attempt

This is what we see:

  • For each of the six attempts, the contact rate increases as the score decile increases. There are some bumps and inconsistencies along the way (see Figure 10, for example), but this is clearly the overall pattern for each of the attempts.
  • For all the attempts, the contact rate for the lowest 20% of scores (deciles 1 and 2) is always substantially lower than the contact rate for the highest 20% of scores (deciles 9 and 10).
  • Once we reach the sixth attempt, the contact rates fall off dramatically for all but the tenth decile.

Now take a look at Table 15 that shows the total pledge money raised for each attempt (including the seventh attempt and eight or more attempts) by score decile. You can also look at Table 16 which shows the same information but with the amounts exceeding $1,000 highlighted in red.

Table 15: Total Pledge Dollars Raised In Each Attempt by Contact Score Decile

Table 16: Total Pledge Dollars Raised In Each Attempt by Contact Score Decile with Pledge Amounts Greater Than $1,000 Highlighted In Red

We could talk about these two tables in some detail, but we’d rather just say, “Wow!”
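If you want to build these attempt-by-decile tables from your own calling data, here is a rough sketch continuing the hypothetical example above; the attempts and pledge_dollars columns are assumed to exist in your extract.

```python
# Continuing the hypothetical sketch: collapse the contact scores into
# deciles and build attempt-by-decile tables like Tables 9 through 16.
# The "attempts" and "pledge_dollars" columns are assumed to exist.
calls["score_decile"] = pd.qcut(calls["contact_score"], 10, labels=False) + 1

# Contact rate by attempt number and score decile (cf. Tables 9-14)
contact_rates = calls.pivot_table(index="attempts", columns="score_decile",
                                  values="contact_made", aggfunc="mean")
print(contact_rates.round(3))

# Total pledge dollars by attempt number and score decile (cf. Tables 15-16)
pledge_totals = calls.pivot_table(index="attempts", columns="score_decile",
                                  values="pledge_dollars", aggfunc="sum")
print(pledge_totals.round(0))
```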

Some Concluding Remarks

We began this paper by saying that we wanted to introduce what might be the beginnings of a methodology for answering the question: “How many attempts should my callers make before they ‘make contact’ with an alum and either get a pledge or some other voice-to-voice response – or they give up and stop calling?”

We also said we’d like to walk you through a technique we’ve tried, and we’d like to ask you to send us some reactions to what we’ve done. So, if you’re willing, we’d really appreciate your getting back to us with some feedback on what we’ve done here.

Specifically, you might tell us how much you agree or disagree with these assertions:

  • There is no across-the-board number of attempts that you should apply in your program, or even to any segment in your program; the number of attempts you make to reach an alum very much depends on who that alum is.
  • There are some alums who should be called and called because you will eventually reach them and (probably) receive a pledge from them. There are other alums who should be called once, or not at all.
  • If the school we used in this paper is at all representative of other schools that do calling, then all across North America huge amounts of time and money are wasted trying to reach alums who will never be contacted and will never make a pledge.
  • Anyone who is at a high level of decision making regarding the annual fund (whether inside the institution or a vendor) should be leading the charge for the kind of data analysis shown in this paper. If they’re not, someone needs to have a polite little chat with them.

We look forward to getting your comments. (Comment below, or email Kevin MacDonell at kevin.macdonell@gmail.com.)

9 June 2011

Young alumni are a whole different animal

Filed under: Alumni, Annual Giving, Model building, Predictor variables — Tags: , — kevinmacdonell @ 12:23 pm

My Phonathon program hires about thirty students a year. These are mature, reliable employees whom I’d recommend to any prospective future employer. They’re also, well, young. When I was in university, many of them hadn’t even been born.

So, yeah, they’re different from me. They’re different in terms of girth, taste in music and facility with pop-culture references. And they’re different in the data.

Grads who are just beginning their careers as alumni will lack most of the engagement-related attributes we usually rely on for predictive models: event attendance, volunteer activity, employment updates, a business phone. Therefore, variables that relate to their recent student experience are likely to loom larger for them than for their older counterparts. At the same time, recent grads tend to have a richer variety of data in their records, as database usage has increased across the enterprise through the years.

These two differences mark young alumni as a distinct population: One, differences in the distribution of variables that all alumni share, and two, the existence of variables that only younger alumni can have.

It makes me wonder why I’m still lumping young alumni in with older alumni in my predictive models. You might recall that a while ago I was bragging about how well my Phonathon model worked to predict propensity to give in response to phone solicitation. I also mentioned that, unfortunately, the model under-performed in predicting acquisition of young donors.

Okay, it didn’t under-perform — it failed. I concluded that young alumni need their own, separate model.

Where do we draw the line for “young alumni”? One possibility is that we go with our program’s definition of young alums — for me, that’s anyone who has earned a degree in any of the past three years and is under 35. Others might use graduates of the last decade.

This might be fine, but keep in mind that the training sample in a predictive model doesn’t have to follow the strict definition of the population that the appeal is targeting. We need a critical mass of donors in our sample population in order to train the model, therefore we might be more successful if we drew a larger, more loosely-defined sample. Our sample will include some alumni who are slightly older than the alumni who will get the “young alum” appeal — that’s okay, because they’re in the sample for only one reason: training the model.

However you draw the line, the distinction rests on the answer to this question: Is the data that describes one group different from the data that describes another? They may all be alumni, but can they also be thought of as separate populations, in terms of the data that was collected on them?

If you audit the data in certain tables, you might be able to find an “information bump”. That’s what I call the approximate year in which an institution started collecting and storing a lot more information on incoming students. In the data I’m familiar with, that bump has occurred in the last 10 to 15 years.

One of the most noticeable areas where data recording has increased is in personal information. Nowadays you can find Social Security Number (or in Canada, Social Insurance Number), religion, ethnicity, next-of-kin information, citizenship, driver’s license status, even eye and hair colour. Auditing these fields will tell you when data collection was ramped up, but probably won’t yield many useful predictors as they don’t have much to do with engagement. Certain types of personal information may also be off limits to you.
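One quick way to run that kind of audit is to compute, for each class year, the percentage of records with each field populated. A minimal sketch, assuming Python with pandas and hypothetical field names:

```python
# A minimal audit sketch: for each class year, the share of records with
# each field populated. Field names are hypothetical.
import pandas as pd

alums = pd.read_csv("alumni_extract.csv")  # hypothetical extract

fields = ["religion", "ethnicity", "next_of_kin", "citizenship",
          "residence_flag", "athletics_count", "club_count"]

# True where the field is filled in, then averaged within each class year
fill_rates = alums[fields].notna().groupby(alums["class_year"]).mean()
print(fill_rates.round(2))
# A sharp jump in these rates marks the "information bump" year
```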

Investigate personal information if you can, but be sure to look around for other, more relevant data. Some examples:

  • Whether they lived in residence — If you don’t have direct access to this, the answer might be lurking in the alum’s past address data.
  • Athletics involvement — Count of activities, or a yes/no indicator.
  • Club and society activities — Count of activities, or a yes/no indicator.
  • Greek society membership — Yes/no.
  • Whether they were transfer students or received all of their degree credits from your institution
  • Whether they were employed on campus while a student
  • Whether they were recipients of awards, prizes, scholarships or bursaries
  • Whether they signed up for Email for Life, or otherwise kept their university email address or other university login active — In my data, more than 98% of the most recent grad class has an active university login. That drops to about 84% for the grad class of 2010, then 38% for 2009. The percentages continue to fall gradually from there. This attrition effect might hide the fact that retaining a student login past graduation is a strong indicator of affinity. I will write more on this topic in a future post.
  • Online community membership or activity

Oh, and don’t ignore the usual variables, such as marital status! In any conventional predictive model I’ve ever worked on, having a marital status of “single” in the database was a strong negative predictor of giving. But when I reduced my sample to graduates from the past ten years who were no older than 35, I was surprised to see that predictor turn into a strong positive. Although married alumni were still more likely to give, the “singles” were right behind them — and far ahead of the alumni for whom the marital status was missing. In my new model, I will use both “married” and “single” as predictors. Although the marrieds are more likely to be donors, there are relatively few of them; being coded single in our database could well prove to be a leading predictor of giving. (You will need to know, of course, why some alums are coded and others not. I’m still investigating.)
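A simple way to check this sort of thing in your own data is a quick breakdown of donor rates by marital code, restricted to the young alumni subset. A sketch, with hypothetical column names:

```python
# A sketch of checking donor rates by marital status for recent grads only.
# Column names are hypothetical.
import pandas as pd

alums = pd.read_csv("alumni_extract.csv")  # hypothetical extract

young = alums[(alums["years_since_grad"] <= 10) & (alums["age"] <= 35)]

# Share of donors (is_donor coded 1/0) and counts, by marital status code
print(young.groupby("marital_status")["is_donor"].agg(["mean", "count"]))
```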

When September rolls around, I’ll be another three months older, and there’s nothing I can do about that. At least I’ll know my hard-working callers will be well-focused, talking to the recent grads who are most ready to make their very first gift to the Annual Fund.

