CoolData blog

22 September 2014

What predictor variables should you avoid? Depends on who you ask

People who build predictive models will tell you that there are certain variables you should avoid using as predictors. I am one of those people. However, we disagree on WHICH variables one should avoid, and increasingly this conflicting advice is confusing those trying to learn predictive modeling.

The differences involve two points in particular. Assuming charitable giving is the behaviour we’re modelling for, they are:

  1. Whether we should use past giving to predict future giving, and
  2. Whether attributes such as marital status are really predictors of giving.

I will offer my opinions on both points. Note that they are opinions, not definitive answers.

1. Past giving as a predictor

I have always stressed that if you are trying to predict “giving” using a multiple linear regression model, you must avoid using “giving” as a predictor among your independent variables. That includes anything that is a proxy for “giving,” such as attendance at a donor-thanking event. This is how I’ve been taught and that is what I’ve adhered to in practice.

Examples that violate this practice keep popping up, however. I have an email from Atsuko Umeki, IT Coordinator in the Development Office of the University of Victoria in Victoria, British Columbia*. She poses this question about a post I wrote in July 2013:

“In this post you said, ‘In predictive models, giving and variables related to the activity of giving are usually excluded as variables (if ‘giving’ is what we are trying to predict). Using any aspect of the target variable as an input is bad practice in predictive modelling and is carefully avoided.’  However, in many articles and classes I read and took I was advised or instructed to include past giving history such as RFA*, Average gift, Past 3 or 5 year total giving, last gift etc. Theoretically I understand what you say because past giving is related to the target variable (giving likelihood); therefore, it will be biased. But in practice most practitioners include past giving as variables and especially RFA seems to be a good variable to include.”

(* RFA is a variation of the more familiar RFM score, based on giving history — Recency, Frequency, and Monetary value.)

So modellers-in-training are being told to go ahead and use ‘giving’ to predict ‘giving’, but that’s not all: Certain analytics vendors also routinely include variables based on past giving as predictors of future giving. Not long ago I sat in on a webinar, hosted by a consultant, that referenced the work of one well-known analytics vendor (no need to name the vendor here); it seemed that giving behaviour was present on both sides of the regression equation. Not surprisingly, this vendor “achieved” a fantastic R-squared value of 86%. (Fantastic as in “like a fantasy,” perhaps?)

This is not as arcane or technical as it sounds. When you use giving to predict giving, you are essentially saying, “The people who will make big gifts in the future are the ones who have made big gifts in the past.” This is actually true! The thing is, you don’t need a predictive model to produce such a prospect list; all you need is a list of your top donors.

Now, this might be reassuring to whoever is paying a vendor big bucks to create the model. That person sees names they recognize, and they think, ah, good — we are not too far off the mark. And if you’re trying to convince your boss of the value of predictive modelling, he or she might like to see the upper ranks filled with familiar names.

I don’t find any of that “reassuring.” I find it a waste of time and effort — a fancy and expensive way to produce a list of the usual suspects.

If you want to know who has given you a lot of money, you make a list of everyone in your database and sort it in descending order by total amount given. If you want to predict who in your database is most likely to give you a lot of money in the future, build a predictive model using predictors that are associated with having given large amounts of money. Here is the key point … if you include “predictors” that mean the same thing as “has given a lot of money,” then the result of your model is not going to look like a list of future givers — it’s going to look more like your historical list of past givers.

Does that mean you should ignore giving history? No! Ideally you’d like to identify the donors who have made four-figure gifts who really have the capacity and affinity to make six-figure gifts. You won’t find them using past giving as a predictor, because your model will be blinded by the stars. The variables that represent giving history will cause all other affinity-related variables to pale in comparison. Many will be rejected from the model for being not significant or for adding nothing additional to the model’s ability to explain the variance in the outcome variable.

To sum up, here are the two big problems with using past giving to predict future giving:

  1. The resulting insights are sensible but not very interesting: People who gave before tend to give again. Or, stated another way: “Donors will be donors.” Fundraisers don’t need data scientists to tell them that.
  2. Giving-related independent variables will be so highly correlated with giving-related dependent variables that they will eclipse more subtle affinity-related variables. Weaker predictors will end up getting kicked out of our regression analysis because they can’t move the needle on R-squared, or because they don’t register as significant. Yet, it’s these weaker variables that we need in order to identify new prospects.

Let’s try a thought experiment. What if I told you that I had a secret predictor that, once introduced into a regression analysis, could explain 100% of the variance in the dependent variable ‘Lifetime Giving’? That’s right — the highest value for R-squared possible, all with a single predictor. Would you pay me a lot of money for that? What is this magic variable that perfectly models the variance in ‘Lifetime Giving’? Why, it is none other than ‘Lifetime Giving’ itself! Any variable is perfectly correlated with itself, so why look any farther?

This is an extreme example. In a real predictive model, a predictor based on giving history would be restricted to giving from the past, while the outcome variable would be calculated from a more recent period — the last year or whatever. There should be no overlap. R-squared would not be 100%, but it would be very high.
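To make the “no overlap” requirement concrete, here is a minimal sketch of that kind of temporal split, assuming a hypothetical gift-level file with columns donor_id, gift_date and amount (the names are invented for illustration):

import pandas as pd

# Hypothetical gift-level data: one row per gift.
gifts = pd.read_csv("gifts.csv", parse_dates=["gift_date"])

cutoff = pd.Timestamp("2013-07-01")

# Predictor side: giving history strictly BEFORE the cutoff.
history = (gifts[gifts["gift_date"] < cutoff]
           .groupby("donor_id")["amount"]
           .agg(past_total="sum", past_gift_count="count", past_largest_gift="max"))

# Outcome side: giving ON OR AFTER the cutoff (the most recent year, say).
outcome = (gifts[gifts["gift_date"] >= cutoff]
           .groupby("donor_id")["amount"]
           .sum()
           .rename("future_giving"))

model_frame = history.join(outcome, how="left").fillna({"future_giving": 0})
# The past_* columns and future_giving never draw on the same gifts. Even so, as argued
# above, giving-history predictors will tend to swamp every other variable in the model.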

The R-squared statistic is useful for guiding you as you add variables to a regression analysis, or for comparing similar models in terms of fit with the data. It is not terribly useful for deciding whether any one model is good or bad. A model with an R-squared of 15% may be highly valuable, while one with R-squared of 75% may be garbage. If a vendor is trying to sell you on a model they built based on a high R-squared alone, they are misleading you.

The goal of predictive modeling for major gifts is not to maximize R-squared. It’s to identify new prospects.

2. Using “attributes” as predictors

Another thing about that webinar bugged me. The same vendor advised us to “select variables with caution, avoiding ‘descriptors’ and focusing on potential predictors.” Specifically, we were warned that a marital status of ‘married’ will emerge as correlated with giving. Don’t be fooled! That’s not a predictor, they said.

So let me get this straight. We carry out an analysis that reveals that married people are more likely to give large gifts, that donors with more than one degree are more likely to give large gifts, that donors who have email addresses and business phone numbers in the database are more likely to give large gifts … but we are supposed to ignore all that?

The problem might not be the use of ‘descriptors’; the problem might be with the terminology. Maybe we need to stop using the word ‘predictor’. One experienced practitioner, Alexander Oftelie, briefly touched on this nuance in a recent blog post. I quote (emphasis added by me):

“Data that on its own may seem unimportant — the channel someone donates, declining to receive the mug or calendar, preferring email to direct mail, or making ‘white mail’ or unsolicited gifts beyond their sustaining-gift donation — can be very powerful when they are brought together to paint a picture of engagement and interaction. Knowing who someone is isn’t by itself predictive (at best it may be correlated). Knowing how constituents choose to engage or not engage with your organization are the most powerful ingredients we have, and its already in our own garden.”

I don’t intend to critique Alexander’s post, which isn’t even on this particular topic. (It’s a good one – please read it.) But since he’s written this, permit me to scratch my head about it a bit.

In fact, I think I agree with him that there is a distinction between a behaviour and a descriptor/attribute. A behaviour, an action taken at a specific point in time (eg., attending an event), can be classified as a predictor. An attribute (“who someone is,” eg., whether they are married or single) is better described as a correlate. I would also be willing to bet that if we carefully compared behavioural variables to attribute variables, the behaviours would outperform, as Alexander says.

In practice, however, we don’t need to make that distinction. If we are using regression to build our models, we are concerned solely and completely with correlation. To say “at best it may be correlated” suggests that predictive modellers have something better at their disposal that they should be using instead of correlation. What is it? I don’t know, and Alexander doesn’t say.

If in a given data set, we can demonstrate that being married is associated with likelihood to make a donation, then it only makes sense to use that variable in our model. Choosing to exclude it based on our assumption that it’s an attribute and not a behaviour doesn’t make business sense. We are looking for practical results, after all, not chasing some notion of purity. And let’s not fool ourselves, or clients, that we are getting down to causation. We aren’t.

Consider that at least some “attributes” can be stated in terms of a behaviour. People get married — that’s a behaviour, although not related to our institution. People get married and also tell us about it (or allow it to be public knowledge so that we can record it) — that’s also a behaviour, and potentially an interaction with us. And on the other side of the coin, behaviours or interactions can be stated as attributes — a person can be an event attendee, a donor, a taker of surveys.

If my analysis informs me that widowed female alumni over the age of 60 are extremely good candidates for a conversation about Planned Giving, then are you really going to tell me I’m wrong to act on that information, just because sex, age and being widowed are not “behaviours” that a person voluntarily carries out? Mmmm — sorry!

Call it quibbling over semantics if you like, but don’t assume it’s so easy to draw a circle around true predictors. There is only one way to surface predictors, which is to take a snapshot of all potentially relevant variables at a point in time, then gather data on the outcome you wish to predict (eg., giving) after that point in time, and then assess each variable in terms of the strength of association with that outcome. The tools we use to make that assessment are nothing other than correlation and significance. Again, if there are other tools in common usage, then I don’t know about them.
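As a rough sketch of that assessment step (column names invented for illustration), one could compute the correlation and significance of each snapshot variable against the later outcome:

import pandas as pd
from scipy.stats import pearsonr

# Hypothetical snapshot: one row per constituent, variables captured at a point in time,
# plus a 1/0 outcome measured after that point (e.g., made a gift in the following year).
snap = pd.read_csv("snapshot.csv")

candidates = ["married", "degree_count", "has_email", "has_business_phone", "event_attendee"]

results = []
for var in candidates:
    r, p = pearsonr(snap[var], snap["gave_after_snapshot"])
    results.append({"variable": var, "correlation": r, "p_value": p})

print(pd.DataFrame(results).sort_values("correlation", ascending=False))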

Caveats and concessions

I don’t maintain that this or that practice is “wrong” in all cases, nor do I insist on rules that apply universally. There’s a lot of art in this science, after all.

Using giving history as a predictor:

  • One may use some aspects of giving to predict outcomes that are not precisely the same as ‘Giving’, for example, likelihood to enter into a Planned Giving arrangement. The required degree of difference between predictors and outcome is a matter of judgement. I usually err on the side of scrupulously avoiding ANY leakage of the outcome side of the equation into the predictor side — but sure, rules can be bent.
  • I’ve explored the use of very early giving (the existence and size of gifts made by donors before age 30) to predict significant giving late in life. (See Mine your donor data with this baseball-inspired analysis.) But even then, I don’t use that as a variable in a model; it’s more of a flag used to help select prospects, in addition to modeling.

Using descriptors/attributes as predictors:

  • Some variables of this sort will appear to have subtly predictive effects in-model, effects that disappear when the model is deployed and new data starts coming in. That’s regrettable, but it’s something you can learn from — not a reason to toss all such variables into the trash, untested. The association between marital status and giving might be just a spurious correlation — or it might not be.
  • Business knowledge mixed with common sense will help keep you out of trouble. A bit of reflection should lead you to consider using ‘Married’ or ‘Number of Degrees’, while ignoring ‘Birth Month’ or ‘Eye Colour’. (Or astrological sign!)

There are many approaches one can take with predictive modeling, and naturally one may feel that one’s chosen method is “best”. The only sure way to proceed is to take the time to define exactly what you want to predict, try more than one approach, and then evaluate the performance of the scores when you have actual results available — which could be a year after deployment. We can listen to what experts are telling us, but it’s more important to listen to what the data is telling us.

//////////

Note: When I originally posted this, I referred to Atsuko Umeki as “he”. I apologize for this careless error and for whatever erroneous assumption that must have prompted it.

16 July 2013

Alumni engagement scoring vs. predictive modelling

Filed under: Alumni, engagement, predictive modeling — kevinmacdonell @ 8:06 am

Alumni engagement scoring has an undeniable appeal. What could be simpler? Just add up how many events an alum has attended, add more points for volunteering, add more points for supporting the Annual Fund, and maybe some points for other factors that seem related to engagement, and there you have your score. If you want to get more sophisticated, you can try weighting each score input, but generally engagement scoring doesn’t involve any advanced statistics and is easily grasped.
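For illustration only, that kind of additive score might be computed like this; the weights and column names are invented, and a real scheme would use whatever inputs and weights the stakeholders agree on:

import pandas as pd

alumni = pd.read_csv("alumni.csv")  # hypothetical columns referenced below

# Simple additive engagement score; the weights come from judgement, not statistics.
alumni["engagement_score"] = (
    2 * alumni["events_attended"]        # points per event attended
    + 5 * alumni["is_volunteer"]         # flat bonus for volunteering (1/0)
    + 3 * alumni["annual_fund_donor"]    # flat bonus for supporting the Annual Fund (1/0)
)

# Typical use: compare groups with each other, not rank individuals.
print(alumni.groupby("faculty")["engagement_score"].mean().sort_values(ascending=False))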

Not so with predictive modelling, which does involve advanced stats and isn’t nearly as intuitive; often it’s not possible to really say how an input variable is related to the outcome. It’s tempting, too, to think of an engagement score as being a predictor of giving and therefore a good replacement for modelling. Actually, it should be predictive — if it isn’t, your score is not measuring the right things — but an engagement score is not the same thing as a predictive model score. They are different tools for different jobs.

Not only are engagement scoring schemes different from predictive models, their simplicity is deceptive. Engagement scoring is incomplete without some plan for acting on observed trends with targeted programming. This implies the ability to establish causal drivers of engagement, which is a tricky thing.

That’s a sequence of events — not a one-time thing. In fact, engagement scoring is like checking the temperature at regular intervals over a long period of time, looking for up and down trends not just for the group as a whole but via comparisons of important subgroups defined by age, sex, class year, college, degree program, geography or other divisions. This requires discipline: taking measurements in exactly the same way every year (or quarter, or what-have-you). If the score is fed by a survey component, you must survey constantly and consistently.

Predictive models and engagement scores have some surface similarities. They share variables in common, the output of both is a numerical score applied to every individual, and both require database work and math in order to calculate them. Beyond that, however, they are built in different ways and for different purposes. To summarize:

  • Predictive models are collections of potentially dozens of database variables weighted according to strength of correlation with a well-defined behaviour one is trying to predict (eg. making a gift), in order to rank individuals by likelihood to engage in that behaviour. Both Alumni Relations and Development can benefit from the use of predictive models.
  • Engagement scores are collections of a very few selectively-chosen database variables, either not weighted or weighted according to common sense and intuition, in order to roughly quantify the quality of “engagement”, however one wishes to define that term, for each individual. The purpose is to allow comparison of groups (faculties, age bands, geographical regions, etc.) with each other. Comparisons may be made at one point in time, but it is more useful to compare relative changes over time. The main user of scores is Alumni Relations, in order to identify segments requiring targeted programming, for example, and to assess the impact of programming on targeted segments over time.

Let’s explore key differences in more depth:

The purpose of modelling is prediction, for ranking or segmentation. The purpose of engagement scoring is comparison.

Predictive modelling scores are not usually included in reports. Used immediately in decision making, they may never be seen by more than one or two people. Engagement scores are included in reports and dashboards, and influence decision-making over a long span of time.

The target variable of a predictive model is quantifiable (eg. giving, measurable in dollars). In engagement scoring, there is no target variable, only an output – a construct called “engagement”, which itself is not directly measurable.

Potential input variables for predictive models are numerous (100+) and vary from model to model. Input variables for engagement scores are limited to a handful of easily measured attributes (giving, event attendance, volunteering) which must remain consistent over time.

Variables for predictive models are chosen primarily using statistical methods (correlation) and only secondarily using judgment and “common sense.” For example, if the presence of a business phone number is highly correlated with being a donor, it may be included in the model. For engagement scores, variables are chosen by consensus of stakeholders, primarily according to subjective standards. For example, event attendance and giving would probably be deemed by the committee to indicate engagement, and would therefore be included in the score. Advanced statistics rarely come into play. (For more thoughts on this, read How you measure alumni engagement is up to you.)

In predictive models, giving and variables related to the activity of giving are usually excluded as variables (if ‘giving’ is what we are trying to predict). Using any aspect of the target variable as an input is bad practice in predictive modelling and is carefully avoided. You wouldn’t, for example, use attendance at a donor recognition event to predict likelihood to give. In engagement scoring, though, giving history is usually a key input, as it is common sense to believe that being a donor is an indication of engagement. (It might be excluded or reported separately if the aim is to demonstrate the causal link between engagement indicators and giving.)

Modelling variables are weighted using multiple linear regression or another statistical method that calculates the relative influence of each variable while simultaneously controlling for the influence of all other variables in the model. Engagement score variables are usually weighted according to gut feel. For example, coming to campus for Homecoming seems to carry more weight than showing up for a pub night in one’s own city, therefore we give it more weight.
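The contrast might look something like this in practice (variable names invented): the model’s weights fall out of the regression fit, while the engagement weights are simply asserted.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("constituents.csv")  # hypothetical snapshot data

# Predictive model: weights estimated by multiple linear regression, each coefficient
# reflecting a variable's influence while controlling for the other variables.
X = sm.add_constant(df[["homecoming_attendee", "pub_night_attendee", "is_volunteer"]])
print(sm.OLS(df["gave_last_year"], X).fit().params)

# Engagement score: weights assigned by gut feel (Homecoming "feels" weightier than a pub night).
df["engagement_score"] = (3 * df["homecoming_attendee"]
                          + 1 * df["pub_night_attendee"]
                          + 2 * df["is_volunteer"])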

The quality of a predictive model is testable, first against a validation data set, and later against actual results. But there is no right or wrong way to estimate engagement, therefore the quality of scores cannot be evaluated conclusively.

The variables in a predictive model have complex relationships with each other that are difficult or impossible to explain except very generally. Usually there is no reason to explain a model in detail. The components in an engagement score, on the other hand, have plausible (although not verifiable) connections to engagement. For example, volunteering is indicative of engagement, while Name Prefix is irrelevant.

Predictive models are built for a single, time-limited purpose and then thrown away. They evolve iteratively and are ever-changing. On the other hand, once established, the method for calculating an engagement score must not change if comparisons are to be made over time. Consistency is key.

Which is all to say: alumni engagement scoring is not predictive modelling. (And neither is RFM analysis.) Only predictive modelling is predictive modelling.

13 November 2012

Making a case for modeling

Guest post by Peter Wylie and John Sammis

(Click here to download post as a print-friendly PDF: Making a Case for Modeling – Wylie Sammis)

Before you wade too far into this piece, let’s be sure we’re talking to the right person. Here are some assumptions we’re making about you:

  • You work in higher education advancement and are interested in analytics. However, you’re not a sophisticated stats person who throws around terms like regression and cluster analysis and neural networks.
  • You’re convinced that your alumni database (we’ll leave “parents” and “friends” for a future paper) holds a great deal of information that can be used to pick out the best folks to appeal to — whether by mail, email, phone, or face-to-face visits.
  • Your boss and your boss’s bosses are, at best, less convinced than you are about this notion. At worst, they have no real grasp of what analytics (data mining and predictive modeling) are. And they may seem particularly susceptible to sales pitches from vendors offering expensive products and services for using your data – products and services you feel might cause more problems than they will solve.
  • You’d like to find a way to bring these “boss” folks around to your way of thinking, or at least move them in the “right” direction.

If we’ve made some accurate assumptions here, great. If we haven’t, we’d still like you to keep reading. But if you want to slip out the back of the seminar room, not to worry. We’ve done it ourselves more times than you can count.

Okay, here’s something you can try:

1. Divide the alums at your school into ten roughly equal size groups (deciles) by class year. Table 1 is an example from a medium-sized four-year college.

Table 1: Class Years and Counts for Ten Roughly Equal Size Groups (Deciles) of Alumni at School A

2. Create a very simple score:

EMAIL LISTED(1/0) + HOME PHONE LISTED(1/0)

This score can assume three values: “0”, “1”, or “2.” A “0” means the alum has neither an email nor a home phone listed in the database. A “1” means the alum has either an email listed in the database or a home phone listed in the database, but not both. A “2” means the alum has both an email and a home phone listed in the database.

3. Create a table that contains the percentage of alums who have contributed at least $1,000 lifetime to your school for each score level for each class year decile. Table 2 is an example of such a table for School A.

Table 2: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School A
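To produce this kind of table from your own data, a minimal sketch of steps 1 through 3 might look like the following (the column names are hypothetical):

import pandas as pd

alums = pd.read_csv("alumni.csv")  # hypothetical columns: class_year, email, home_phone, lifetime_giving

# Step 1: ten roughly equal-size class year groups; decile 1 holds the oldest classes.
alums["decile"] = pd.qcut(alums["class_year"].rank(method="first"), 10, labels=False) + 1

# Step 2: the simple 0/1/2 score.
alums["score"] = alums["email"].notna().astype(int) + alums["home_phone"].notna().astype(int)

# Step 3: percentage at each score level, in each decile, who have given at least $1,000 lifetime.
alums["gave_1000_plus"] = (alums["lifetime_giving"] >= 1000).astype(int)
table = (alums.pivot_table(index="decile", columns="score",
                           values="gave_1000_plus", aggfunc="mean") * 100).round(1)
print(table)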

 

4. Create a three-dimensional chart that conveys the same information contained in the table. Figure 1 is an example of such a chart for School A.

In the rest of this piece we’ll be showing tables and charts from seven other very diverse schools that look quite similar to the ones you’ve just seen. At the end, we’ll step back and talk about the importance of what emerges from these charts. We’ll also offer advice on how to explain your own tables and charts to colleagues and bosses.

If you think the above table and chart are clear, go ahead and start browsing through what we’ve laid out for the other seven schools. However, if you’re not completely sure you understand the table and the chart, see if the following hypothetical questions and answers help:

Question: “Okay, I’m looking at Table 2 where it shows 53% for alums in Decile 1 who have a score of 2. Could you just clarify what that means?”

Answer. “That means that 53% of the oldest alums at the school who have both a home phone and an email listed in the database have given at least $1,000 lifetime to the school.”

Question. “Then … that means if I look to the far left in that same row where it shows 29% … that means that 29% of the oldest alums at the school who have neither a home phone nor an email listed in the database have given at least $1,000 lifetime to the school?”

Answer. “Exactly.”

Question. “So those older alums who have a score of 2 are way better givers than those older alums who have a score of 0?”

Answer. “That’s how we see it.”

Question. “I notice that in the younger deciles, regardless of the score, there are a lot of 0 percentages or very low percentages. What’s going on there?”

Answer. “Two things. One, most younger alums don’t have the wherewithal to make big gifts. They need years, sometimes many years, to get their financial legs under them. The second thing? Over the last seven years or so, we’ve looked at the lifetime giving rates of hundreds and hundreds of four-year higher education institutions. The news is not good. In many of them, well over half of the solicitable alums have never given their alma maters a penny.”

Question. “So, maybe for my school, it might be good to lower that giving amount to something like ‘has given at least $500 lifetime’ rather than $1,000 lifetime?”

Answer. Absolutely. There’s nothing sacrosanct about the thousand dollar level that we chose for this piece. You can certainly lower the amount, but you can also raise the amount. In fact, if you told us you were going to try several different amounts, we’d say, “Fantastic!”

Okay, let’s go ahead and have you browse through the rest of the tables and charts for the seven schools we mentioned earlier. Then you can compare your thoughts on what you’ve seen with what we think is going on here.

(Note: After looking at a few of the tables and charts, you may find yourself saying, “Okay, guys. Think I got the idea here.” If so, go ahead and fast forward to our comments.)

Table 3: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School B

 

Table 4: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School C

Table 5: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School D

Table 6: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School E

Table 7: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School F

Table 8: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School G

Table 9: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School H

Definitely a lot of tables and charts. Here’s what we see in them:

  • We’ve gone through the material you’ve just seen many times. Our eyes have always been drawn to the charts; we use the tables for back-up. Even though we’re data geeks, we almost always find charts more compelling than tables. That is most certainly the case here.
  • We find the patterns in the charts across the seven schools remarkably similar. (We could have included examples from scores of other schools. The patterns would have looked the same.)
  • The schools differ markedly in terms of giving levels. For example, the alums in School C are clearly quite generous in contrast to the alums in School F. (Compare Figure 3 with Figure 6.)
  • We’ve never seen an exception to one of the obvious patterns we see in these data: The longer alums have been out of school, the more money they have given to their school.
  • The “time out of school” pattern notwithstanding, we continue to be taken by the huge differences in giving levels (especially among older alums) across the levels of a very simple score. School G is a prime example. Look at Figure 7 and look at Table 8. Any way you look at these data, it’s obvious that alums who have even a score of “1” (either a home phone listed or an email listed, but not both) are far better givers than alums who have neither listed.

Now we’d like to deal with an often advanced argument against what you see here. It’s not at all uncommon for us to hear skeptics say: “Well, of course alumni on whom we have more personal information are going to be better givers. In fact we often get that information when they make a gift. You could even say that amount of giving and amount of personal information are pretty much the same thing.”

We disagree for at least two reasons:

Amount of personal information and giving in any alumni database are never the same thing. If you have doubts about our assertion, the best way to dispel those doubts is to look in your own alumni database. Create the same simple score we have for this piece. Then look at the percentage of alums for each of the three levels of the score. You will find plenty of alums who have a score of 0 who have given you something, and you will find plenty of alums with a score of 2 who have given you nothing at all.

We have yet to encounter a school where the IT folks can definitively say how an email address or a home phone number got into the database for every alum. Why is that the case? Because email addresses and home phone numbers find their way into an alumni database in a variety of ways. Yes, sometimes they are provided by the alum when he or she makes a gift. But there are other ways. To name a few:

  • Alums (givers or not) can provide that information when they respond to surveys or requests for information to update directories.
  • There are forms that alums fill out when they attend a school sponsored event that ask for this kind of information.
  • There are vendors who supply this kind of information.

Now here’s the kicker. Your reactions to everything you’ve seen in this piece are critical. If you’re going to go to a skeptical boss to try to make a case for scouring your alumni database for new candidates for major giving, we think you need to have several reactions to what we’ve laid out here:

1. “WOW!” Not, “Oh, that’s interesting.” It’s gotta be, “WOW!” Trust us on this one.

2. You have to be champing at the bit to create the same kinds of tables and charts that you’ve seen here for your own data.

3. You have to look at Table 2 (that we’ve recreated below) and imagine it represents your own data.

Table 2: Percentage of Alumni at Each Simple Score Level at Each Class Year Decile Who Have Contributed at Least $1,000 Lifetime to School A

Then you have to start saying things like:

“Okay, I’m looking at the third class year decile. These are alums who graduated between 1977 and 1983. Twenty-five percent of them with a score of 2 have given us at least $1,000 lifetime. But what about the 75% who haven’t yet reached that level? Aren’t they going to be much better bets for bigger giving than the 94% of those with a score of 0 who haven’t yet reached the $1,000 level?”

“A score that goes from 0 to 2? Really? What about a much more sophisticated score that’s based on lots more information than just email listed and home phone listed? Wouldn’t it make sense to build a score like that and look at the giving levels for that more sophisticated score across the class year deciles?”

If your reactions have been similar to the ones we’ve just presented, you’re probably getting very close to trying to make your case to the higher-ups. Of course, how you make that case will depend on who you’ll be talking to, who you are, and situational factors that you’re aware of and we’re not. But here are a few general suggestions:

Your first step should be making up the charts and figures for your own data. Maybe you have the skills to do this on your own. If not, find a technical person to do it for you. In addition to having the right skills, this person should think doing it would be cool and shouldn’t take forever to finish it.

Choose the right person to show our stuff and your stuff to. More and more we’re hearing people in advancement say, “We just got a new VP who really believes in analytics. We think she may be really receptive to this kind of approach.” Obviously, that’s the kind of person you want to approach. If you have a stodgy boss in between you and that VP, find a way around your boss. There are lots of ways to do that.

Do what mystery writers do; use the weapon of surprise. Whichever boss you go to, we’d recommend that you show them this piece first. After you know they’ve read it, ask them what they thought of it. If they say anything remotely similar to: “I wonder what our data looks like,” you say, “Funny you should ask.”

Whatever your reactions to this piece have been, we’d love to hear them.

10 October 2012

Logistic vs. multiple regression: Our response to comments

Guest post by John Sammis and Peter B. Wylie

Thanks to all of you who read and commented on our recent paper comparing logistic regression with multiple regression. We were not sure how popular this topic would be, but Kevin told us that interest was high, and there were a number of comments and questions. There were several general themes in the comments; Kevin has done an excellent job responding, but we thought we should throw in our two cents.

Why not just use logistic?

The point of our paper was not to suggest that logistic regression should not be used — our point was that multiple regression can achieve prediction results quite similar to logistic regression. Based on our experience working with and training fundraising professionals getting introduced to analytics, logistic regression can be intimidating. Our goal is always to get these folks to use analytics to help with their fundraising initiatives. We find many of them catch on with multiple regression, and much less so with logistic regression.

Predicted values vs. probabilities

We understand that the predicted values generated by multiple regression are different from the probabilities generated by logistic regression. Regardless of the statistical modeling technique we use, we always bin the raw prediction or probability values into equal-sized score levels. We have found that score level bins are easier to use than raw values. And using equal-sized score levels allows for easier evaluation of the scoring model.
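As a sketch of that binning step (toy numbers rather than real data):

import numpy as np
import pandas as pd

# Toy raw prediction/probability values for 1,000 constituents.
rng = np.random.default_rng(0)
scores = pd.DataFrame({"raw": rng.random(1000)})

# Bin the raw values into ten equal-sized score levels, 1 (lowest) to 10 (highest).
scores["score_level"] = pd.qcut(scores["raw"].rank(method="first"), 10, labels=False) + 1

# Each level holds roughly the same number of records, so either technique's output
# can be evaluated level by level on the same footing.
print(scores["score_level"].value_counts().sort_index())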

“I cannot agree”

Some commenters, knowledgeable about statistics, said they would not use multiple regression when the inputs called for logistic. According to the rules, if the target variable is binary, then linear modelling doesn’t make sense — and the rules must be obeyed. In our view, this rigid approach to method selection is inappropriate for predictive modelling. The use of multiple linear regression in place of logistic regression may not always make theoretical sense, but predictive modellers are concerned with whether or not a model produces an output that is useful in practical terms. The worth of a model is testable against new, real-world data, therefore a model has only one criterion for determining “appropriate” use: Whether it really predicts what the modeler claims it will predict. The truth is revealed during evaluation.

A modest proposal

No one reading this should simply take our word that these two dissimilar methods yield similar results. Neither should anyone dismiss the claim out of hand without providing a critique based on real data. We would encourage you to try something on your own with your own data using both techniques and show us what you find. In particular, graduate students looking for a thesis or dissertation topic might consider producing something under this title: “Comparing Logistic Regression and Multiple Regression as Techniques for Predicting Major Giving.”

Heck! Peter says that if anyone were interested in doing a study like this for a thesis or dissertation, he would be willing to offer advice on how to:

  1. Do a thorough literature review
  2. Formulate specific research questions
  3. Come up with a study design
  4. Prepare a proposal that would satisfy a thesis or dissertation committee.

That’s quite an offer. How about it?

20 September 2012

When less data is more, in predictive modelling

When I started doing predictive modelling, I was keenly interested in picking the best and coolest predictor variables. As my understanding deepened, I turned my attention to how to define the dependent variable in order to really get at what I was trying to predict. More recently, however, I’ve been thinking about refining or limiting the population of constituents to be scored, and how that can help the model.

What difference does it make who gets a propensity score? Up until maybe a year ago, I wasn’t too concerned. Sure, probably no 22-year-old graduate had ever entered a planned giving agreement, but I didn’t see any harm in applying a score to all our alumni, even our youngest.

Lately, I’m not so sure. Using the example of a planned gift propensity model, the problem is this: Young alumni don’t just get a score; they also influence how the model is trained. If all your current expectancies were at least 50 years old before they decided to make a bequest, and half your alumni are under 30 years old, then one of the major distinctions your model will make is based on age. ANY alum over 50 is going to score well, regardless of whether he or she has any affinity to the institution, simply because 100% of your target is in that age group.

The model is doing the right thing by giving higher scores to older alumni. If ages in the sample range from 21 to 100+, then age as a variable will undoubtedly contribute to a large chunk of the model’s ability to “explain” the target. But this hardly tells us anything we didn’t already know. We KNOW that alumni don’t make bequest arrangements at age 22, so why include them in the model?

It’s not just that giving them a score is irrelevant. I’m concerned about allowing good predictor variables to interact with ‘Age’ in a way that compromises their effectiveness. Variables end up being moderated by ‘Age’ without the model getting any better at delivering what we actually want from it.

Note that we don’t have to explicitly enter ‘Age’ as a variable in the model for young alumni to influence the outcome in undesirable ways. Here’s an example, using event attendance as a predictor:

Let’s say a lot of very young alumni and some very elderly constituents attend their class reunions. The older alumni who attend reunions are probably more likely than their non-attending classmates to enter into planned giving agreements — for my institution, that is definitely the case. On the other hand, the young alumni who attend reunions are probably no more or less likely than their non-attending peers to consider planned giving — no one that age is a serious prospect. What happens to ‘event attendance’ as a predictor in a model in which the dependent variable is ‘Current planned giving expectancy’? … Because a lot of young alumni who are not in the target group attended events, the attribute of being an event attendee will be associated with NOT being a planned giving expectancy. Or at the very least, it will considerably dilute the positive association between predictor and target found among older alumni.

I confirmed this recently using some partly made-up data. The data file started out as real alumni data and included age, a flag for who is a current expectancy, and a flag for ‘event attendee’. I massaged it a bit by artificially bumping up the number of alumni under the age of 50 who were coded as having attended an event, to create a scenario in which an institution’s events are equally popular with young and old alike. In a simple regression model with the entire alumni file included in the sample, ‘event attendance’ was weakly associated with being a planned giving expectancy. When I limited the sample to alumni 50 years of age and older, however, the R squared statistic doubled. (That is, event attendance was about twice as effective at explaining the target.) Conversely, when I limited the sample to under-50s, R squared was nearly zero.
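If you want to run the same check on your own file, a rough sketch (hypothetical column names) is to fit the same one-predictor model on the full sample and on age-restricted subsets, then compare R-squared:

import pandas as pd
import statsmodels.api as sm

alums = pd.read_csv("alumni.csv")  # hypothetical columns: age, event_attendee (1/0), pg_expectancy (1/0)

def r_squared(frame):
    X = sm.add_constant(frame[["event_attendee"]])
    return sm.OLS(frame["pg_expectancy"], X).fit().rsquared

print("All alumni:     ", round(r_squared(alums), 4))
print("Age 50 and over:", round(r_squared(alums[alums["age"] >= 50]), 4))
print("Under 50:       ", round(r_squared(alums[alums["age"] < 50]), 4))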

True, I had to tamper with the data in order to get this result. But even had I not, there would still have been many under-50 event attendees, and their presence in the file would still have reduced the observed correlation between event attendance and planned giving propensity, to no useful end.

You probably already know that it’s best not to lump deceased constituents in with living ones, or non-alumni along with alumni, or corporations and foundations along with persons. They are completely distinct entities. But depending on what you’re trying to predict, your population can fruitfully be split along other, more subtle distinctions. Here are a few:

  • For donor acquisition models, in which the target value is “newly-acquired donor”, exclude all renewed donors. You strictly want to have only newly-acquired donors and never-donors in your model. Your good prospects for conversion are the never-donors who most resemble the newly-acquired donors. Renewed donors don’t serve any purpose in such a model and will muddy the waters considerably.
  • Conversely, remove never-donors from models that predict major giving and leadership-level annual giving. Those higher-level donors tend not to emerge out of thin air: They have giving histories.
  • Looking at ‘Age’ again … making distinctions based on age applies to major-gift propensity models just as it does to planned giving propensity: Very young people do not make large gifts. Look at your data to find out at what age donors were when they first gave $1,000, say. This will help inform what your cutoff should be.
  • When building models specifically for Phonathon, whether donor-acquisition or contact likelihood, remove constituents who are coded Do Not Call or who do not have a valid phone number in the database, or who are unlikely to be called (international alumni, perhaps).
  • Exclude international alumni from event attendance or volunteering likelihood models, if you never offer involvement opportunities outside your own country or continent.

Those are just examples. As for general principles, I think both of the following conditions must be met in order for you to gain from excluding a group of constituents from your model. By a “group” I mean any collection of individuals who share a certain trait. Choose to exclude IF:

  1. Nearly 100% of constituents with the trait fall outside the target behaviour (that is, the behaviour you are trying to predict); AND,
  2. Having a score for people with that trait is irrelevant (that is, their scores will not result in any action being taken with them, even if a score is very low or very high).

You would apply the “rules” like this … You’re building a model to predict who is most likely to answer the phone, for use by Phonathon, and you’re wondering what to do with a bunch of alumni who are coded Do Not Call. Well, it stands to reason that 1) people with this trait will have little or no phone contact history in the database (the target behaviour), and 2) people with this trait won’t be called, even if they have a very high contact-likelihood score. The verdict is “exclude.”
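In practice the “exclude” verdict is just a filter applied before the model is trained or scored. A sketch, with hypothetical flags:

import pandas as pd

alums = pd.read_csv("alumni.csv")  # hypothetical columns: do_not_call (1/0), phone, country

# Keep only constituents who could actually be called and acted on.
callable_pool = alums[
    (alums["do_not_call"] == 0)
    & alums["phone"].notna()
    & (alums["country"] == "Canada")   # example only: drop international alumni Phonathon never calls
]

# Build (and score) the contact-likelihood model on callable_pool, not on the full file.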

It’s not often you’ll hear me say that less (data) is more. Fewer cases in your data file will in fact tend to depress your model’s R squared. But your ultimate goal is not to maximize R squared — it’s to produce a model that does what you want. Fitting the data is a good thing, but only when you have the right data.

20 August 2012

Logistic regression vs. multiple regression

Filed under: John Sammis, Model building, Peter Wylie, predictive modeling, regression, Statistics — kevinmacdonell @ 5:13 am

by Peter Wylie, John Sammis and Kevin MacDonell

(Click to download printer-friendly PDF: Logistic vs MR-Wylie Sammis MacDonell)

The three of us talk about this issue a lot because we encounter a number of situations in our work where we need to choose between these two techniques. Many of our late night/early morning phone/internet discussions have been gobbled up by talking about which technique seems to be better under what circumstances. More than a few times, I’ve suggested we write something up about our experience with both techniques. In the end we’ve always decided to put off doing that because … well, because we’ve thought it might put a lot of people to sleep. Disagree as we might about lots of things, we’re of one mind on the dictum: “Don’t bore people.” They have enough tedious stuff in their lives; we don’t need to add to their burden.

On the other hand, as analytics has started to sink its teeth more and more into the world of advancement, it seems there is a group of folks out there who wrestle with the same issue. And the issue seems to be this:

“If I have a binary dependent variable (e.g., major giver/ non major giver, volunteer/non-volunteer, reunion attender/non-reunion attender, etc.), which technique should I use? Logistic regression or multiple regression?”

We considered a number of ways to try to answer this question:

  • We could simply assert an opinion based on our bank of experience with both techniques.
  • We could show you the results of a number of data sets using both techniques and then offer our opinion.
  • We could show you a way to compare both techniques using some of your own data.

We chose the third option because we think there is no better way to learn about a statistical technique than by using the technique on real data. Whenever we’ve done this sort of exploring ourselves, we’ve been humbled by how much we’ve learned.

Before we show you a way to compare the two techniques, we’ll offer some thoughts on why this question (“Should I use logistic regression or multiple regression?”) is so tough to find an answer to. If you’re anxious to move on to our comparison process, you can skip this section. But we hope you don’t.

Why This Is Not an Easy Question to Find an Answer To

We see at least two reasons why this is so:

  • Multiple regression has lived in the neighborhood a long time; logistic regression is a new kid on the block.
  • The articles and books we’ve read on comparisons of the two techniques are hard to understand.

Multiple regression is a longtime resident; logistic regression is a new kid on the block.

When World War II came along, there was a pressing need for rapid ways to assess the potential of young men (and some women) for the critical jobs that the military services were trying to fill. It was in this flurry of preparation that multiple regression began to see a great deal of practical application by behavioral scientists who had left their academic jobs and joined up for the duration. The theory behind multiple regression had been worked out much earlier in the century by geniuses like Ronald Fisher, Karl Pearson, and Harold Hotelling. But the method did not get much use until the war effort necessitated that use. The computational effort involved was just too forbidding.

Logistic regression is a different story. From the reading we’ve done, logistic regression got its early practical use in the world of medicine where biostatisticians were trying to predict binary outcomes like survived/did not survive, contracted disease/did not contract disease, had a coronary event/did not have a coronary event, and the like. It’s only been within the last fifteen or twenty years that logistic regression has found its way into the parlance of statisticians in the behavioral sciences.

These two paragraphs are a long way around of saying that logistic regression is (in our opinion) nowhere near as well vetted as is multiple regression by people like us in advancement who are interested in predicting behavior, especially giving behavior.

The articles and books we’ve read on comparisons of the two techniques are hard to understand.

Since I (Peter) was pushing to do this piece, John and I decided it would be my responsibility to do some searching of the more recent literature on logistic regression as it relates to the substance of this project.

To start off, I reread portions of texts I have accumulated over the years that focus on multiple regression as a general data analytic technique. Each text has a section on logistic regression. As I waded back into these sections, I asked myself: “Is what I’m reading here going to enlighten more than confuse the folks we have in mind for this piece?” Without exception, my answer was, “Nope, just the reverse.” There was altogether too much focus on complicated equations and theory and nowhere near enough emphasis on the practical use of logistic regression. (This, in spite of the fact that each text had an introduction assuring us the book would go light on math and heavy on application.)

Then, using my trusty iPad, I set about seeing what I could find on the web. Not surprisingly, I found a ton of articles (and even some full length books) that had found their way into the public domain. I downloaded a bunch of them to read whenever I could find enough time to dig into them. I’m sorry to report that each time I’d give one of these things a try, I would hear my father’s voice (dad graduated third in his class in engineering school) as he paged through my own science and math texts when I was in college: “They oughta teach the clowns who wrote these things to write in plain English.” (I always tried to use such comments as excuses for bad grades. Never worked.)

Levity aside, it is hard to find clearly written articles or books on the use of logistic versus multiple regression in the behavioral sciences. I think it’s a bad situation that needs fixing, but that fixing won’t occur anytime soon. On the other hand, I think dad was right not to let me off easy for giving up on badly written material. And you shouldn’t let my pessimism dissuade you from trying out some of these same articles and books. (If enough of you are interested, perhaps Kevin and John and I can put together a list of suggested readings.)

A Way to Compare Logistic Regression with Multiple Regression

As promised we’ll take you through a set of steps you can use with some of your own data:

  1. Pick a binary dependent variable and a set of predictors.
  2. Compute a predicted probability value for every record in your sample using both multiple regression and logistic regression.
  3. Draw three random subsamples of 20 records each from the total sample so that each subsample includes the predicted multiple regression probability value and the predicted logistic regression probability value for every record.
  4. Display each subsample of these records in a table and a graph.
  5. Do an eyeball comparison of the probability values in both the tables and the graphs.

1. Pick a binary dependent variable and a set of predictors.

For this example, we used a private four year institution with about 13,000 solicitable alums. Here are the variables we chose:

Dependent variable. Each alum who had given $31 or more lifetime was defined as 1; all others, who had given less than that amount, were defined as 0. There were 6,293 0’s and 6,204 1’s. Just about an even fifty/fifty split.

Predictor variables:

  • CLASS YEAR
  • SQUARE OF CLASS YEAR
  • EMAIL ADDRESS LISTED (YES/NO, 1=YES, 0=NO)
  • MARITAL STATUS (SINGLE =1, ALL OTHERS=0)
  • HOME PHONE LISTED (YES/NO, 1=YES, 0=NO)
  • UNIQUE ID NUMBER

Why did we use ID number as one of the predictors? Over the years we’ve found that many schools use all-numeric ID numbers. When these numbers are entered into a regression analysis, they often work as predictors. More importantly, they help to create very granular predicted scores that can easily be binned into equal size groups.

2. Compute a predicted probability value for every record in your sample using both multiple regression and logistic regression.

This is where things start to get a bit technical and where a little background reading on both multiple regression and logistic regression wouldn’t hurt. Again, most of the material you’ll find will be tough to decipher. Here we’ll keep it as simple as we can.

For both techniques the predicted value you want to generate is a probability, a number that varies between 0 and 1.  In this example, that value will represent the probability that a record has given $31 or more lifetime to the college.

Now here’s the rub: the logistic regression model will always generate a probability value that varies between 0 and 1. However, the multiple regression model will almost always generate a value that varies between something less than 0 (a negative number) and a number greater than 1. In fact, in this example the range of probability values for the logistic regression model extends from .037 to .948. The range of probability values for the multiple regression model extends from -.122 to 1.003.

(By the way, this is why so many statisticians advise the use of logistic regression over multiple regression when the dependent variable is binary. In essence they are saying, “A probability value can’t exceed 1 nor can it be less than 0. Since multiple regression often yields values less than 0 and greater than 1, use logistic regression.” To be fair, we’re exaggerating a bit, but not very much.)
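A sketch of this step, using invented column names in place of the predictors listed above: fit both models on the same inputs and compare the ranges of the two sets of predicted values. (This is not the authors’ actual code, just one way to do it with statsmodels.)

import pandas as pd
import statsmodels.api as sm

alums = pd.read_csv("alumni.csv")  # hypothetical columns referenced below
alums["gave_31_plus"] = (alums["lifetime_giving"] >= 31).astype(int)

# A few of the predictors from the list above; the others can be added the same way.
predictors = ["class_year", "has_email", "is_single", "has_home_phone"]
X = sm.add_constant(alums[predictors])
y = alums["gave_31_plus"]

mr_pred = sm.OLS(y, X).fit().predict(X)     # can dip below 0 or exceed 1
lr_pred = sm.Logit(y, X).fit().predict(X)   # always between 0 and 1

print("Multiple regression range:", mr_pred.min(), "to", mr_pred.max())
print("Logistic regression range:", lr_pred.min(), "to", lr_pred.max())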

3. Draw three random subsamples of 20 records each from the total sample so that each subsample includes the predicted multiple regression probability value and the predicted logistic regression probability value for all 20 records.

The size and number of these subsamples is, of course, arbitrary. We decided that three subsamples were better than two and that four or more would be overkill. Twenty records, as you’ll see a bit further on, is a number that allows you to see patterns in a table or graph without overcrowding the picture.

4. Display each subsample of these records in a table and a graph.

Tables 1-3 and Figures 1-3 below show how we took this step for our example. To make sure we’re being clear, let’s go through some of the details in Table 1 and Figure 1 (which we constructed for the first subsample of twenty randomly drawn records).

In Table 1 the probability values for multiple regression for each record are displayed in the left-hand column. The corresponding probability values for the same records for logistic regression are displayed in the right-hand column. For example, the multiple regression probability for the first record is .078827109. The record’s logistic regression probability is .098107437. In plain English, that means the multiple regression model for this example is saying that this particular alum has about eight chances in a hundred of giving $31 or more lifetime. The logistic regression model is saying that the same alum has about ten chances in a hundred of giving $31 or more lifetime.

Table 1: Predicted Probability Values Generated from Using Multiple Regression and Logistic Regression for the First of Three Randomly Drawn Subsamples of 20 Records

Figure 1 shows the pairs of values you see in Table 1 displayed graphically in a scatterplot. You’ll notice that the points in the scatterplot appear to fall along what roughly looks like a straight line. This means that the multiple regression model and the logistic regression model are assigning very similar probabilities to each of the 20 records in the subsample. If you study Table 1, you can see this trend, but the trend is much easier to discern in the scatter plot.

Table 2: Predicted Probability Values Generated from Using Multiple Regression and Logistic Regression for the Second of Three Randomly Drawn Subsamples of 20 Records

Table 3: Predicted Probability Values Generated from Using Multiple Regression and Logistic Regression for the Third of Three Randomly Drawn Subsamples of 20 Records

 

5. Do an eyeball comparison of the probability values in both the tables and the graphs.

We’ve already done such a comparison in Table 1 and Figure 1. If we do the same comparison for Tables 2 and 3 and for Figures 2 and 3, it’s pretty clear that we’ll come to the same conclusion: Multiple regression and logistic regression (for this example) are giving us very similar answers.

So Where Does This All Take Us?

We’d like to cover several topics in this closing section:

  • A frequent objection to using multiple regression versus logistic regression when the dependent variable is binary
  • Trying our approach on your own
  • The conclusion we think you’ll eventually arrive at
  • How we’ve just scratched the surface here

A frequent objection to using multiple regression versus logistic regression when the dependent variable is binary

Earlier we said that many statisticians seem to advise the use of logistic regression over multiple regression by invoking this logic: “A probability value can’t exceed 1 nor can it be less than 0. Since multiple regression often yields values less than 0 and greater than 1, use logistic regression.” We also said we were exaggerating the stance of these statisticians a bit (but not very much).

While we can understand this argument, our feeling is that, in the applied fields we toil in, that argument is not a very practical one. In fact a seasoned statistics professor we know says (in effect): “What’s the big deal? If multiple regression yields any predicted values less than 0, consider them 0. If multiple regression yields any values greater than 1, consider them 1. End of story.” We agree.
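That clamping advice amounts to a single line. A toy sketch, using the ranges reported earlier:

import numpy as np

# Toy multiple regression predictions; a couple fall outside the 0-to-1 range.
mr_pred = np.array([-0.122, 0.078, 0.45, 0.91, 1.003])

# Treat anything below 0 as 0 and anything above 1 as 1.
print(np.clip(mr_pred, 0, 1))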

Trying our approach on your own

In this piece we’ve shown the results of one comparison between multiple and logistic regression on one set of data. It’s clear that the results we got for the two techniques were very similar. But does that mean we’d get such similar results with other examples? Not necessarily.

So here’s what we’d recommend. Try doing your own comparisons of the two techniques with:

  • Different data sets. If you’re a higher education institution, you might pick a couple of data sets, one for alums who’ve been out for more than 25 years and one for folks who’ve been out less than 10 years. If you’re a non-profit, you can use a set of members from the west coast and one from the east coast.
  • Different variables. Try different binary dependent variables like those we mentioned earlier: major giver/non major giver, volunteer/non-volunteer, reunion attender/non-reunion attender, etc. And try different predictors. Try to mix categorical variables like marital status with quantitative variables like age. If you’re comfortable with more sophisticated stats, try throwing in cross products and exponential terms.
  • Different splits in the dependent variable. In our example the dependent variable was almost an exact 50/50 split. Since the underlying variable we used was quantitative (lifetime giving), we could have adjusted those splits in a number of ways: 60/40, 75/25, 80/20, 95/5, and on and on the list could go. Had we tried these different kinds of splits, would we have gotten the same kinds of results for the two techniques? Since we actually did look at different splits like these, we can report that the results for both techniques were pretty much the same. But that’s for this example. That could change with a different data set and different variables.

The conclusion we think you’ll eventually arrive at

We’re very serious about having you compare multiple regression and logistic regression on a variety of data sets with a variety of variables and with different splits in the dependent variable. If you do, you’ll learn a ton. Guaranteed.

On the other hand, if we put ourselves in your shoes, it’s easy to imagine your saying, “Come on guys. I’m not gonna do that. Just tell me what you think about which technique is better when the dependent variable is binary. Pick a winner.”

Given our experience, we can’t pick a winner. In fact, if pushed, we’re inclined to opt in favor of multiple regression for a couple of reasons. It not only seems to perform about as well as logistic regression, but more importantly (with the stats software we use) multiple regression is simply faster and easier to use than logistic regression. But we still use logistic regression for models with binary dependent variables. And we continue to compare its efficacy against multiple regression when we can. And we rarely see a meaningful difference between the results.

Why do we still use both modeling techniques? Because we think taking a hard and fast stance when you’re doing applied science is not a good idea. Too easy to end up with egg on your face. Our best advice is to use whichever method is most familiar and readily available to you.

As always, we welcome your comments and reactions. Maybe even more so with this one.
