# CoolData blog

## 29 June 2010

### Choosing the right flavour of regression

Filed under: Model building, regression, Statistics, Terminology, Uncategorized — kevinmacdonell @ 5:50 am


I use two types of regression analysis to build predictive models: multiple linear regression and binary logistic regression. Both are called “regression”, but they are very different animals. You can use either one to build a model, but which one is best for fundraising models?

The answer is that there is no best option that applies across the board. It depends on what you’re trying to predict, certainly, but even more so it depends on the data itself. The best option will not be obvious and will be revealed only in testing. I don’t mean to sound careless about proper statistical practice, but we work in the real world: It’s not so much a question of “Which tool is most appropriate?” as “Which tool WORKS?”

One of the primary differences between the two types of regression is the definition of the dependent variable. In logistic regression, this outcome variable is either 1 or 0. (There are other forms of logistic regression with multiple nominal outcomes, but I’ll stick to binary outcomes for now.) An example might be “Is a donor / Is not a donor,” or “Is a Planned Giving expectancy / Is not a Planned Giving expectancy.”

In multiple regression, the dependent variable is typically a continuous value, like giving expressed in real dollars (or log-transformed dollars). But the DV can also be a 0/1 value, just as in logistic regression. Technically using a binary variable violates one of the assumptions underlying multiple regression (a normal probability distribution of the DV), but that doesn’t necessarily invalidate the model as a powerful predictive tool. Again, what works?
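As a minimal sketch of that second case (Python with NumPy, on entirely made-up data; the predictors and coefficients are invented for illustration), here is ordinary least squares fitted to a 0/1 dependent variable. The normality assumption is technically violated, but the fitted values can still separate the two groups:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# hypothetical predictors: years since graduation, event attendances
years = rng.integers(1, 40, n).astype(float)
events = rng.poisson(1.5, n).astype(float)

# 0/1 DV: is a donor (synthetic, loosely tied to the predictors)
is_donor = (0.01 * years + 0.15 * events + rng.normal(0, 0.3, n)) > 0.5

# ordinary least squares on the binary DV -- this violates the
# normality assumption, but the fitted values still rank constituents
X = np.column_stack([np.ones(n), years, events])
beta, *_ = np.linalg.lstsq(X, is_donor.astype(float), rcond=None)
fitted = X @ beta

print("mean fitted value, donors:    ", round(fitted[is_donor].mean(), 3))
print("mean fitted value, non-donors:", round(fitted[~is_donor].mean(), 3))
```

On data with any real signal, the donors’ mean fitted value comes out higher than the non-donors’, which is all a ranking model needs.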

Another difference, less important to my mind, is that the output of a multiple regression analysis is a predicted value that reflects the units (say, dollars) of the DV and may be interpretable as such (predicted lifetime giving, predicted gift amount, etc.), while the output of a logistic regression is a probability value. My practice is to transform both sorts of outputs into scores (deciles and percentiles) for all individuals under study; this allows me to refer to both model outputs simply as “likelihood” and compare them directly.
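A toy version of that transformation might look like the following (Python; the model outputs are made up, and the helper `to_deciles` is my own name for it, not a standard function):

```python
import numpy as np

def to_deciles(predictions):
    """Convert raw model outputs to decile scores 1-10 (10 = most likely),
    so outputs in dollars and outputs in probabilities become comparable."""
    ranks = predictions.argsort().argsort()       # 0 = lowest prediction
    return ranks * 10 // len(predictions) + 1     # decile 1 through 10

# hypothetical outputs from two models scoring the same 20 constituents
dollars = np.random.default_rng(0).gamma(2.0, 50.0, 20)  # predicted giving, $
probs = np.random.default_rng(1).random(20)              # predicted probability

print(to_deciles(dollars))
print(to_deciles(probs))
```

Once both outputs are on the same 1–10 scale, “likelihood” means the same thing regardless of which flavour of regression produced it.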

So which to use? I say, use both! If you want some extra confidence in the worth of your model, it isn’t that much trouble to prepare both score sets and see how they compare. The key is having a set of holdout cases that represent the behaviour of interest. If your model is predicting likelihood to become a Planned Giving expectancy, you first set aside some portion of existing PG expectancies, build the model without them, then see how well the model performed at assigning scores to that holdout set.
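Here is a sketch of that holdout procedure on synthetic data (Python; the numbers, predictors, and cutoffs are all invented for illustration, not taken from a real PG model):

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical data: 1,000 constituents, 3 predictors, synthetic PG signal
n = 1000
X = rng.random((n, 3))
is_pg = (X @ np.array([2.0, 1.0, 0.5]) + rng.normal(0, 0.5, n)) > 2.4
pg_idx = np.flatnonzero(is_pg)

# set aside a third of the known expectancies as the holdout
holdout = rng.choice(pg_idx, size=len(pg_idx) // 3, replace=False)
train = np.setdiff1d(np.arange(n), holdout)

# build a simple linear model on the 0/1 DV, WITHOUT the holdout cases
y = is_pg.astype(float)
Xt = np.column_stack([np.ones(len(train)), X[train]])
beta, *_ = np.linalg.lstsq(Xt, y[train], rcond=None)

# score everyone, convert to percentile ranks, inspect the holdout cases
scores = np.column_stack([np.ones(n), X]) @ beta
pct = scores.argsort().argsort() * 100 // n
print("median percentile of holdout expectancies:", np.median(pct[holdout]))
```

If the model is any good, the holdout expectancies — which the model never saw — should land well above the 50th percentile.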

You can use this method of validation when you create only one model, too. But where do you set the bar for confidence in the model if you test only one? Having a rival model to compare with is very useful.

In my next post I will show you a real-world example, and explain how I decided which model worked best.

## 19 April 2010

### How high, R-squared?

Filed under: Model building, Pitfalls, regression, Terminology — kevinmacdonell @ 10:24 am

A target value for R-squared is not chiseled in stone.

A couple of years ago there was a discussion on the Prospect-DMM list about the perceived importance of the adjusted R-squared term in building predictive models using multiple regression. What’s the magic number that tells you your model is a good fit for your data?

R-squared is an overall measure of the success of a regression in predicting your dependent variable from your independent variable(s). Adjusted R-squared is the more commonly-cited statistic when you’re using multiple predictors, because it accounts for the number of predictors in the equation (it’s usually lower than your result for non-adjusted R-squared). Data Desk expresses R-squared as a percent, so .345 is the same as 34.5%.
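The adjustment is a simple formula, sketched here in Python (the example inputs — 1,000 cases, 15 predictors — are invented for illustration):

```python
def adjusted_r_squared(r2, n, k):
    """Adjust R-squared for the number of predictors k, given n cases.
    Standard formula: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# .345 (34.5% in Data Desk's notation) with 1,000 cases and 15 predictors
print(round(adjusted_r_squared(0.345, 1000, 15), 4))
```

The adjusted value is always a bit lower than the raw one, and the penalty grows as you pile on predictors — which is exactly why it’s the fairer statistic to cite for a multiple regression.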

R-squared sometimes gives rise to some mistaken ideas and strange claims, in my opinion. One idea is that when it comes to R-squared, you have to shoot for a very high result, the higher the better. And therefore any predictor variable is good to use as long as it increases R-squared. Perhaps as a result, you’ll sometimes encounter claims of R-squared values up around 60 percent (which I understand can happen) or even 80 percent.

Until I see some accounting for these results I’m taking them with a grain of salt. I’m thrilled when I see R-squared rise from 15% to 20%. My Phonathon model reached 25.4%, which I was very happy with. With a general giving model I can almost reach 40%. This tells me that my regression equation is accounting for, or “explaining”, about 40% of the variability in my DV. In this business, we’re making predictions about human behaviour, not the workings of physical systems, so to get to this level of insight from nothing is a big win.

If someone tells me they reached 40% for their model, I say that’s excellent. At 50%, though, I start to get suspicious. Anything beyond 60%, I just don’t buy at all.

What am I suspicious of? I’m suspicious that their independent variables are just stand-ins for their dependent variable. They are using ‘giving’ to predict ‘giving’ – a basic no-no. For example, I said earlier that my Phonathon model had an adjusted R-squared of 25.4%. Let’s say I create a new variable called ‘has giving’, and that I define this as an indicator variable, so that it has a value of 1 if the person has any giving via the Phonathon, and zero if not. When I put that variable into the regression as a predictor, my adjusted R-squared leaps from 25.4% to 93.0%!
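You can watch this kind of leakage happen on synthetic data (Python; every number here is made up to mimic the pattern, not drawn from my actual Phonathon model):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000

# hypothetical: two legitimate predictors, modestly related to giving
x1, x2 = rng.random(n), rng.random(n)
is_donor = rng.random(n) < 0.1 + 0.5 * x1 + 0.3 * x2
giving = np.where(is_donor, rng.normal(100.0, 25.0, n), 0.0)

def r_squared(X, y):
    """Plain (unadjusted) R-squared from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - np.var(y - X @ beta) / np.var(y)

honest = r_squared(np.column_stack([x1, x2]), giving)

# 'has_giving' is defined FROM the DV itself -- a proxy, not a predictor
has_giving = (giving > 0).astype(float)
leaky = r_squared(np.column_stack([x1, x2, has_giving]), giving)

print(f"honest R-squared: {honest:.3f}")
print(f"with proxy variable: {leaky:.3f}")
```

The honest model explains a modest share of the variance; add the proxy and R-squared rockets upward — impressive-looking, and completely useless for prediction.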

Fantastic, right? Wrong! What if you came up with an equation that stated “Y is equal to Y”? Would that be amazing? No. It’s true, but it’s not interesting, and it has no predictive value. It’s like walking into a dark alley at night and finding your way using a mirror instead of a flashlight.

It can be more subtle than that. An example … We put on a donor-recognition gala every spring, and we hand out awards to people who reach certain milestones for longevity of giving. Both gala attendance and award status are coded in our database. Combined, these codes pertain to only 1.5% of the population – several hundred individuals. Even though we would expect the effect to be small, adding these two variables as predictors boosts R-squared (adjusted) in my Phonathon model by a full percentage point, to 26.4%.

This is quite significant, considering that by this point my model is mature – it’s full as a tick with variables! But it’s not good news at all. I would never use gala attendance or award status to predict giving, because both variables are merely stand-ins for giving itself. (Peter Wylie refers to them as ‘proxy variables’.) If my model were predicting something else – a binary outcome for ‘major donor / not major donor’, say – then maybe you’d consider using one or both of them. But not when the DV is ‘giving’ itself.

If I take care to ensure that my independent variables are indeed not stand-ins for my dependent variable, then I’m going to get lower R-squared as a matter of course. There are all kinds of legitimate ways to obtain a more robust model. Non-anonymous surveying of a broad swath of alumni is one of the best. If you can add all kinds of current, attitude-based data to the historical data already present in the database, I figure you’ll have gone almost as far as possible in modeling this aspect of human behaviour without attaching electrodes to people’s heads. But don’t expect to fit your model to the 60% level; if you do, you’re probably making a big mistake.

You might ask, “Doesn’t this caution about non-independence of predictors apply to a lot of other variables?” For example, it may be that many of the business phone numbers you’ve got in your database are a result of a gift transaction, and therefore the variable is not independent of giving. This is a good point, and there are grounds for debate. I subscribe to the position taken by Josh Birkholz in his book, “Fundraising Analytics.” In his discussion of the issue on page 190, he draws the “use/don’t use” line between variables that exist solely because of the behaviour you’re predicting (e.g., giving), and variables that exist partially because of the behaviour. Where you’ll draw the line might differ from project to project.

Using ‘business phone present’ as an example: Do 100% of the ‘business phone present’ cases have giving? Probably not. Those numbers probably came from various sources over the years, and they hold genuine predictive power.

So, what’s the magic number that tells you your model is a good fit for your data? I don’t think there’s an answer to that question, because I don’t think you can compare different models using R-squared. My old general models used to reach nearly 40%, but my Phonathon models, which reach barely 25%, are FAR superior in their applicability to the task of predicting what we need to predict.

Use R-squared during your model-building to decide when to stop adding IVs to your regression, and then forget about it. If you want assurances about the effectiveness of your model (and you should), then test against a hold-out sample before you deploy your scores. And then after you deploy, mark a day on the calendar in the future when you will analyze how actual results break down by predictive score.
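That calendar-day analysis can be as simple as a cross-tab of actual results by score band. A toy sketch (Python; the deciles and gift outcomes below are synthetic stand-ins for what you’d pull from your database months after deployment):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# hypothetical follow-up: score deciles assigned at deployment,
# and whether each constituent actually gave in the months since
decile = rng.integers(1, 11, n)
gave = rng.random(n) < decile / 20.0  # synthetic: higher decile, higher rate

for d in range(10, 0, -1):
    rate = gave[decile == d].mean()
    print(f"decile {d:2d}: {rate:.0%} gave")
```

If the model is working, participation should climb steadily from the bottom decile to the top — and that table is far more persuasive to colleagues than any R-squared ever will be.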

## 25 February 2010

### Data mining and predictive modeling, what’s the difference?

Filed under: Terminology — kevinmacdonell @ 1:28 pm

I use these terms interchangeably, but not because they mean exactly the same thing. When I refer to “data mining,” I’m usually just trying to use a term that sounds familiar to an audience. It’s a buzzword that’s been around a long time. But what I probably mean to say is predictive modeling.

What’s the difference? There are plenty of definitions available for both terms, but in my regular usage I think of data mining as any activity that involves exploring large data sets for patterns or to answer specific questions (which may or may not have anything to do with predicting behaviour). For example, the work that annual giving managers do when they use certain criteria to allocate alumni to by-mail or phone channels, or create a myriad of calling groups for phonathon, counts as data mining, as far as I’m concerned. This work might be done right in the database, in a spreadsheet, or with statistical software.

I like to be able to tell people who are new to predictive modeling that they probably already “do” data mining, if they plow through data as part of their regular work. They’re just a conceptual step or two away from understanding predictive modeling.

Data mining might also be the right term to describe the exploration of variables for correlation with giving, which naturally shades into the actual creation of predictive models for giving. Predictive modeling itself, though, is the creation of formulas that produce scores for each constituent in a database for the purpose of predicting that constituent’s probability of engaging in a certain behaviour (e.g., giving to the Annual Fund).

That’s a clunky definition, and it sounds really complicated. But keep in mind that the tools we use to accomplish this (a computer, statistical software, and statistical methods such as regression) do all the work, and we never need to see the actual formula or the underlying math. Our main tasks are to ensure the quality and relevance of the data, determine exactly what we’re trying to predict, choose our predictors using some common sense, and then finally export the predicted scores that result from the analysis (and then, preferably, load them into our database).

These thoughts about terminology were sparked by a piece written by Tonya Balan, manager of the analytics product management team for SAS. As I said, there are definitions for this stuff all over the web, but Balan does a nice job of drawing distinctions between all the terms we often hear thrown around: analytics, data mining, predictive modeling, predictive analytics, forecasting and so on.