# CoolData blog

## 29 June 2010

### Choosing the right flavour of regression

Filed under: Model building, regression, Statistics, Terminology, Uncategorized — Tags: , , — kevinmacdonell @ 5:50 am

(Creative Commons license. Click image for source.)

I use two types of regression analysis to build predictive models: multiple linear regression and binary logistic regression. Both are called “regression”, but they are very different animals. You can use either one to build a model, but which one is best for fundraising models?

The answer is that there is no best option that applies across the board. It depends on what you’re trying to predict, certainly, but even more so it depends on the data itself. The best option will not be obvious and will be revealed only in testing. I don’t mean to sound careless about proper statistical practice, but we work in the real world: It’s not so much a question of “which tool is most appropriate?” as “Which tool WORKS?”

One of the primary differences between the two types of regression is the definition of the dependent variable. In logistic regression, this outcome variable is either 1 or 0. (There are other forms of logistic regression with multiple nominal outcomes, but I’ll stick to binary outcomes for now.) An example might be “Is a donor / Is not a donor,” or “Is a Planned Giving expectancy/ Is not a planned giving expectancy.”

In multiple regression, the dependent variable is typically a continuous value, like giving expressed in real dollars (or log-transformed dollars). But the DV can also be a 0/1 value, just as in logistic regression. Technically using a binary variable violates one of the assumptions underlying multiple regression (a normal probability distribution of the DV), but that doesn’t necessarily invalidate the model as a powerful predictive tool. Again, what works?

Another difference, less important to my mind, is that the output of a multiple regression analysis is a predicted value that reflects the units (say, dollars) of the DV and may be interpretable as such (predicted lifetime giving, predicted gift amount, etc.), while the output of a logistic regression is a probability value. My practice is to transform both sorts of outputs into scores (deciles and percentiles) for all individuals under study; this allows me to refer to both model outputs simply as “likelihood” and compare them directly.

So which to use? I say, use both! If you want some extra confidence in the worth of your model, it isn’t that much trouble to prepare both score sets and see how they compare. The key is having a set of holdout cases that represent the behaviour of interest. If your model is predicting likelihood to become a Planned Giving expectancy, you first set aside some portion of existing PG expectancies, build the model without them, then see how well the model performed at assigning scores to that holdout set.

You can use this method of validation when you create only one model, too. But where do you set the bar for confidence in the model if you test only one? Having a rival model to compare with is very useful.

In my next post I will show you a real-world example, and explain how I decided which model worked best.

1. hey, Kevin, my VP of Operations saw your blog mentioned in CASE Currents (July/Aug) and pointed it out to me. You rock!

regarding your post, I have never used multiple regression, only logistical, but see your point exactly. I can see where there might be value in attempting to use multiple regression for some of our analyses.

Comment by Diane — 30 June 2010 @ 11:00 am

• Hi Diane – Aw, shucks! I’ve been looking around for the current issue; your comment is the first I hear that I made it into print, so thanks for that!

There are many paths in analytics … I learned about multiple regression first and logistic second. And there is always so much more to learn.

Comment by kevinmacdonell — 30 June 2010 @ 11:14 am