by Peter Wylie, John Sammis and Kevin MacDonell
The three of us talk about this issue a lot because we encounter a number of situations in our work where we need to choose between these two techniques. Many of our late night/early morning phone/internet discussions have been gobbled up by talking about which technique seems to be better under what circumstances. More than a few times, I’ve suggested we write something up about our experience with both techniques. In the end we’ve always decided to put off doing that because … well, because we’ve thought it might put a lot of people to sleep. Disagree as we might about lots of things, we’re of one mind on the dictum: “Don’t bore people.” They have enough tedious stuff in their lives; we don’t need to add to their burden.
On the other hand, as analytics has started to sink its teeth more and more into the world of advancement, it seems there is a group of folks out there who wrestle with the same issue. And the issue seems to be this:
“If I have a binary dependent variable (e.g., major giver/ non major giver, volunteer/non-volunteer, reunion attender/non-reunion attender, etc.), which technique should I use? Logistic regression or multiple regression?”
We considered a number of ways to try to answer this question:
- We could simply assert an opinion based on our bank of experience with both techniques.
- We could show you the results of a number of data sets using both techniques and then offer our opinion.
- We could show you a way to compare both techniques using some of your own data.
We chose the third option because we think there is no better way to learn about a statistical technique than by using the technique on real data. Whenever we’ve done this sort of exploring ourselves, we’ve been humbled by how much we’ve learned.
Before we show you a way to compare the two techniques, we’ll offer some thoughts on why this question (“Should I use logistic regression or multiple regression?”) is so tough to find an answer to. If you’re anxious to move on to our comparison process, you can skip this section. But we hope you don’t.
Why This Is Not an Easy Question to Find an Answer To
We see at least two reasons why this is so:
- Multiple regression has lived in the neighborhood a long time; logistic regression is a new kid on the block.
- The articles and books we’ve read on comparisons of the two techniques are hard to understand.
Multiple regression is a longtime resident; logistic regression is a new kid on the block.
When World War II came along, there was a pressing need for rapid ways to assess the potential of young men (and some women) for the critical jobs that the military services were trying to fill. It was in this flurry of preparation that multiple regression began to see a great deal of practical application by behavioral scientists who had left their academic jobs and joined up for the duration. The theory behind multiple regression had been worked out much earlier in the century by geniuses like Ronald Fisher, Karl Pearson, and Harold Hotelling. But the method saw little use until the war effort necessitated it; the computational effort involved was simply too forbidding.
Logistic regression is a different story. From the reading we’ve done, logistic regression got its early practical use in the world of medicine where biostatisticians were trying to predict binary outcomes like survived/did not survive, contracted disease/did not contract disease, had a coronary event/did not have a coronary event, and the like. It’s only been within the last fifteen or twenty years that logistic regression has found its way into the parlance of statisticians in the behavioral sciences.
These two paragraphs are a roundabout way of saying that logistic regression is (in our opinion) nowhere near as well vetted as is multiple regression by people like us in advancement who are interested in predicting behavior, especially giving behavior.
The articles and books we’ve read on comparisons of the two techniques are hard to understand.
Since I (Peter) was pushing to do this piece, John and I decided it would be my responsibility to do some searching of the more recent literature on logistic regression as it relates to the substance of this project.
To start off, I reread portions of texts I have accumulated over the years that focus on multiple regression as a general data analytic technique. Each text has a section on logistic regression. As I waded back into these sections, I asked myself: “Is what I’m reading here going to enlighten more than confuse the folks we have in mind for this piece?” Without exception, my answer was, “Nope, just the reverse.” There was altogether too much focus on complicated equations and theory and nowhere near enough emphasis on the practical use of logistic regression. (This, in spite of the fact that each text had an introduction assuring us the book would go light on math and heavy on application.)
Then, using my trusty iPad, I set about seeing what I could find on the web. Not surprisingly, I found a ton of articles (and even some full length books) that had found their way into the public domain. I downloaded a bunch of them to read whenever I could find enough time to dig into them. I’m sorry to report that each time I’d give one of these things a try, I would hear my father’s voice (dad graduated third in his class in engineering school) as he paged through my own science and math texts when I was in college: “They oughta teach the clowns who wrote these things to write in plain English.” (I always tried to use such comments as excuses for bad grades. Never worked.)
Levity aside, it is hard to find clearly written articles or books on the use of logistic versus multiple regression in the behavioral sciences. I think it’s a bad situation that needs fixing, but that fixing won’t occur anytime soon. On the other hand, I think dad was right not to let me off easy for giving up on badly written material. And you shouldn’t let my pessimism dissuade you from trying out some of these same articles and books. (If enough of you are interested, perhaps Kevin and John and I can put together a list of suggested readings.)
A Way to Compare Logistic Regression with Multiple Regression
As promised we’ll take you through a set of steps you can use with some of your own data:
- Pick a binary dependent variable and a set of predictors.
- Compute a predicted probability value for every record in your sample using both multiple regression and logistic regression.
- Draw three random subsamples of 20 records each from the total sample so that each subsample includes the predicted multiple regression probability value and the predicted logistic regression probability value for every record.
- Display each subsample of these records in a table and a graph.
- Do an eyeball comparison of the probability values in both the tables and the graphs.
1. Pick a binary dependent variable and a set of predictors.
For this example, we used a private four-year institution with about 13,000 solicitable alums. Here are the variables we chose:
Dependent variable. Each alum who had given $31 or more lifetime was defined as 1; all others, who had given less than that amount, were defined as 0. There were 6,293 0’s and 6,204 1’s. Just about an even fifty/fifty split.
Predictor variables:
- CLASS YEAR
- SQUARE OF CLASS YEAR
- EMAIL ADDRESS LISTED (YES/NO, 1=YES, 0=NO)
- MARITAL STATUS (SINGLE =1, ALL OTHERS=0)
- HOME PHONE LISTED (YES/NO, 1=YES, 0=NO)
- UNIQUE ID NUMBER
Why did we use ID number as one of the predictors? Over the years we’ve found that many schools use all-numeric ID numbers. When these numbers are entered into a regression analysis, they often work as predictors. More importantly, they help to create very granular predicted scores that can easily be binned into equal-size groups.
2. Compute a predicted probability value for every record in your sample using both multiple regression and logistic regression.
This is where things start to get a bit technical and where a little background reading on both multiple regression and logistic regression wouldn’t hurt. Again, most of the material you’ll find will be tough to decipher. Here we’ll keep it as simple as we can.
For both techniques the predicted value you want to generate is a probability, a number that varies between 0 and 1. In this example, that value will represent the probability that a record has given $31 or more lifetime to the college.
Now here’s the rub: the logistic regression model will always generate a probability value that varies between 0 and 1. However, the multiple regression model will almost always generate values that range from something less than 0 (a negative number) to something greater than 1. In fact, in this example the range of probability values for the logistic regression model extends from .037 to .948; the range for the multiple regression model extends from -.122 to 1.003.
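Why the difference? The logistic model passes the familiar linear combination of predictors through the logistic function, which squashes any number, however large or small, into the 0-to-1 range:

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots + b_k x_k)}}$$

Multiple regression uses the linear combination $b_0 + b_1 x_1 + \cdots + b_k x_k$ by itself, and a straight line is free to wander below 0 and above 1.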
(By the way, this is why so many statisticians advise the use of logistic regression over multiple regression when the dependent variable is binary. In essence they are saying, “A probability value can’t exceed 1 nor can it be less than 0. Since multiple regression often yields values less than 0 and greater than 1, use logistic regression.” To be fair, we’re exaggerating a bit, but not very much.)
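The mechanics of this step depend on your stats package, but if you happen to work in Python, here is a minimal sketch using statsmodels. The file name and column names are hypothetical stand-ins for the variables listed above.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data frame: one row per alum, with the predictors listed above
# and a 0/1 flag for lifetime giving of $31 or more.
df = pd.read_csv("alums.csv")
df["class_year_sq"] = df["class_year"] ** 2

predictors = ["class_year", "class_year_sq", "email_listed",
              "single", "home_phone_listed", "unique_id"]
X = sm.add_constant(df[predictors])  # add the intercept term
y = df["gave_31_plus"]               # binary dependent variable

# Multiple (OLS) regression: predicted values serve directly as probabilities,
# even though they can stray outside the 0-1 range.
df["mr_prob"] = sm.OLS(y, X).fit().predict(X)

# Logistic regression: predicted values are true probabilities, always in (0, 1).
df["lr_prob"] = sm.Logit(y, X).fit().predict(X)

# Compare the ranges of the two sets of predicted values.
print(df[["mr_prob", "lr_prob"]].agg(["min", "max"]))
```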
3. Draw three random subsamples of 20 records each from the total sample so that each subsample includes the predicted multiple regression probability value and the predicted logistic regression probability value for all 20 records.
The size and number of these subsamples is, of course, arbitrary. We decided that three subsamples were better than two and that four or more would be overkill. Twenty records, as you’ll see a bit further on, is a number that allows you to see patterns in a table or graph without overcrowding the picture.
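In the same hypothetical Python setup, this step is a one-liner per subsample with pandas; the seed values are arbitrary.

```python
# Each 20-record subsample carries both predicted probabilities,
# since they were stored as columns on the data frame above.
subsamples = [df.sample(n=20, random_state=seed) for seed in (1, 2, 3)]
```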
4. Display each subsample of these records in a table and a graph.
Tables 1-3 and Figures 1-3 below show how we took this step for our example. To make sure we’re being clear, let’s go through some of the details in Table 1 and Figure 1 (which we constructed for the first subsample of twenty randomly drawn records).
In Table 1 the probability values for multiple regression for each record are displayed in the left-hand column. The corresponding probability values for the same records for logistic regression are displayed in the right-hand column. For example, the multiple regression probability for the first record is .078827109. The record’s logistic regression probability is .098107437. In plain English, that means the multiple regression model for this example is saying that this particular alum has about eight chances in a hundred of giving $31 or more lifetime. The logistic regression model is saying that the same alum has about ten chances in a hundred of giving $31 or more lifetime.
Table 1: Predicted Probability Values Generated from Using Multiple Regression and Logistic Regression for the First of Three Randomly Drawn Subsamples of 20 Records
Figure 1 shows the pairs of values you see in Table 1 displayed graphically in a scatterplot. You’ll notice that the points in the scatterplot appear to fall along what roughly looks like a straight line. This means that the multiple regression model and the logistic regression model are assigning very similar probabilities to each of the 20 records in the subsample. If you study Table 1, you can see this trend, but the trend is much easier to discern in the scatterplot.
Table 2: Predicted Probability Values Generated from Using Multiple Regression and Logistic Regression for the Second of Three Randomly Drawn Subsamples of 20 Records
Table 3: Predicted Probability Values Generated from Using Multiple Regression and Logistic Regression for the Third of Three Randomly Drawn Subsamples of 20 Records
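If you’re following along in Python, here is a rough sketch of producing tables and scatterplots like these, continuing from the earlier snippets.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharex=True, sharey=True)
for i, (ax, sub) in enumerate(zip(axes, subsamples), start=1):
    # Table: the two probability columns side by side for one subsample.
    print(f"Subsample {i}:")
    print(sub[["mr_prob", "lr_prob"]].to_string(index=False))
    # Figure: scatterplot of one probability against the other.
    ax.scatter(sub["mr_prob"], sub["lr_prob"])
    ax.set_xlabel("Multiple regression probability")
    ax.set_title(f"Subsample {i}")
axes[0].set_ylabel("Logistic regression probability")
plt.tight_layout()
plt.show()
```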
5. Do an eyeball comparison of the probability values in both the tables and the graphs.
We’ve already done such a comparison in Table 1 and Figure 1. If we do the same comparison for Tables 2 and 3 and for Figures 2 and 3, it’s pretty clear that we’ll come to the same conclusion: Multiple regression and logistic regression (for this example) are giving us very similar answers.
So Where Does This All Take Us?
We’d like to cover several topics in this closing section:
- A frequent objection to using multiple regression versus logistic regression when the dependent variable is binary
- Trying our approach on your own
- The conclusion we think you’ll eventually arrive at
- How we’ve just scratched the surface here
A frequent objection to using multiple regression versus logistic regression when the dependent variable is binary
Earlier we said that many statisticians seem to advise the use of logistic regression over multiple regression by invoking this logic: “A probability value can’t exceed 1 nor can it be less than 0. Since multiple regression often yields values less than 0 and greater than 1, use logistic regression.” We also said we were exaggerating the stance of these statisticians a bit (but not very much).
While we can understand this argument, our feeling is that, in the applied fields we toil in, it is not a very practical one. In fact, a seasoned statistics professor we know says (in effect): “What’s the big deal? If multiple regression yields any predicted values less than 0, consider them 0. If it yields any values greater than 1, consider them 1. End of story.” We agree.
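In code, the professor’s fix amounts to a single line; here is a sketch with numpy, applied to the multiple regression values from the earlier snippet.

```python
import numpy as np

# Treat any predicted value below 0 as 0 and any above 1 as 1.
clipped = np.clip(df["mr_prob"], 0, 1)
```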
Trying our approach on your own
In this piece we’ve shown the results of one comparison between multiple and logistic regression on one set of data. It’s clear that the results we got for the two techniques were very similar. But does that mean we’d get such similar results with other examples? Not necessarily.
So here’s what we’d recommend. Try doing your own comparisons of the two techniques with:
- Different data sets. If you’re a higher education institution, you might pick a couple of data sets, one for alums who’ve been out for more than 25 years and one for folks who’ve been out less than 10 years. If you’re a non-profit, you might use a set of members from the West Coast and one from the East Coast.
- Different variables. Try different binary dependent variables like those we mentioned earlier: major giver/non major giver, volunteer/non-volunteer, reunion attender/non-reunion attender, etc. And try different predictors. Try to mix categorical variables like marital status with quantitative variables like age. If you’re comfortable with more sophisticated stats, try throwing in cross products and exponential terms.
- Different splits in the dependent variable. In our example, the dependent variable was almost an exact 50/50 split. Since the underlying variable we used was quantitative (lifetime giving), we could have adjusted that split in a number of ways: 60/40, 75/25, 80/20, 95/5, and on and on. Would these different kinds of splits have given the same kinds of results for the two techniques? Since we actually did look at different splits like these, we can report that the results for both techniques were pretty much the same. But that’s for this example; it could change with a different data set and different variables. (A sketch of how to derive such splits appears after this list.)
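Here is a minimal sketch of deriving such splits in the same hypothetical Python setup; the lifetime_giving column is an assumption.

```python
# Derive alternative binary dependent variables from quantiles of lifetime
# giving: a 50/50 split, a 60/40 split, a 95/5 split, and so on.
for q in (0.50, 0.60, 0.75, 0.80, 0.95):
    cutoff = df["lifetime_giving"].quantile(q)
    df[f"giver_top_{int(round((1 - q) * 100))}"] = (
        df["lifetime_giving"] >= cutoff
    ).astype(int)
```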
The conclusion we think you’ll eventually arrive at
We’re very serious about having you compare multiple regression and logistic regression on a variety of data sets with a variety of variables and with different splits in the dependent variable. If you do, you’ll learn a ton. Guaranteed.
On the other hand, if we put ourselves in your shoes, it’s easy to imagine your saying, “Come on guys. I’m not gonna do that. Just tell me what you think about which technique is better when the dependent variable is binary. Pick a winner.”
Given our experience, we can’t pick a winner. If pushed, we’re inclined to opt in favor of multiple regression, for a couple of reasons: it not only seems to perform about as well as logistic regression, but more importantly (with the stats software we use) it is simply faster and easier to use. But we still use logistic regression for models with binary dependent variables. And we continue to compare its efficacy against multiple regression when we can. And we rarely see a meaningful difference between the results.
Why do we still use both modeling techniques? Because we think taking a hard and fast stance when you’re doing applied science is not a good idea. Too easy to end up with egg on your face. Our best advice is to use whichever method is most familiar and readily available to you.
As always, we welcome your comments and reactions. Maybe even more so with this one.
Interesting exercise and I agree it probably won’t often make a difference. However I’m still not sure as to why you wouldn’t use logistic regression in these cases – that is kind of what it’s designed for, so I’m not sure I buy the argument that it’s just a bit easier to use multiple regression…it’s not that hard to do really…. and if you do get cases where the multiple regression does return values outside 0-1, won’t forcing them to 0 and 1 look a bit odd, flattening the tails off? After all 0.05 isn’t 0… it does imply some probability. Just out of interest how do you interpret the ID number as a dependent variable – it may be significant but what does it mean?
Comment by Tom Lloyd (@metametricsltd) — 20 August 2012 @ 6:11 am
Hi Tom, thanks for the comment. Peter and I agree with you that for those folks who are comfortable using logistic regression they should use logistic regression. We encounter many people who are not comfortable with logistic regression, but have some facility with multiple regression. Those people should feel confident that they can produce good models with multiple regression.
In terms of the predicted probabilities outside the range, we teach our clients to bin the probabilities into Score Levels so we don’t have the bumping issues.
We sometimes use Unique Database IDs as an independent variable, not a dependent variable. Because we are building a predictive model and not a causal model, we aren’t typically concerned with understanding why a variable is significant. We don’t dig into each independent variable deeply enough to ensure that we are not using variables that are proxies for the dependent variable. Nevertheless, one can imagine that the Unique Database ID variable could hold some information about order of entry into the database.
Comment by John Sammis — 21 August 2012 @ 6:11 pm
Very interesting blog session. Even if the law of large numbers and countless master’s theses have shown that you can apply continuous methodology to binary data, I was impressed by how close the results were in terms of fit when playing with real data (cool CoolData blog!). I would, however, make the following comments. First, it was assumed that the same covariates would appear in both models (for comparison purposes). This, however, assumes that you already had a parsimonious model in mind. I wonder whether the two approaches would arrive at the same covariates, or whether different covariates would creep in depending on the approach. One advantage of logistic regression modeling is that you can easily derive odds ratios, which you cannot in linear regression (without making assumptions and modifications). In exchange, and unlike others, I like the fact that the linear approach permits probabilities to fall outside of the [0, 1] universe, as it speaks to certainty. Under this framework, I would not impute zero for negative probabilities, nor 1 for probabilities greater than 1 (a sign of an increase in contribution, etc.). For example (on a related but different subject), negative net worth entities usually have much more wealth than their equivalent positive net worth entities.
One final note. When looking at your 3 graphs, the relationship between the two probabilities is clearly linear and directly proportional (again suggesting substitution between approaches), with exceptions near zero and at high probabilities (above roughly 70%), where there are signs of non-linearity and saturation. Given that advancement services are usually much more interested in the extremes (small and high probabilities), one would maybe have to be careful about the choice of approach.
Cheers!
jsb
Comment by jsb — 21 August 2012 @ 8:25 am
Really do miss the advancement arena!
Comment by jsb — 21 August 2012 @ 8:34 am
Like jsb, when building a model from the ground up, I would be concerned that predictors might not behave the same depending on whether you use multiple regression, or logistic regression. After all, the increase in probability of a 1 over a 0 tends to change logarithmically, not linearly, to the extent that its relationship with a covariate is significant. Hence if you plot that jump in probability, it looks more like an S than a diagonal line.
That being said, the example shown here is pretty cool in that it does argue that the final results could be more or less the same. Once I see enough evidence across the board that this is the case, then maybe I’ll be convinced. Until then, I’m sticking with logistic regression to predict a binary dependent variable 🙂
Comment by inkhorn82 — 21 August 2012 @ 1:35 pm
I found a good example of a situation where linear modeling is much more inappropriate than logistic modeling. Check out the following picture link:
See how there’s really no big jump in the probability of a 1 until a value of 7 on the X axis, and no more significant jumping in probability past a value of 14. What would happen if you modeled event probability using OLS? Well, I think you’d get an overestimation at many of the earlier values of the X variable, and an underestimation at many of the later values.
Comment by inkhorn82 — 21 August 2012 @ 9:13 pm
I think you are quite right about over- and under-estimation using linear modelling in predicting probability of a binary event. But so much of our work involves ranking individuals (prospects), from most likely to engage in the behaviour of interest to least likely. It is that ranking which essentially does not change from one method to the other.
Comment by kevinmacdonell — 27 August 2012 @ 5:00 am
“…logistic regression is (in our opinion) nowhere near as well vetted as is multiple regression by people like us in advancement who are interested in predicting behavior, especially giving behavior.”
Fitting lines to data by minimizing squared error dates back to Gauss in 1795. Linear regression using absolute errors is actually even older than that, having been introduced by Boscovich in 1757. The logistic curve was originally used in 1838 by Quetelet and Verhulst, for fitting population proportions. Naturally, considerable theoretical and practical development has emerged for both functional forms since.
I can only imagine what you think of things like neural networks or kernel regression.
Comment by Will Dwinnell — 27 August 2012 @ 9:28 am
Thanks Will. You’re confusing a comment about practitioners’ use of techniques with a critique of the technique itself. I defer to Peter and John on the details, but my take is that we’re not saying that logistic regression doesn’t have a history or hasn’t been thoroughly worked out. We’re saying that if anyone working in higher ed advancement or fundraising is familiar with predictive modeling, the tool they probably understand best is multiple linear regression.

The only advantage in learning to use other, less well-understood methods would be if they offered vastly superior predictive power in a practical setting. If there are any studies that indicate logistic regression or any other method yields superior (or even simply different) results, we’d love to see them. The array of available tools is rich, but evidence is lacking that the everyday user without an advanced degree in statistics attempting to solve practical business problems needs to learn these tools. As we’ve already said, if the user is familiar with and more comfortable with using logistic regression, then they should go ahead and do so. And if the user is already an expert, then they’re not reading this blog.

Our aim is to make advanced methods more accessible to non-experts, most of whom do not have the time or inclination to delve into methods that have specific, “appropriate” applications but which yield marginal benefits in a practical sense. All that said, no one with the aptitude, time and interest should be discouraged from exploring and learning as much as possible about what advantages may be gained from trying other approaches. Only, let’s be sure there ARE real and measurable advantages before we advocate for one method over another.
Comment by kevinmacdonell — 27 August 2012 @ 10:39 am
Hmmm. Interesting that the two methods sometimes give similar results. Nevertheless, I can’t endorse the idea that it’s OK to use OLS regression for binary dependent variables. Not only does OLS produce ridiculous results (which you address), but the assumptions of the model are grossly violated. And I see that someone gave you some examples where linear doesn’t work well.
It’s certainly true that many people know OLS and don’t know logistic. But so? That’s why there are experts.
Comment by Peter Flom (@peterflomStat) — 4 September 2012 @ 7:21 am
Hi Peter, and thanks … In fact, rather than “sometimes” giving similar results, the two methods frequently give similar results, when cases are ranked. That last part is key. I think what a lot of people are getting hung up on is the form of the output from these two very different methods. In the predictive modelling most frequently carried out by the authors of this piece, the final output is not a probability value or any type of specific unit such as size of gift in dollars. What we do is take the raw output (whatever it is) and use it to rank all the cases in order from most likely to least likely. When you compare rankings, in the form of percentiles or deciles, the results are similar enough to be interchangeable.
I routinely create two models for every application — one multiple linear (OLS) regression, the other binary logistic regression — and whenever I compare the rankings, I find either model would do just about as well. That’s often true even when I modify the DV so that the MLR does not have a binary variable but is a measure of the same behaviour that the binary variable is a flag for. The models handle independent variables differently, but the end result is, for practical purposes, about the same. I usually also create a hybrid score in an attempt to merge the strengths of both models, and although this approach seems to improve accuracy at the high end (for applications such as identifying relatively rare behaviours such as major giving), the overall result is not much different — which serves just fine for segmenting large populations for, say, Annual Giving.
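For anyone who wants to see what that ranking comparison looks like in code, here is a minimal sketch in Python/pandas, assuming two columns of model scores like the mr_prob and lr_prob columns from the post above.

```python
import pandas as pd

# Hypothetical: two sets of model scores for the same records.
scores = df[["mr_prob", "lr_prob"]].copy()

# Bin each model's scores into deciles (0 = least likely, 9 = most likely);
# rank(method="first") breaks ties so every decile has the same size.
for col in ("mr_prob", "lr_prob"):
    scores[col + "_decile"] = pd.qcut(
        scores[col].rank(method="first"), 10, labels=False
    )

# How often do the two models place a record in the same decile?
same = (scores["mr_prob_decile"] == scores["lr_prob_decile"]).mean()
print(f"Same decile for {same:.0%} of records")

# Rank-order agreement across every record (1.0 = identical ranking).
print("Spearman correlation:",
      scores["mr_prob"].corr(scores["lr_prob"], method="spearman"))
```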
I should add that in the work we do, the assumptions of multiple linear regression are also “grossly violated” — routinely. The residuals are never normally distributed, not with the data we’ve got. But the model scores are perennially useful (to varying degrees), with the result that nonprofits save money that would otherwise be wasted on useless appeals and so on. I think you’d agree that an organization would probably choose saving money by being guided by a model (however flawed) over flying blind — eschewing modelling only because certain assumptions could not be met.
Sometimes I think it would be helpful for predictive modellers to adopt an entirely different jargon from that of professional statisticians. We encourage fundraising and advancement non-statisticians to use powerful tools such as regression because a little knowledge is far more lucrative than it is dangerous. These non-experts should not imagine that a surface knowledge of statistical tools allows them to design drug trials or psychology experiments, but so what? No one’s going to let them.
That’s why I feel conflicted about the subject of “experts,” now that you bring it up. Our field could certainly use more experts, but what we need even more than experts are lots and lots of newbies delving into their own data, learning rapidly and iteratively through one flawed analysis after another until they are creating real value for their organizations — undaunted by the warnings. There are other fields that have greater need for experts, and are able to pay them what they are worth.
Comment by kevinmacdonell — 5 September 2012 @ 11:26 am
[…] to all of you who read and commented on our recent paper comparing logistic regression with multiple regression. We were not sure how popular this topic would be, but Kevin told us that interest was high, and […]
Pingback by Logistic vs. multiple regression: Our response to comments « CoolData blog — 10 October 2012 @ 9:05 am