Hi Peter, and thanks … In fact, rather than “sometimes” giving similar results, the two methods *frequently* give similar results, when cases are ranked. That last part is key. I think what a lot of people are getting hung up on is the form of the output from these two very different methods. In the predictive modelling most frequently carried out by the authors of this piece, the final output is *not* a probability value or any type of specific unit such as size of gift in dollars. What we do is take the raw output (whatever it is) and use it to rank all the cases in order from most likely to least likely. When you compare rankings, in the form of percentiles or deciles, the results are similar enough to be interchangeable.

I routinely create two models for every application — one multiple linear (OLS) regression, the other binary logistic regression — and whenever I compare the rankings, I find either model would do just about as well. That’s often true even when I modify the DV so that the MLR does not have a binary variable but is a measure of the same behaviour that the binary variable is a flag for. The models handle independent variables differently, but the end result is, for practical purposes, about the same. I usually also create a hybrid score in an attempt to merge the strengths of both models, and although this approach seems to improve accuracy at the high end (for applications such as identifying relatively rare behaviours such as major giving), the overall result is not much different — which serves just fine for segmenting large populations for, say, Annual Giving.
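For anyone who wants to try this comparison themselves, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the synthetic data, the two predictors `x1` and `x2`, the coefficients used to generate the binary outcome, and the helper names. It is not the workflow described above, only one way to fit both model types to the same binary DV and compare the resulting rankings.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def fit_logistic(rows, y, steps=25):
    """Logistic regression coefficients via Newton-Raphson updates."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(steps):
        p = [1 / (1 + math.exp(-dot(r, beta))) for r in rows]
        grad = [sum(r[i] * (t - q) for r, t, q in zip(rows, y, p)) for i in range(k)]
        hess = [[sum(r[i] * r[j] * q * (1 - q) for r, q in zip(rows, p))
                 for j in range(k)] for i in range(k)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def ranks(values):
    """Rank 0 is the lowest score (no tie handling; scores are continuous)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order):
        out[i] = rank
    return out

# Synthetic data (hypothetical): a fairly rare binary behaviour, two predictors.
random.seed(0)
n = 2000
rows, y = [], []
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    p_true = 1 / (1 + math.exp(-(-2.0 + 1.0 * x1 + 0.5 * x2)))
    rows.append([1.0, x1, x2])  # leading 1.0 is the intercept column
    y.append(1 if random.random() < p_true else 0)

b_ols, b_logit = fit_ols(rows, y), fit_logistic(rows, y)
ols_scores = [dot(r, b_ols) for r in rows]
logit_scores = [dot(r, b_logit) for r in rows]  # linear predictor; same order as probability

# Compare the two rankings: Spearman correlation and share of cases in the same decile.
r_ols, r_logit = ranks(ols_scores), ranks(logit_scores)
rho = 1 - 6 * sum((a - b) ** 2 for a, b in zip(r_ols, r_logit)) / (n * (n * n - 1))
same_decile = sum((a * 10 // n) == (b * 10 // n) for a, b in zip(r_ols, r_logit)) / n
print(f"Spearman rank correlation: {rho:.4f}")
print(f"Share of cases in the same decile: {same_decile:.3f}")
```

On real data the agreement will depend on the predictors and the base rate of the behaviour, so treat this as a template for running the check on your own file, not as evidence in itself.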

I should add that in the work we do, the assumptions of multiple linear regression are also “grossly violated” — routinely. The residuals are never normally distributed, not with the data we’ve got. But the model scores are perennially useful (to varying degrees), with the result that nonprofits save money that would otherwise be wasted on useless appeals and so on. I think you’d agree that an organization would probably choose saving money by being guided by a model (however flawed) over flying blind — eschewing modelling only because certain assumptions could not be met.

Sometimes I think it would be helpful for predictive modellers to adopt an entirely different jargon from that of professional statisticians. We encourage fundraising and advancement non-statisticians to use powerful tools such as regression because a little knowledge is far more lucrative than it is dangerous. These non-experts should not imagine that a surface knowledge of statistical tools allows them to design drug trials or psychology experiments, but so what? No one’s going to let them.

That’s why I feel conflicted about the subject of “experts,” now that you bring it up. Our field could certainly use more experts, but what we need even more than experts are lots and lots of newbies delving into their own data, learning rapidly and iteratively through one flawed analysis after another until they are creating real value for their organizations — undaunted by the warnings. There are other fields that have greater need for experts, and are able to pay them what they are worth.

]]>It’s certainly true that many people know OLS and don’t know logistic. But so what? That’s why there are experts.

]]>Thanks Will. You’re confusing a comment about practitioners’ use of techniques with a critique of the technique itself. I defer to Peter and John on the details, but my take is that we’re not saying that logistic regression doesn’t have a history or hasn’t been thoroughly worked out. We’re saying that if anyone working in higher ed advancement or fundraising is familiar with predictive modeling, the tool they probably understand best is multiple linear regression. The only advantage in learning to use other, less well-understood methods would be if they offered vastly superior predictive power in a practical setting. If there are any studies that indicate logistic regression or any other method yields superior (or even simply different) results, we’d love to see them. The array of available tools is rich, but evidence is lacking that the everyday user without an advanced degree in statistics attempting to solve practical business problems needs to learn these tools. As we’ve already said, if the user is familiar with and more comfortable with using logistic regression, then they should go ahead and do so. And if the user is already an expert, then they’re not reading this blog. Our aim is to make advanced methods more accessible to non-experts, most of whom do not have the time or inclination to delve into methods that have specific, “appropriate” applications but which yield marginal benefits in a practical sense. All that said, no one with the aptitude, time and interest should be discouraged from exploring and learning as much as possible about what advantages may be gained from trying other approaches. Only, let’s be sure there ARE real and measurable advantages before we advocate for one method over another.

]]>Fitting lines to data by minimizing squared error dates back to Gauss in 1795. Linear regression using absolute errors is actually even older than that, having been introduced by Boscovich in 1757. The logistic curve was originally used in 1838 by Quetelet and Verhulst, for fitting population proportions. Naturally, considerable theoretical and practical development has emerged for both functional forms since.

I can only imagine what you think of things like neural networks or kernel regression.

]]>I think you are quite right about over- and under-estimation using linear modelling in predicting probability of a binary event. But so much of our work involves ranking individuals (prospects), from most likely to engage in the behaviour of interest to least likely. It is that ranking which essentially does not change from one method to the other.

]]>See how there’s really no big jump in the probability of a 1 until a value of 7 on the X axis, and no more significant jumping in probability past a value of 14. What would happen if you modeled event probability using OLS? Well, I think you’d get an overestimation at many of the earlier values of the X variable, and an underestimation at many of the later values.
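A small sketch can make that concrete. The curve below is an assumption, not the data behind the graph: a logistic curve with midpoint 10.5 and scale 1 on X values 0 through 21, chosen only because it rises mainly between 7 and 14. To keep things deterministic, the straight line is fitted by least squares to the curve itself and not to simulated 0/1 outcomes.

```python
import math

def logistic_curve(x, midpoint=10.5, scale=1.0):
    """Assumed S-shaped probability curve; midpoint and scale are illustrative."""
    return 1 / (1 + math.exp(-(x - midpoint) / scale))

xs = list(range(0, 22))
ps = [logistic_curve(x) for x in xs]

# Closed-form simple linear regression of the curve values on x.
mean_x = sum(xs) / len(xs)
mean_p = sum(ps) / len(ps)
slope = (sum((x - mean_x) * (p - mean_p) for x, p in zip(xs, ps))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_p - slope * mean_x
line = [intercept + slope * x for x in xs]

# Show where the straight line sits above or below the curve.
for x, p, fitted in zip(xs, ps, line):
    side = "over" if fitted > p else "under"
    print(f"x={x:2d}  curve={p:.3f}  line={fitted:+.3f}  line is {side}")

# Both the line and the curve increase with x, so sorting cases by either
# one produces the same order.
same_order = sorted(xs, key=lambda x: line[x]) == sorted(xs, key=lambda x: ps[x])
print("Same rank order:", same_order)
```

Under these assumptions the line sits above the curve at X = 7 and below it at X = 14, and it runs below 0 and above 1 at the two ends of the range. With a single predictor the rank order is nonetheless identical, because both functions are monotone in X.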

]]>Hi Tom, thanks for the comment. Peter and I agree with you that folks who are comfortable using logistic regression should use it. We encounter many people who are not comfortable with logistic regression, but have some facility with multiple regression. Those people should feel confident that they can produce good models with multiple regression.

In terms of the predicted probabilities outside the range, we teach our clients to bin the probabilities into Score Levels so we don’t have the bumping issues.
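As a rough illustration of that kind of binning (the function name, the number of levels, and the sample scores are all hypothetical, and this is not necessarily the exact procedure taught to clients), raw model output can be converted to rank-based score levels like this:

```python
def score_levels(scores, n_levels=10):
    """Assign each raw score a level from 1 (lowest) to n_levels (highest) by rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    levels = [0] * len(scores)
    for rank, i in enumerate(order):
        # Integer arithmetic splits the ranked cases into equal-sized groups.
        levels[i] = rank * n_levels // len(scores) + 1
    return levels

# Hypothetical raw OLS output, including values below 0 and above 1.
raw = [-0.08, 0.02, 0.15, 0.31, 0.47, 0.55, 0.72, 0.90, 1.04, 1.12]
print(score_levels(raw, n_levels=5))  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

Because the levels depend only on the order of the scores, a raw value of -0.08 or 1.12 simply lands in the lowest or highest level, and it no longer matters that it is not a valid probability.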

We sometimes use Unique Database IDs as an independent variable, not a dependent variable. Because we are building a predictive model and not a causation model we aren’t typically concerned about understanding why a variable is significant. We don’t try to understand enough about each dependent variable to ensure that we are not using independent variables that are proxies for the dependent variable. Nevertheless, one can imagine that the Unique Database ID variable could hold some information about order of entry into the database.

]]>That being said, the example shown here is pretty cool in that it does argue that the final results could be more or less the same. Once I see enough evidence across the board that this is the case, then maybe I’ll be convinced. Until then, I’m sticking with logistic regression to predict a binary dependent variable 🙂

]]>“One final note. When looking at your 3 graphs the relationship between the two probabilities are clearly linear and directly proportional (again suggesting substitution between approaches) with exceptions near zero (prob70%) where there are signs of non-linearity and saturation. Given that advancement services are usually mainly much more interested in the extremes (small and high probs); one would maybe have to be careful about the choice of approach.”

Really do miss the advancement arena!

]]>