# CoolData blog

## 9 June 2010

### Why multiple regression?

Filed under: Model building, regression, Statistics. Posted by kevinmacdonell @ 6:11 am

Not long ago I wrote about Pearson’s r, also known as Pearson’s Product-Moment Correlation Coefficient. This is a convenient statistical tool available in any stats software program (Excel can calculate it too) that yields a numerical measure of the strength of the correlation (linear dependence) between any two variables, X and Y.
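For anyone curious about what the software is doing under the hood, here is a minimal sketch of the calculation in plain Python. The function name `pearson_r` and the toy numbers are assumptions made up for illustration; they are not drawn from the data behind this post.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # deviations of X from its mean
    dy = [y - mean_y for y in ys]  # deviations of Y from its mean
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: Homecomings attended vs. log of lifetime giving.
homecomings = [0, 0, 1, 2, 3, 5]
log_giving = [0.0, 1.2, 1.9, 2.4, 3.1, 3.8]

r = pearson_r(homecomings, log_giving)
print(f"r = {r:.3f}")
```

The result always falls between -1 and 1: values near either extreme mean a strong linear relationship, and values near 0 mean a weak one.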

I will show you how you can easily create a predictive score using only Pearson’s r — and why you probably shouldn’t!

Pearson’s r points the way toward weighting your predictor variables according to how strongly each one is correlated with the value you are predicting. If you assume all your predictor variables are worth the same (“1” for positive predictors and “-1” for negative predictors), you are imposing a subjective weighting on your variables. (I wrote about this limitation of the simple score in my most recent post, Beyond the simple score.)

Is Homecoming attendance more or less predictive than presence of an email address? Pearson’s r will tell you. Look at the example in the table above. Ten common predictor variables are listed in order of their strength of correlation with Lifetime Giving (log-transformed). It just so happens that the two strongest correlations are negative; judged by absolute value, they are stronger than any of the others in the list, so I put them at the top. According to these values, “Email present” is relatively weak compared with “Number of Homecomings attended.”

Do you see where this is going? If you wanted to, you could use these correlations to directly create weighted scores for everyone in your database. Just multiply each variable by its Pearson value, sum up the products, and bingo — there’s your score.
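As a rough sketch of that recipe in Python: the weights below are made-up stand-ins for Pearson values (not the numbers from my table), and the variable names and the `naive_score` helper are hypothetical.

```python
# Hypothetical Pearson's r values, for illustration only.
weights = {
    "single": -0.30,
    "employer_present": 0.25,
    "position_present": 0.24,
    "homecomings_attended": 0.20,
    "email_present": 0.10,
}

def naive_score(record, weights):
    """Multiply each predictor by its Pearson value and sum the products."""
    return sum(weight * record.get(name, 0) for name, weight in weights.items())

# One hypothetical constituent: 0/1 indicators plus a count of Homecomings.
constituent = {
    "single": 0,
    "employer_present": 1,
    "position_present": 1,
    "homecomings_attended": 2,
    "email_present": 1,
}

score = naive_score(constituent, weights)
print(f"score = {score:.2f}")
```

Notice that this constituent collects 0.25 for having an employer on file and another 0.24 for having a position on file.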

You could do that, but I don’t think you should.

At least two of the predictor variables we want to use in our score are very closely related to each other: Employer present and Position present. They aren’t exactly alike: For some constituents you will have one piece of information and not the other. But on the whole, if one is present in your database for any given constituent, chances are you’ve got the other as well.

In other words, if you include both variables in your score, you’re double-counting the effect of employment information in your model — despite the fact that each is properly weighted by Pearson score. The reason is that Pearson’s r treats only two variables at a time: X and Y. It does NOT account for any interactions between multiple Xs.

Employment variables are only an obvious example. All of your variables will interact with each other to some degree, some strongly, others more subtly. By “interact with each other” I mean “correlate with each other.” In fact, we can use Pearson’s r to show which combinations of predictor variables are strongly correlated. The table below lists three variable pairs, drawn from real data, that exhibit strong interactions — including the employment example we’ve just mentioned.

The Pearson value for the employment variables is very close to 1, which indicates a nearly perfect positive correlation. The other two are more subtle, but make sense as well: Younger alumni will tend to be coded as Single in your database, and if we have a job title for someone, chances are we’ll have a business phone number too.
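Checking predictors against each other is the same calculation, just run on every pair of Xs instead of on X and Y. Here is a minimal sketch; the three 0/1 columns are invented toy data for ten constituents, and the 0.7 cutoff is an arbitrary assumption rather than a recommended rule.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson's r for two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical indicator columns (1 = present, 0 = absent).
predictors = {
    "employer_present": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "position_present": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "email_present":    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
}

THRESHOLD = 0.7  # arbitrary cutoff for "strongly correlated"

strong_pairs = {}
for (name_a, col_a), (name_b, col_b) in combinations(predictors.items(), 2):
    r = pearson_r(col_a, col_b)
    if abs(r) >= THRESHOLD:
        strong_pairs[(name_a, name_b)] = r

for pair, r in strong_pairs.items():
    print(pair, round(r, 3))
```

In this toy data only the two employment columns clear the cutoff, because they disagree for just one constituent.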

This overlapping of the explanatory effect of various Xs on Y will interfere with our ability to properly weight our predictors. Pearson’s Product-Moment Correlation Coefficient is important for understanding our variables, but not quite up to the task of directly creating predictive scores. What now?

Well, multiple regression! Only regression will account for interactions among our predictor Xs, recalculating the coefficients (weightings) on the fly each time we add a new predictor variable. Working from the Pearson list at the top of this post, we would add Class year, Single, and Employer present to our regression window one by one. Everything would be fine up to that point; the p-values for these variables would be very low. When we add Position present, however, its p-value is too high (0.183, which exceeds the rule-of-thumb threshold of 0.05), and R squared fails to improve. We would therefore leave Position present out of the regression, because it isn’t adding any new predictive information to the model and might interfere with the effectiveness of other variables in subtle and strange ways.
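To make the "R squared fails to improve" idea concrete, here is a small sketch of ordinary least squares in plain Python, fitted with and without a redundant predictor. Everything in it is an assumption for illustration: the helper names `solve` and `fit_ols`, the ten-row toy data set, and the choice to compare R squared only. It does not compute p-values; a stats package would report those for you.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(columns, y):
    """Least-squares fit with an intercept; returns (coefficients, r_squared)."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(rows[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot

# Hypothetical toy data for ten constituents.
employer = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
position = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # nearly a copy of employer
email = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
log_giving = [3.6, 2.9, 3.5, 3.1, 3.4, 1.1, 1.4, 1.0, 1.4, 1.1]

beta_base, r2_base = fit_ols([employer, email], log_giving)
beta_full, r2_full = fit_ols([employer, email, position], log_giving)

print(f"without position: R squared = {r2_base:.4f}")
print(f"with position:    R squared = {r2_full:.4f}")
```

On this toy data R squared is already above 0.99 before Position present goes in, so the redundant variable can add less than 0.01. Adding a predictor never lowers R squared in a least-squares fit, which is why the size of the gain, together with the p-value, is the thing to watch.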

Often when I use the word “regression” on someone, what I see reflected back in their eyes is fear. (I really need to reserve that word for people I don’t like.) I wish, though, that people could see that regression is a bit like an automobile: A complex machine with many moving parts, but familiar and approachable, with a simple and comprehensible purpose, and above all operable by just about anyone.