A short while ago I wrote about pairs of predictor variables that are highly correlated with each other, i.e. that have strong interactions in regression analysis. (Making hay when predictor variables interact.) The example I used was Position Present and Employer Name Present. Instead of using one and throwing the other away as redundant, you can combine them to form a new variable with more predictive power than either of the original two on their own.
In this post, I’ll show you how to identify other likely pairs of variables from which you can try to make similar “combination variables.”
When independent variables interact in regression, it’s called multicollinearity. (You can read a good discussion of multicollinearity on the Stats Make Me Cry blog: Top Ten Confusing Stats Terms Explained in Plain English, #9: Multicollinearity.) Position Present and Employer Name Present is an obvious example, but all kinds of subtle combinations are possible and difficult to foresee. We need to call on some help in detecting the interaction. That help is provided by Pearson’s Product-Moment Correlation, also known as Pearson’s r. I’ve written about Pearson’s r before.
In a nutshell, Pearson’s r calculates a number that describes the strength of linear correlation between any two variables. Your stats software makes this easy. In DataDesk, I select all the icons of the variables I want to calculate correlations for, then find Pearson’s r in the menu. The result is a new window containing a table full of values. If you include many variables at once, this table will be massive. Click on the image below for a full-size version of a Pearson table based on some real data from a university. (You might have to enlarge it in your browser.) The table works exactly like those distance tables you find on old tourist highway maps (they don’t seem to make those anymore — wonder why); to find the distance from, say Albuquerque to Santa Fe, you’d find the number at the intersection of the Albuquerque column with the Santa Fe row, and that would be the number of miles to travel. In the table below, the cities are variables and the mileage is Pearson’s r.
Don’t be intimidated by all the numbers! Just let your eye wander over them. Notice that some are positive, some negative. The negative sign simply means that the correlation between the two variables is negative. Notice also that most of the numbers are small, less than 0.1 (or minus 0.1). As far as multicollinearity is concerned, we’re most interested in these bigger values, i.e. values that are furthest from zero (no correlation) and closest to 1 (perfect linear correlation).
I realize some of the variable names will be a bit mysterious, but you might be able to guess that “Number deg” is Number of Degrees and that “Grad Y” means Graduated. Their Pearson’s r value (0.26) is one of the higher correlations, which makes sense, right?
Noticing certain correlations can teach you things about the data. ‘Female’ is correlated with ‘Class (Year)’ — because at this university, males outnumbered females years ago, but since the 1980s, females have outnumbered males by an ever-increasing factor. On the other hand, ‘Number HC’ (campus reunions attended) is negatively correlated with ‘Class Year’ — older alumni have attended more events (no big surprise), but also young alumni are not big on reunions at this institution.
Look at ‘Business Phone Present’ and ‘Employer Present’. Their r value is relatively high (0.376). I would test some variations of those two. You could add them together, so that the variable could range from 0 to 2. Or you could multiply them, to give you a binary variable that would have a value of 1 only if both of the original variables was 1. You might end up with a predictor that is more highly correlated with ‘Giving’ than either of the original two variables.
With non-binary variables such as ‘Class Year’ and ‘Number Events Attended’, the results of combining will be even more varied and interesting. What you do is up to you; there’s no harm in trying. When you’re done playing, just rank all your old and new variables in order by the absolute value of their strength of linear correlation with your predicted value (say, Giving), and see how the new variables fare.