You can’t have a serious blog post related to statistics without tossing in the name of a dead white guy. How dead? Well, yesterday was the 74th anniversary of Karl Pearson’s death in Surrey, England. How white? Pearson was a fan of eugenics, social Darwinism, and the “struggle of race with race” – with the supposedly best race winning. Charming! Like him or not, his contribution to statistics (and therefore science) was huge.
I’ve written a bit about multiple linear regression, but not a great deal about how to do it. After I’ve got my data file ready and before I open up a regression window in Data Desk, ranking my predictor variables using ol’ Pearson’s measure of correlation is Step One. The object is to rank your predictor variables (a.k.a. independent variables) according to the strength (either positive or negative) of their linear correlation with your predicted value (a.k.a. dependent variable). I do this to determine the order in which I will introduce my predictor variables into my regression analysis. (All of the following discussion assumes that your variables are numerical: either continuous variables such as ‘class year’, or 0/1 indicator variables you’ve created from your categorical variables.)
The tool I’ve been taught to use is Pearson’s Product-Moment Correlation Coefficient, also called Pearson’s r. This is a quantitative tool which yields a coefficient that describes the strength and direction of the linear relationship (as can best be determined) between your Y variable (say, ‘giving’) and one of your X’s (say, ‘class year’). It is not the slope of the line itself: the slope depends on the units of your variables, while Pearson’s r does not. A value of 1 denotes a perfectly linear correlation in a positive direction, and a value of -1 is a perfect negative correlation. All possible values fall between -1 and 1. Values near zero denote the absence of a linear correlation; the variables might be related in some other way, but not linearly.
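If you want to see what is going on under the hood, here is a minimal sketch in plain Python of how the coefficient can be computed by hand. It is not how Data Desk does it internally (I don’t know that), and the six data points are made up purely for illustration. The sketch also compares r with the least-squares slope: multiplying ‘giving’ by 100 changes the slope a hundredfold but leaves r untouched.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r: co-variation of x and y, scaled by the spread of each."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def slope(xs, ys):
    """Least-squares slope of y on x, shown only for comparison with r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Made-up toy data (NOT from the post's data file)
class_year = [1960, 1970, 1980, 1990, 2000, 2010]
giving = [9.0, 7.5, 8.0, 5.0, 4.5, 2.0]

r = pearson_r(class_year, giving)
b = slope(class_year, giving)

# Change the units of giving: r stays put, the slope does not
giving_x100 = [g * 100 for g in giving]
r_rescaled = pearson_r(class_year, giving_x100)
b_rescaled = slope(class_year, giving_x100)

print(f"r = {r:.3f}, slope = {b:.4f}")
print(f"after rescaling: r = {r_rescaled:.3f}, slope = {b_rescaled:.4f}")
```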
A scatterplot of the two variables ‘giving’ and ‘class year’, shown here, will reveal a relationship visually: ‘giving’ tends to decrease as ‘class year’ increases. This negative linear relationship is indicated by the downward-sloping line. The Pearson Correlation reveals the same thing, but not visually: it puts an actual number to it, and that number describes how tightly the points cluster around the line and whether the line runs upward or downward. Why is this important? Because for many of your variables, a scatterplot is just going to look like a mess – the linear relationship is in there somewhere, but it’s not evident from the cloud of points. If you have a calculated value instead, you can easily decide which linear relationships demand priority attention.
It’s easy to do in Data Desk. Just select the icon for ‘giving’ as your Y and also select all of your predictor variables (x), then go to Calc in the menu. Select Correlations, then Pearson Product-Moment. If you have a lot of variables, the resulting table will be impressively large. It will look fearsome or beautiful, depending on how you feel about being faced with a wall of numbers. (I think it’s gorgeous.) To find the value that relates one variable to another, find the intersection of the row and column of the two variables. For example, in the table below, the Pearson correlation value for ‘giving’ and ‘class year’ is -0.460. (The correlation of any variable with itself is, of course, a perfect 1.)
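If you don’t have Data Desk, the same kind of table can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the column names are stand-ins, the six rows of data are made up, and the helper functions are my own rather than anything from a statistics package. Looking up a pair of variables works just like finding the intersection of a row and a column.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_table(columns):
    """Return {row_name: {column_name: r}} for every pair of variables."""
    return {
        a: {b: pearson_r(col_a, col_b) for b, col_b in columns.items()}
        for a, col_a in columns.items()
    }

# Made-up toy data; each list is one variable, each position one person
columns = {
    "giving": [9.0, 7.5, 8.0, 5.0, 4.5, 2.0],
    "class_year": [1960, 1970, 1980, 1990, 2000, 2010],
    "employer_present": [1, 1, 0, 1, 0, 0],
    "position_present": [1, 1, 0, 0, 0, 0],
}

table = correlation_table(columns)

# Print the wall of numbers, one row per variable
names = list(columns)
print(" " * 18 + "".join(f"{name:>18}" for name in names))
for a in names:
    print(f"{a:<18}" + "".join(f"{table[a][b]:>18.3f}" for b in names))

# Row/column intersection for one pair of variables
giving_vs_class_year = table["giving"]["class_year"]
```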
Have a look around this table. Don’t be concerned about the actual values. Just see which values are higher than others. For instance, look at the intersection of “position present” and “employer present”. It’s a very high value: 0.812, which is very close to 1! This tells us that these two predictor variables are going to “interact” with each other when we bring them into the regression analysis, because they carry much of the same information. It makes sense: Job title and employer name are likely to be present or absent in tandem, although not perfectly. The practical result is that one of these variables will typically prove to be a significant predictor, while the other adds little or nothing new, and will be left out.
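Scanning a big table by eye for pairs like this gets tedious, so here is a rough sketch of how the search could be automated. The 0.812 value is the one from the table above; the other two coefficients are made up, and the cutoff of 0.8 is an arbitrary assumption of mine, not a rule.

```python
# Correlations between pairs of PREDICTORS (not with giving).
pair_r = {
    ("class year", "employer present"): -0.150,        # made up
    ("class year", "position present"): -0.120,        # made up
    ("employer present", "position present"): 0.812,   # from the table in this post
}

def flag_overlapping(pair_r, threshold=0.8):
    """Return (a, b, r) for pairs whose |r| meets the threshold, strongest first."""
    flagged = [(a, b, r) for (a, b), r in pair_r.items() if abs(r) >= threshold]
    return sorted(flagged, key=lambda item: abs(item[2]), reverse=True)

flagged = flag_overlapping(pair_r)
for a, b, r in flagged:
    print(f"{a} / {b}: r = {r:.3f} (expect only one of these to survive)")
```

A flagged pair is only a warning; which of the two variables stays in the model is still decided in the regression itself.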
So how do I decide which variable gets added before the other? It’s simple.
The only part of the whole Pearson table that we’re interested in is the column of values under the heading ‘giving’. Data Desk allows us to copy the table as text and paste it into Excel. When I do this and strip out all the stuff I’m not interested in, the result looks like this. (I’ve re-sorted the variables alphabetically.)
Next, I sort the variables according to their Pearson correlation with Giving. The variables with the highest values will head the list. But notice a small problem: The strongest NEGATIVE variables end up at the very bottom. Really, with its high correlation with ‘giving’, the class year variable should rank first. So I do one extra step, creating a column with an Excel formula for the absolute value of the Pearson coefficient (i.e. without the minus sign), and re-sort on that value.
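The same two sorts can be sketched outside Excel. In this Python example only the -0.460 for ‘class year’ comes from the table above; the other coefficients are made up so there is something to sort. A plain descending sort buries ‘class year’ at the bottom, while sorting on the absolute value puts it first.

```python
# Pearson correlation of each predictor with giving.
giving_r = {
    "business phone present": 0.250,  # made up
    "class year": -0.460,             # from the table in this post
    "email present": 0.120,           # made up
    "employer present": 0.310,        # made up
    "position present": 0.295,        # made up
}

# Naive sort: highest value first, so strong negatives sink to the bottom
naive = sorted(giving_r.items(), key=lambda item: item[1], reverse=True)

# Sort on the absolute value instead (the extra Excel column)
ranked = sorted(giving_r.items(), key=lambda item: abs(item[1]), reverse=True)

for name, r in ranked:
    print(f"{name:<24} r = {r:+.3f}   |r| = {abs(r):.3f}")
```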
This gives me a clear idea of the order in which I should add variables to the regression. For example, ‘employer present’ seems to edge out ‘position present’. Due to variable interaction, though, the final roster of which variables will stay and which will go is NOT evident at this point. The proof is in the regression, where all sorts of interesting and unforeseeable interactions may crop up.
You don’t have to take this manual approach to adding your variables – your software probably offers an automated, or partially automated, method called stepwise regression. But after all the work of preparing my predictors, I enjoy watching the way they interact with each other as I work through training the model. The way I see it, the more hands-on you are with your analysis, the more you learn about your data.
Final note: The examples above actually use a transformed value of ‘giving’ – the log of giving. Transforming our dependent variable using a logarithmic function is a perfectly valid way to make the linear relationships among variables much more evident. (Why we transform variables is explained more fully here.) If we used ‘giving’ just as it is, the Pearson values would be very low, which would indicate only a very weak linear relationship. Even ‘class year’ would have a low value, and we know that isn’t a good description of reality; the scatterplot above represents it far better.
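To see the effect of the transformation for yourself, here is a small sketch that computes Pearson’s r for ‘class year’ against raw giving and against the log of giving. The ten donors are made up, with a few very large gifts among the older classes, and the sketch assumes every amount is greater than zero (the log of zero is undefined, so non-donors would need separate handling). How big the difference is in your own data will vary.

```python
from math import log10, sqrt

def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up toy data: lifetime giving in dollars, heavily skewed
class_year = [1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000]
giving = [50000, 12000, 8000, 2500, 900, 400, 150, 60, 25, 10]

# Log-transform the dependent variable (all amounts must be positive)
log_giving = [log10(amount) for amount in giving]

r_raw = pearson_r(class_year, giving)
r_log = pearson_r(class_year, log_giving)

print(f"class year vs giving:      r = {r_raw:.3f}")
print(f"class year vs log(giving): r = {r_log:.3f}")
```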