You can’t have a serious blog post related to statistics without tossing in the name of a dead white guy. How dead? Well, yesterday was the 74th anniversary of Karl Pearson’s death in Surrey, England. How white? Pearson was a fan of eugenics, social Darwinism, and the “struggle of race with race” – with the supposedly best race winning. Charming! Like him or not, his contribution to statistics (and therefore science) was huge.
I’ve written a bit about multiple linear regression, but not a great deal about how to do it. After I’ve got my data file ready and before I open up a regression window in Data Desk, ranking my predictor variables using ol’ Pearson’s measure of correlation is Step One. The object is to rank your predictor variables (a.k.a. independent variables) according to the strength (either positive or negative) of their linear correlation with the variable you want to predict (a.k.a. the dependent variable). I do this in order to determine the order in which I will introduce my predictor variables into my regression analysis. (All of the following discussion assumes that your variables are numerical: either continuous variables such as ‘class year’, or 0/1 indicator variables you’ve created from your categorical variables.)
The tool I’ve been taught to use is Pearson’s Product-Moment Correlation Coefficient, also called Pearson’s r. This is a quantitative tool which yields a coefficient describing the strength and direction of the linear relationship between your Y variable (say, ‘giving’) and one of your X’s (say, ‘class year’). (Strictly speaking, r isn’t the slope of the best-fit line; it’s the slope you would get if both variables were standardized first. The sign tells you the direction of the relationship, and the magnitude tells you how tight it is.) A value of 1 denotes a perfectly linear correlation in a positive direction, and a value of -1 is a perfect negative correlation. All possible values fall between -1 and 1. Values near zero denote the absence of a linear correlation; the variables might be related in some other way, but not linearly.
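For the curious, the calculation itself is simple: r is the covariance of X and Y divided by the product of their standard deviations. Here’s a minimal sketch in Python (not the tool I use in this post, and the numbers are invented purely for illustration) that computes it from scratch:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: the covariance of x and y divided by the
    product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xd = x - x.mean()
    yd = y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

# Made-up numbers: giving falls as class year rises, so r is negative.
class_year = [1960, 1970, 1980, 1990, 2000]
giving = [500, 400, 250, 100, 50]

print(pearson_r(class_year, giving))          # about -0.99
print(np.corrcoef(class_year, giving)[0, 1])  # numpy agrees
```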
A scatterplot of the two variables ‘giving’ and ‘class year’, shown here, will reveal a relationship visually: ‘giving’ tends to decrease as ‘class year’ increases. This negative linear relationship is indicated by the downward-sloping line. The Pearson correlation reveals the same thing, but not visually: it puts an actual number to the strength and direction of that relationship. Why is this important? Because for many of your variables, a scatterplot is just going to look like a mess – the linear relationship is in there somewhere, but it’s not evident from the cloud of points. If you have a calculated value instead, you can easily decide which linear relationships demand priority attention.
It’s easy to do in Data Desk. Just select the icon for ‘giving’ as your Y, select all of your predictor variables as your X’s, then go to Calc in the menu. Select Correlations, then Pearson Product-Moment. If you have a lot of variables, the resulting table will be impressively large. It will look fearsome or beautiful, depending on how you feel about being faced with a wall of numbers. (I think it’s gorgeous.) To find the value that relates one variable to another, find the intersection of the row and column of the two variables. For example, in the table below, the Pearson correlation value for ‘giving’ and ‘class year’ is -0.460. (The correlation of any variable with itself is, of course, a perfect 1.)
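If you don’t have Data Desk, any stats package will produce the same table. Here’s a sketch in Python with pandas, using invented data and hypothetical column names that mirror the variables above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: two 0/1 indicators that tend to travel together,
# plus class year and giving.
employer_present = rng.integers(0, 2, n)
# 'position present' copies 'employer present' about 90% of the time.
position_present = np.where(rng.random(n) < 0.9,
                            employer_present, 1 - employer_present)
class_year = rng.integers(1950, 2010, n)
giving = 8 - 0.05 * (class_year - 1950) + 2 * employer_present \
         + rng.normal(0, 1.5, n)

df = pd.DataFrame({
    "giving": giving,
    "class year": class_year,
    "employer present": employer_present,
    "position present": position_present,
})

# The full Pearson table, analogous to Data Desk's output:
# each cell is r for the row/column pair, and the diagonal is 1.
print(df.corr(method="pearson").round(3))
```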
Have a look around this table. Don’t be concerned about the actual values. Just see which values are higher than others. For instance, look at the intersection of ‘position present’ and ‘employer present’. It’s a very high value: 0.812, which is very close to 1! This tells us that these two predictor variables are going to “interact” with each other when we bring them into the regression analysis. It makes sense: job title and employer name are likely to be present or absent in tandem, although not perfectly. The practical result is that one of these variables will prove to be a significant predictor, while the other adds little or nothing new, and will be left out.
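You don’t have to eyeball a monster table to find these pairs, either. Here’s a little sketch in Python with an invented correlation matrix (only the 0.812 comes from the table above; the rest is made up) that flags any pair of predictors correlated beyond a threshold:

```python
import pandas as pd

# Hypothetical correlation matrix for three predictors; only the 0.812
# comes from the table discussed above, the other values are invented.
corr = pd.DataFrame(
    [[1.000, 0.812, -0.120],
     [0.812, 1.000, -0.095],
     [-0.120, -0.095, 1.000]],
    index=["position present", "employer present", "class year"],
    columns=["position present", "employer present", "class year"],
)

# Flag predictor pairs correlated strongly enough that we should
# expect them to "interact" in the regression.
threshold = 0.7
for i, a in enumerate(corr.index):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) >= threshold:
            print(f"{a} vs {b}: r = {corr.loc[a, b]:.3f}")
```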
So how do I decide which variable gets added before the other? It’s simple.
The only part of the whole Pearson table that we’re interested in is the column of values under the heading ‘giving’. Data Desk allows us to copy the table as text and paste it into Excel. When I do this and strip out all the stuff I’m not interested in, the result looks like this. (I’ve re-sorted the variables alphabetically.)
Next, I sort the variables according to their Pearson correlation with ‘giving’. The variables with the highest values will head the list. But notice a small problem: the strongest NEGATIVE variables end up at the very bottom. Really, with its strong (albeit negative) correlation with ‘giving’, the class year variable should rank first. So I do one extra step, creating a column with an Excel formula for the absolute value of the Pearson coefficient (i.e. with the minus sign stripped off), and re-sort on that value.
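In Excel, that extra column is just =ABS() filled down and a re-sort. For anyone working in Python instead, here’s the same ranking step as a sketch, with invented coefficients (only the -0.460 for class year comes from the table above):

```python
import pandas as pd

# Hypothetical correlations with 'giving'; only the -0.460 for
# class year comes from the table above.
r_with_giving = pd.Series({
    "class year": -0.460,
    "employer present": 0.310,
    "position present": 0.295,
    "single": -0.180,
    "email present": 0.150,
})

# Rank by the strength of the linear relationship, ignoring the sign,
# but keep the signed values so the direction stays visible.
order = r_with_giving.abs().sort_values(ascending=False).index
print(r_with_giving.reindex(order))
```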
This gives me a clear idea of the order in which I should add variables to the regression. For example, ’employer present’ seems to edge out ‘position present’. Due to variable interaction, though, the final roster of which variables will stay and which will go is NOT evident at this point. The proof is in the regression, where all sorts of interesting and unforeseeable interactions may crop up.
You don’t have to take this manual approach to adding your variables – your software probably offers an automated, or partially automated, method called stepwise regression. But after all the work of preparing my predictors, I enjoy watching the way they interact with each other as I work through training the model. The way I see it, the more hands-on you are with your analysis, the more you learn about your data.
Final note: The examples above actually use a transformed value of ‘giving’ – the log of giving. Transforming our dependent variable using a logarithmic function is a perfectly valid way to make the linear relationships among variables much more evident. (Why we transform variables is explained more fully here.) If we used ‘giving’ just as it is, the Pearson values would be very low, indicating only a very weak linear relationship. Even ‘class year’ would have a low value, and we know from the scatterplot above that a weak relationship isn’t a good description of the reality.
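If you want to convince yourself of that, here’s a little sketch with made-up data: ‘giving’ is simulated to decay exponentially with class year, roughly what real, heavily skewed giving data looks like. Pearson’s r on the raw dollar values looks weak, while r on the log of giving is much stronger:

```python
import numpy as np

rng = np.random.default_rng(1)
class_year = rng.integers(1950, 2010, 500)

# Hypothetical giving that decays exponentially with class year:
# a few huge gifts from old alumni, lots of small recent ones.
giving = np.exp(10 - 0.1 * (class_year - 1950) + rng.normal(0, 1, 500))

def r(x, y):
    return np.corrcoef(x, y)[0, 1]

print(r(class_year, giving))              # weak-looking r on raw dollars
print(r(class_year, np.log(giving + 1)))  # much stronger after the log
```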
I had no idea Pearson was so, uh… charming!
I agree, though, about how important bivariate correlations are before running a regression model. Every time I skip that step, I get myself into trouble. Univariate statistics too.
I also try to encourage people to think of some pre-analysis steps as part of the analysis. Makes the later analysis steps easier. Here’s an outline: The 11 Steps for Statistical Modeling in any Regression or ANOVA http://www.analysisfactor.com/statchat/?p=671.
Comment by Karen Grace-Martin — 28 April 2010 @ 11:28 am
Thanks Karen … I encourage people to check out Karen’s site and her various stats-related resources, including her webinars. I would have to agree: Ranking your predictors isn’t really “step one” … there are important steps leading up to the analysis itself, including defining the right question that you’re trying to answer.
Comment by kevinmacdonell — 28 April 2010 @ 11:40 am
Really great blog Kevin! I think people get excited to explore things with regression and other types of analysis, and they often blow right by more basic things that can REALLY inform those analyses.
As I read the article, it made me think of things like suppression and how it plays into these preliminary stages. How would you recommend considering potential suppressor effects, given that those types of relationships may not be clearly represented in something like a correlation? Once again, great blog, keep ’em coming!
Comment by Stats Make Me Cry Guy (Jeremy) — 28 April 2010 @ 5:16 pm
Hi Jeremy … Suppression is not something I worry about too much. Yes, with the variables we work with, there is a lot of redundancy in their explanatory effects. But at least for universities with robust databases and the means to extract relevant variables, there’s an embarrassment of riches. If I suspect a variable is going to cause trouble in my model, i.e. interfering with the significance of some of my better predictors or reversing the direction of their correlation, I’ll just leave it out. My grasp of the issue is a little uncertain, but I think suppressor variables might fall into that category. I’m of two minds, though: Sometimes I will leave these variables in as not being particularly harmful, and sometimes I’ll leave them out if I don’t like what I’m seeing. My reading leads me to believe that suppressor variables are of more concern when one is building causal models. My aim is purely predictive – I’m not trying to isolate what causes what. (Can you tell I’m dodging your question??)
Comment by kevinmacdonell — 29 April 2010 @ 7:45 pm
Of course, as long as you’re transforming variables (such as taking the log of giving), you could subtract the class year from the current year to make it a “years-since-graduation” variable. That would turn the negative correlation with giving into a positive one, saving you from having to fool with the absolute value step.
All a matter of personal preference, really.
– Jeff
Comment by Jeff Jetton — 29 April 2010 @ 11:37 am
Quite right! And 0/1 negatively-correlated variables, such as ‘marital status is single’, could simply be restated in reverse.
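For anyone who wants to see it, here’s a tiny sketch with made-up numbers showing that these restatements flip only the sign of r, never its size:

```python
import numpy as np

rng = np.random.default_rng(2)
class_year = rng.integers(1950, 2010, 200)
log_giving = 8 - 0.05 * (class_year - 1950) + rng.normal(0, 1, 200)

# Jeff's restatement: same information, opposite direction.
years_since_grad = 2010 - class_year

print(np.corrcoef(class_year, log_giving)[0, 1])        # negative r
print(np.corrcoef(years_since_grad, log_giving)[0, 1])  # same size, positive

# Same idea for a 0/1 indicator: 'single' restated as 'not single'.
single = rng.integers(0, 2, 200)
print(np.corrcoef(single, log_giving)[0, 1],
      np.corrcoef(1 - single, log_giving)[0, 1])  # signs flip, sizes match
```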
Comment by kevinmacdonell — 29 April 2010 @ 7:02 pm