If you read books and blogs on statistics, eventually your understanding of even the most basic concepts will start to smear. Things we think ought to be well-established by now are matters of controversy. Ranking high on the list of most slippery concepts is correlation. In today’s post, I’m going to make the concept very complicated, and then I’m going to dismiss the complexity and make it all simple again. I’m telling you this in advance so I don’t lose you partway through.
Correlation is the foundation of predictive modeling. The degree to which one variable, x, changes its value in relation to a second variable, y, either positively or negatively, is the very definition of x’s usefulness in predicting the value of y. The tool I use to quantify the strength of that relationship is the Pearson product-moment correlation, or Pearson’s r, which some statistics texts simply call “correlation”. (It would help if you read the blog post on Pearson before reading this one.)
Before we go any farther, I need to explain why I want to quantify the strength of relationships. At the start of any modeling project, I have a hundred or so potential predictor variables in my data file. Some are going to be excellent predictors, most will be only so-so, and others will have little or no association with the outcome variable. I want to introduce variables into the regression analysis in an order that makes sense, so that the best predictors are added first. Due to the complexity of interactions among variables, there is no telling in advance what will actually happen as variables are added, so any list of variables ordered by strength of correlation is merely a rough guide.
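To make the ranking idea concrete, here is a minimal sketch in plain Python. The data and the variable names (`giving`, `age`, `married`, `random_flag`) are made up for illustration and do not come from any real data file. It computes Pearson's r for each candidate predictor against the outcome and sorts by absolute value, so a strong negative predictor is not pushed to the bottom of the list.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy data: an outcome and three candidate predictors.
giving = [0, 0, 10, 25, 40, 100, 0, 60]
predictors = {
    "age": [22, 25, 31, 38, 45, 60, 27, 52],
    "married": [0, 0, 1, 1, 1, 1, 0, 1],
    "random_flag": [1, 0, 1, 0, 0, 1, 1, 0],
}

# Rank by strength of correlation, ignoring the sign.
ranked = sorted(
    ((name, pearson_r(values, giving)) for name, values in predictors.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)

for name, r in ranked:
    print(f"{name:12s} r = {r:+.3f}")
```

As the paragraph above says, the resulting order is only a rough guide to the sequence in which variables might be offered to the regression.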
What I DON’T use Pearson’s r for much is exploring variables. At the exploration stage, before modeling begins, I will look at a variable a number of ways in order to get a sense of how valuable the variable will be, and how I might want to transform it or re-express it to make it better. I will look at how the variable is distributed, and compare average and median giving between groups (for example, Home Phone Present, Y/N). These and other techniques, as described in Peter Wylie’s book “Data Mining for Fundraisers,” are simpler, more direct, and often more helpful than abstract measures of correlation.
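A group comparison of this kind takes only a few lines. The sketch below assumes a hypothetical list of (Home Phone Present, lifetime giving) pairs with invented values; it is not the procedure from Wylie's book, just one simple way to put mean and median giving side by side for each group.

```python
from statistics import mean, median

# Hypothetical records: (home_phone_present, lifetime_giving)
records = [
    (1, 250), (1, 0), (1, 1200), (1, 75),
    (0, 0), (0, 0), (0, 50), (0, 20),
]

def summarize_by_group(records):
    """Return count, mean, and median giving for each value of the flag."""
    groups = {}
    for flag, giving in records:
        groups.setdefault(flag, []).append(giving)
    return {
        flag: {"n": len(values), "mean": mean(values), "median": median(values)}
        for flag, values in groups.items()
    }

summary = summarize_by_group(records)
for flag in sorted(summary):
    stats = summary[flag]
    print(f"phone={flag}: n={stats['n']} mean={stats['mean']:.2f} median={stats['median']:.2f}")
```

Looking at the median alongside the mean matters here because giving data tends to be dominated by a few large donors.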
It’s only after I’ve done the exploration work and tweaked the variables for maximum effect that I’m ready to rank them in order by their correlation values. So, with that out of the way, let’s look at Pearson’s r in more detail.
The textbooks make it abundantly clear: Pearson’s r quantifies the relationship between two continuous variables that are linearly related. Right away, we’ve got a problem: Very few of the variables I work with are continuous, and most of the relationships I see do not meet the definition of “linear”. Yet, I use Pearson’s r exclusively. Does this mean I’ve been misusing and abusing the method?
I don’t see it that way, but it is interesting to read how these things are discussed in the literature.
You can assess whether a relationship between two variables is roughly linear by looking at a scatterplot of the variables. Below is a scatterplot of ‘Giving’ (log-transformed) and ‘Age’, created in Data Desk. It’s a big, messy cloud of points (some 80,000 of them!). A lot of the relationship is hidden by overplotting (the overlapping of points) along the bottom line: that row of points at the zero giving mark represents non-donors, and there are many more of them near the young end of that line than there are near the older end. Still, at least you can see a vague linear relationship in the upward fanning of the data: As age increases, so does lifetime giving. A best-fit line through the data would slope upward from left to right, and therefore the Pearson correlation value is positive.
We’ve got two continuous variables, and a linear relationship, so we get the Statistician’s Seal of Approval: It’s okay to use Pearson’s r to measure strength of correlation between these two variables.
Unfortunately, as I said, most of the variables we use in predictive modeling are not continuous, and they don’t look like much of anything in a scatterplot. Here’s a scatterplot of a Likert-scale survey response. The survey question asked alumni how likely they are to donate to alma mater, and the scale runs from 1 to 5, with 5 being “very likely.” This is not a continuous variable, because there are no possible intermediate values among the five levels. It’s ordinal. The plot with Lifetime Giving is difficult to interpret, but it sure isn’t linear:
Yes, the line of points for the highest response, 5, does extend higher than any other line in terms of lifetime giving. But due to overplotting, there is no way to tell how many nondonors lurk in the single dots that appear at the foot of every line and which indicate zero dollars given. This in no way resembles a cloud of points through which one can imagine a best-fit line being drawn. I can TELL you that a positive response to this question is strongly associated with high levels of lifetime giving, and it is, but you could be forgiven for remaining unconvinced by this “evidence”.
Even worse are the most common predictor variables: indicator variables, in binary form (0/1). For example, let’s say I express the condition of being ‘Married’ as a binary variable. A scatterplot of ‘Giving’ and ‘Married’ is even less useful than the one for the survey question:
Yuck! We’ve got 80,000 data points all jammed up in two solid lines at zero (not married) and 1 (married). We can’t tell from this plot, but it just so happens that the not-married line contains a lot of points sitting at lifetime giving of zero, far more than the married line. Being married is associated with giving, but who could tell from this? There’s no way this relationship is linear.
One of the tests for the appropriateness of using Pearson’s r is whether a scatterplot of the two variables looks like a “straight enough” line. Another test is that both variables are quantitative and continuous, not categorical (or ordinal) and discrete. The fact that I can take a categorical variable (Married) and re-express it as a number (0/1) makes no difference. Turning it into a number makes it possible to calculate Pearson’s r, but that doesn’t make it okay to do so.
So the textbooks tell us. What else do they have to say? Read on.
There are methods other than Pearson’s r which we can use to measure the degree of association between two variables, which do not require the presence of a linear relationship. One of these is Spearman’s Rho, which is the correlation between the ranks of two variables. Rho replaces the data values themselves with their ranks within each variable — so the lowest value in each variable becomes ‘1’, the next lowest becomes ‘2’, and so on — and then calculates to what degree those values are related between the two variables, either negatively or positively.
Spearman’s Rho is sometimes called Spearman Rank Correlation, which muddies the waters a bit as it implies that it’s a measure of correlation. It is, but the correlation is between the ranks, not the data values themselves. The bottom line is that a statistician would tell us that Spearman is the appropriate calculation for putting a value to the strength of association between two variables that are not linearly related but which may show a consistently increasing or decreasing trend. Unlike Pearson’s r, it is a nonparametric method — it is free of any requirement that the distribution of the variables look a certain way.
Spearman’s Rho has some special properties which I won’t get into, but overall it looks a lot like Pearson’s r. It can take on values between -1 and 1 (a perfect negative relationship to a perfect positive relationship, both called monotone relationships); values near zero indicate the absence of a relationship — just like Pearson’s r. And your stats software makes it equally easy to compute.
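The rank-then-correlate description can be sketched directly in Python. This is a minimal illustration, not a replacement for your stats software: it assumes tied values share the average of their ranks (a common convention), and the example data are invented. The helper names `average_ranks` and `spearman_rho` are my own.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# A relationship that always increases but is far from a straight line.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

print("Pearson r:   ", round(pearson_r(x, y), 3))
print("Spearman rho:", round(spearman_rho(x, y), 3))
```

Because y rises every time x rises, the ranks line up perfectly and rho reaches its maximum, while Pearson's r on the raw values falls short of 1.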
So, great. We’ve got Spearman’s Rho, which we are told is just the thing for analyzing the Likert scale variable I showed you earlier. What about the indicator variable (for ‘Married’)? Well, no, we’re told: You can’t use Pearson’s r or Spearman’s Rho to calculate correlation for categorical variables. For that, you need the point-biserial correlation coefficient.
That’s right, another measure of correlation. In fact, there are many types of measures out there. I’ve got a sheet in front of me right now that lists more than eight different measures, the choice of which depends on the combination of variables you’re analyzing (two continuous variables, one continuous and one ordinal variable, two binary variables, one binary and one ordinal, etc. etc.). And that list is not exhaustive.
Another wrinkle is that some texts don’t call these relationships “correlation” at all. By a strict definition, if a relationship between variables isn’t appropriate for Pearson’s r, then it ain’t correlation. We are supposed to call it by the more vague term “association.”
Hey, I’m cool with that. I’ll call it whatever you want. But what are the practical implications of all this?
As near as I can tell, ZERO.
Uh-huh, I’ve just made you read more than a thousand words on how complicated correlation is. Now I’m going to dismiss all of that with an imperial wave of my hand. I need you to ignore everything I’ve just said, for two reasons which I will elaborate on in a moment:
- As I stated at the outset, the only reason I calculate correlations is to explore which variables are most likely to figure prominently in a regression analysis. For a rough ranking of variables, Pearson’s r is a “good enough” tool.
- The correlation r is the basis of linear regression, which is our end-goal. Not Spearman, not point-biserial, nor any other measure.
Regarding point number one: We are not concerned about the precise value of the calculated association, only the approximate ranking of our variables. All we want to know is: which variables are probably most valuable for our model and should be added to the regression first? As it turns out, rankings using other measures of correlation (sorry — association) hardly vary from a Pearson’s r ranking. It’s extra bother for nothing.
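For the binary case, the agreement is more than approximate. The sketch below uses invented data and my own function names; it computes the point-biserial coefficient from its usual textbook formula (difference in group means, divided by the overall standard deviation, scaled by the group proportions) and compares it with Pearson's r calculated on the same 0/1 variable. The two are algebraically the same quantity, so they should match to within floating-point rounding.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def point_biserial(flag, y):
    """Textbook point-biserial formula for a 0/1 flag and a numeric y."""
    n = len(y)
    ones = [v for f, v in zip(flag, y) if f == 1]
    zeros = [v for f, v in zip(flag, y) if f == 0]
    mean_all = sum(y) / n
    sd_all = sqrt(sum((v - mean_all) ** 2 for v in y) / n)  # population SD
    mean_gap = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return mean_gap / sd_all * sqrt(len(ones) * len(zeros) / n ** 2)

# Hypothetical toy data: a 0/1 indicator and a giving amount.
married = [0, 0, 1, 1, 1, 1, 0, 1]
giving = [0, 0, 10, 25, 40, 100, 0, 60]

r_pearson = pearson_r(married, giving)
r_pb = point_biserial(married, giving)
print(f"Pearson r on 0/1 flag: {r_pearson:.6f}")
print(f"Point-biserial:        {r_pb:.6f}")
```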
And to point number two: There’s a real disconnect between what the textbooks say about correlation analysis and what they say about regression. It seems to me that if Pearson’s r is inappropriate for all but continuous, linearly-related variables, then we would also be told that only continuous, linearly-related variables can be used in regression. That’s not the case: Social-science researchers and modelers toss ordinal and binary variables into regressions with wild abandon. If we didn’t, we’d have almost nothing left to work with.
The disconnect is bridged with an explanation I found in one university stats textbook. It’s touched on only briefly, and towards the end of the book. The gist is this: For 0/1 indicator variables added to a linear regression, the regression coefficient is not the slope of a line, as we are always told to understand it. The indicator acts to vertically shift the line, so that instead of one regression line, we have two parallel lines: The unshifted line if the indicator variable is equal to zero, and another line shifted vertically up or down (depending on the sign of the coefficient) if the value is 1. This seems like essential information, but hardly rates any discussion at all. That’s stats for you.
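The vertical shift is easy to see in a small example. The sketch below is an illustration under stated assumptions, not the textbook's treatment: the data are invented and built exactly as 2 + 0.5 × age + 3 × married, and the helpers `solve_linear_system` and `ols` are my own small least-squares routines using the normal equations. The fitted gap between the married and not-married predictions should be the same at every age, and equal to the indicator's coefficient.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Made-up data constructed as y = 2 + 0.5 * age + 3 * married.
age = [25, 30, 35, 40, 45, 50, 55, 60]
married = [0, 1, 0, 1, 1, 0, 1, 0]
log_giving = [2 + 0.5 * a + 3 * m for a, m in zip(age, married)]

intercept, b_age, b_married = ols(list(zip(age, married)), log_giving)

def predict(a, m):
    """Fitted value for a given age and married flag."""
    return intercept + b_age * a + b_married * m

# Same slope for both groups; the indicator only moves the line up or down.
gap_at_30 = predict(30, 1) - predict(30, 0)
gap_at_60 = predict(60, 1) - predict(60, 0)
print(f"slope on age: {b_age:.3f}, shift for married: {b_married:.3f}")
print(f"gap at age 30: {gap_at_30:.3f}, gap at age 60: {gap_at_60:.3f}")
```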
- Don’t be sidetracked by warnings regarding measures of association/correlation that have specific uses but do not relate to your end goal: Building a regression model for the pragmatic purpose of making predictions.
- Most of the time, scatterplots can’t tell you what you need to know, because most of our data is categorical.
- Indicator variables (and ordinal variables) are materially different from the pretty, linearly-related variables, but they are absolutely OK for use in regression.
- SAY “association”, but DO “correlation”.
- Don’t feel bad if you’re having difficulty learning predictive modeling from a stats textbook. I can’t see how anyone could.