CoolData blog

14 March 2011

Correlation and you

Filed under: Correlation, Predictor variables, regression, Statistics — kevinmacdonell @ 7:25 am

If you read books and blogs on statistics, eventually your understanding of even the most basic concepts will start to smear. Things we think ought to be well-established by now are matters of controversy. Ranking high on the list of most slippery concepts is correlation. In today’s post, I’m going to make the concept very complicated, and then I’m going to dismiss the complexity and make it all simple again. I’m telling you this in advance so I don’t lose you partway through.

Correlation is the foundation of predictive modeling. The degree to which one variable, x, changes its value in relation to a second variable, y, either positively or negatively, is the very definition of x's usefulness in predicting the value of y. The tool I use to quantify the strength of that relationship is Pearson Product-Moment Correlation, or Pearson's r, which some statistics texts simply call "correlation". (It would help if you read the blog post on Pearson before reading this one.)

Before we go any farther, I need to explain why I want to quantify the strength of relationships. At the start of any modeling project, I have a hundred or so potential predictor variables in my data file. Some are going to be excellent predictors, most will be only so-so, and others will have little or no association with the outcome variable. I want to introduce variables into the regression analysis in an order that makes sense, so that the best predictors are added first. Due to the complexity of interactions among variables, there is no telling in advance what will actually happen as variables are added, so any list of variables ordered by strength of correlation is merely a rough guide.

What I DON’T use Pearson’s r for much is exploring variables. At the exploration stage, before modeling begins, I will look at a variable a number of ways in order to get a sense of how valuable the variable will be, and how I might want to transform it or re-express it to make it better. I will look at how the variable is distributed, and compare average and median giving between groups (for example, Home Phone Present, Y/N). These and other techniques, as described in Peter Wylie’s book “Data Mining for Fundraisers,” are simpler, more direct, and often more helpful than abstract measures of correlation.
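
If you work in Python rather than Data Desk, a minimal sketch of that kind of group comparison might look like the following. The file name and column names are placeholders, not references to any real data set.

```python
import pandas as pd

# Hypothetical data file and column names, purely for illustration.
df = pd.read_csv("alumni.csv")   # columns: 'lifetime_giving', 'home_phone_present' (0/1)

# Compare count, mean and median giving for each level of the indicator.
summary = df.groupby("home_phone_present")["lifetime_giving"].agg(["count", "mean", "median"])
print(summary)
```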

It’s only after I’ve done the exploration work and tweaked the variables for maximum effect that I’m ready to rank them in order by their correlation values. So, with that out of the way, let’s look at Pearson’s r in more detail.

The textbooks make it abundantly clear: Pearson’s r quantifies the relationship between two continuous variables that are linearly related. Right away, we’ve got a problem: Very few of the variables I work with are continuous, and most of the relationships I see do not meet the definition of “linear”. Yet, I use Pearson’s r exclusively. Does this mean I’ve been misusing and abusing the method?

I don’t see it that way, but it is interesting to read how these things are discussed in the literature.

You can assess whether a relationship between two variables is roughly linear by looking at a scatterplot of the variables. Below is a scatterplot of ‘Giving’ (log-transformed) and ‘Age’, created in Data Desk. It’s a big, messy cloud of points (some 80,000 of them!). A lot of the relationship is hidden by overplotting (the overlapping of points) along the bottom line — that row of points at the zero giving mark represents non-donors, and there are many more of them near the young end of that line than there are near the older end. Still, at least you can see a vague linear relationship in the upward fanning of the data: As age increases, so does lifetime giving. A best-fit line through the data would slope upward from left to right, and therefore the Pearson correlation value is high.
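
For readers who want to reproduce that kind of plot outside Data Desk, here is a rough sketch in Python, again with hypothetical column names; the alpha setting is one way to cope with the overplotting mentioned above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("alumni.csv")                 # hypothetical file with 'age' and 'lifetime_giving'
log_giving = np.log1p(df["lifetime_giving"])   # log(1 + giving) keeps non-donor zeros defined

plt.scatter(df["age"], log_giving, s=2, alpha=0.1)   # low alpha reduces overplotting
plt.xlabel("Age")
plt.ylabel("Lifetime giving (log)")
plt.show()

print(df["age"].corr(log_giving))              # Pearson's r for the same pair
```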

We’ve got two continuous variables, and a linear relationship, so we get the Statistician’s Seal of Approval: It’s okay to use Pearson’s r to measure strength of correlation between these two variables.

Unfortunately, as I said, most of the variables we use in predictive modeling are not continuous, and they don’t look like much of anything in a scatterplot. Here’s a scatterplot of a Likert-scale survey response. The survey question asked alumni how likely they are to donate to alma mater, and the scale runs from 1 to 5, with 5 being “very likely.” This is not a continuous variable, because there are no possible intermediate values among the five levels. It’s ordinal. The plot with Lifetime Giving is difficult to interpret, but it sure isn’t linear:

Yes, the line of points for the highest response, 5, does extend higher than any other line in terms of lifetime giving. But due to overplotting, there is no way to tell how many nondonors lurk in the single dots that appear at the foot of every line and which indicate zero dollars given. This in no way resembles a cloud of points through which one can imagine a best-fit line being drawn. I can TELL you that a positive response to this question is strongly associated with high levels of lifetime giving, and it is, but you could be forgiven for remaining unconvinced by this “evidence”.

Even worse are the most common predictor variables: indicator variables, in binary form (0/1). For example, let’s say I express the condition of being ‘Married’ as a binary variable. A scatterplot of ‘Giving’ and ‘Married’ is even less useful than the one for the survey question:

Yuck! We’ve got 80,000 data points all jammed up in two solid lines at 0 (not married) and 1 (married). We can’t tell from this plot, but it just so happens that the not-married line contains a lot of points sitting at lifetime giving of zero, far more than the married line. Being married is associated with giving, but who could tell from this? There’s no way this relationship is linear.

One of the tests for the appropriateness of using Pearson’s r is whether a scatterplot of the two variables looks like a “straight enough” line. Another test is that both variables are quantitative and continuous — not categorical (or ordinal) and discrete. The fact that I can take a categorical variable (Married) and re-express it as a number (0/1) makes no difference. Turning it into a number makes it possible to calculate Pearson’s r, but that doesn’t make it okay to do so.

So the textbooks tell us. What else do they have to say? Read on.

There are methods other than Pearson’s r which we can use to measure the degree of association between two variables, which do not require the presence of a linear relationship. One of these is Spearman’s Rho, which is the correlation between the ranks of two variables. Rho replaces the data values themselves with their ranks within each variable — so the lowest value in each variable becomes ‘1’, the next lowest becomes ‘2’, and so on — and then calculates to what degree those values are related between the two variables, either negatively or positively.

Spearman’s Rho is sometimes called Spearman Rank Correlation, which muddies the waters a bit as it implies that it’s a measure of correlation. It is, but the correlation is between the ranks, not the data values themselves. The bottom line is that a statistician would tell us that Spearman is the appropriate calculation for putting a value to the strength of association between two variables that are not linearly related but which may show a consistently increasing or decreasing trend. Unlike Pearson’s r, it is a nonparametric method — it is free of any requirement that the distribution of the variables look a certain way.

Spearman’s Rho has some special properties which I won’t get into, but overall it looks a lot like Pearson’s r. It can take on values between -1 and 1 (a perfect negative relationship to a perfect positive relationship, both called monotone relationships); values near zero indicate the absence of a relationship — just like Pearson’s r. And your stats software makes it equally easy to compute.

So, great. We’ve got Spearman’s Rho, which we are told is just the thing for analyzing the Likert scale variable I showed you earlier. What about the indicator variable (for ‘Married’)? Well, no, we’re told: You can’t use Pearson or Spearman’s to calculate correlation for categorical variables. For that, you need the point-biserial correlation coefficient.

Huh?

That’s right, another measure of correlation. In fact, there are many types of measures out there. I’ve got a sheet in front of me right now that lists more than eight different measures, the choice of which depends on the combination of variables you’re analyzing (two continuous variables, one continuous and one ordinal variable, two binary variables, one binary and one ordinal, etc. etc.). And that list is not exhaustive.
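
For what it's worth, the measures named so far are each one line in scipy. The snippet below runs them on made-up numbers, purely to show the calls; note that the point-biserial coefficient is mathematically the same thing as Pearson's r computed on a 0/1 variable, which hints at where this discussion is headed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
giving = np.round(np.expm1(rng.normal(2, 1, 500)), 2)   # made-up lifetime-giving amounts
likert = rng.integers(1, 6, 500)                        # made-up 1-5 survey responses
married = rng.integers(0, 2, 500)                       # made-up 0/1 indicator

r, _ = stats.pearsonr(likert, giving)          # Pearson's r
rho, _ = stats.spearmanr(likert, giving)       # Spearman's rho (correlation of ranks)
pb, _ = stats.pointbiserialr(married, giving)  # point-biserial, for a 0/1 variable

# pointbiserialr(married, giving) and pearsonr(married, giving) return the same coefficient.
print(round(r, 3), round(rho, 3), round(pb, 3))
```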

Another wrinkle is that some texts don’t call these relationships “correlation” at all. By a strict definition, if a relationship between variables isn’t appropriate for Pearson’s r, then it ain’t correlation. We are supposed to call it by the more vague term “association.”

Hey, I’m cool with that. I’ll call it whatever you want. But what are the practical implications of all this?

As near as I can tell, ZERO.

Uh-huh, I’ve just made you read more than a thousand words on how complicated correlation is. Now I’m going to dismiss all of that with an imperial wave of my hand. I need you to ignore everything I’ve just said, for two reasons which I will elaborate on in a moment:

  1. As I stated at the outset, the only reason I calculate correlations is to explore which variables are most likely to figure prominently in a regression analysis. For a rough ranking of variables, Pearson’s r is a “good enough” tool.
  2. The correlation r is the basis of linear regression, which is our end-goal. Not Spearman, not point-biserial, nor any other measure.

Regarding point number one: We are not concerned about the precise value of the calculated association, only the approximate ranking of our variables. All we want to know is: which variables are probably most valuable for our model and should be added to the regression first? As it turns out, rankings using other measures of correlation (sorry — association) hardly vary from a Pearson’s r ranking. It’s extra bother for nothing.
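
If you want to check that claim against your own file, a sketch like this one (hypothetical column names) ranks every predictor by Pearson and by Spearman side by side so you can compare the orderings yourself:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("alumni.csv")     # hypothetical file: predictors plus 'lifetime_giving'
y = np.log1p(df["lifetime_giving"])
predictors = [c for c in df.select_dtypes("number").columns if c != "lifetime_giving"]

ranking = pd.DataFrame({
    "pearson": [abs(df[p].corr(y, method="pearson")) for p in predictors],
    "spearman": [abs(df[p].corr(y, method="spearman")) for p in predictors],
}, index=predictors)

# Compare the two orderings; large differences in rank should be the exception, not the rule.
ranking["pearson_rank"] = ranking["pearson"].rank(ascending=False)
ranking["spearman_rank"] = ranking["spearman"].rank(ascending=False)
print(ranking.sort_values("pearson", ascending=False).round(3))
```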

And to point number two: There’s a real disconnect between what the textbooks say about correlation analysis and what they say about regression. It seems to me that if Pearson’s r is inappropriate for all but continuous, linearly related variables, then we would also be told that only continuous, linearly related variables can be used in regression. That’s not the case: Social-science researchers and modelers toss ordinal and binary variables into regressions with wild abandon. If we didn’t, we’d have almost nothing left to work with.

The disconnect is bridged with an explanation I found in one university stats textbook. It’s touched on only briefly, and towards the end of the book. The gist is this: For a 0/1 indicator variable added to a linear regression, the coefficient is not the slope of a line, as we are usually taught to interpret coefficients. The indicator acts to shift the line vertically, so that instead of a single regression line we get two parallel lines: the unshifted line when the indicator variable equals zero, and a line shifted vertically up or down (depending on the sign of the coefficient) when the value is 1. This seems like essential information, but hardly rates any discussion at all. That’s stats for you.
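
Here is a tiny illustration of that intercept-shift idea on fabricated numbers (everything in it is made up, column names included); the fitted model has one slope for age and a vertical offset for the indicator.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
age = rng.uniform(22, 80, 1000)
married = rng.binomial(1, 0.6, 1000)
# Fabricated "log giving": a slope on age, a vertical shift for 'married', plus noise.
log_giving = 0.03 * age + 0.8 * married + rng.normal(0, 0.5, 1000)

X = sm.add_constant(np.column_stack([age, married]))
fit = sm.OLS(log_giving, X).fit()
const, b_age, b_married = fit.params

# One slope (b_age) for everyone; the indicator's coefficient only moves the intercept:
#   not married:  predicted log giving = const + b_age * age
#   married:      predicted log giving = (const + b_married) + b_age * age
print(np.round(fit.params, 3))
```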

To summarize:

  1. Don’t be sidetracked by warnings regarding measures of association/correlation that have specific uses but do not relate to your end goal: Building a regression model for the pragmatic purpose of making predictions.
  2. Most of the time, scatterplots can’t tell you what you need to know, because most of our data is categorical.
  3. Indicator variables (and ordinal variables) are materially different from the pretty, linearly-related variables, and they are absolutely OK for use in regression.
  4. SAY “association”, but DO “correlation”.
  5. Don’t feel bad if you’re having difficulty learning predictive modeling from a stats textbook. I can’t see how anyone could.

3 September 2010

More on making hay from variables that interact

A short while ago I wrote about pairs of predictor variables that are highly correlated with each other, i.e. that have strong interactions in regression analysis. (Making hay when predictor variables interact.) The example I used was Position Present and Employer Name Present. Instead of using one and throwing the other away as redundant, you can combine them to form a new variable with more predictive power than either of the original two on their own.

In this post, I’ll show you how to identify other likely pairs of variables from which you can try to make similar “combination variables.”

When independent variables in a regression interact with each other, that is, when they are correlated with each other, it’s called multicollinearity. (You can read a good discussion of multicollinearity on the Stats Make Me Cry blog: Top Ten Confusing Stats Terms Explained in Plain English, #9: Multicollinearity.) Position Present and Employer Name Present is an obvious example, but all kinds of subtle combinations are possible and difficult to foresee. We need to call on some help in detecting the interaction. That help is provided by Pearson’s Product-Moment Correlation, also known as Pearson’s r. I’ve written about Pearson’s r before.

In a nutshell, Pearson’s r calculates a number that describes the strength of linear correlation between any two variables. Your stats software makes this easy. In Data Desk, I select all the icons of the variables I want to calculate correlations for, then find Pearson’s r in the menu. The result is a new window containing a table full of values. If you include many variables at once, this table will be massive. The table below is a Pearson table based on some real data from a university. The table works exactly like those distance tables you find on old tourist highway maps (they don’t seem to make those anymore — wonder why); to find the distance from, say, Albuquerque to Santa Fe, you’d find the number at the intersection of the Albuquerque column with the Santa Fe row, and that would be the number of miles to travel. In the table below, the cities are variables and the mileage is Pearson’s r.
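
Data Desk builds the table from the menu; if you happen to work in Python instead, a comparable matrix takes a couple of lines of pandas (the file and column names below are placeholders):

```python
import pandas as pd

df = pd.read_csv("alumni.csv")   # hypothetical file containing the 0/1 and numeric predictors

# One Pearson's r for every pair of numeric columns -- the same kind of "mileage table".
corr_table = df.select_dtypes("number").corr(method="pearson")
print(corr_table.round(3))

# Read it like the highway map: the value at row 'employer_present',
# column 'business_phone_present' is the r for that pair.
```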

Don’t be intimidated by all the numbers! Just let your eye wander over them. Notice that some are positive, some negative. The negative sign simply means that the correlation between the two variables is negative. Notice also that most of the numbers are small, less than 0.1 (or minus 0.1). As far as multicollinearity is concerned, we’re most interested in the bigger values, i.e. values that are furthest from zero (no correlation) and closest to 1 or -1 (perfect positive or negative linear correlation).

I realize some of the variable names will be a bit mysterious, but you might be able to guess that “Number deg” is Number of Degrees and that “Grad Y” means Graduated. Their Pearson’s r value (0.26) is one of the higher correlations, which makes sense, right?

Noticing certain correlations can teach you things about the data. ‘Female’ is correlated with ‘Class Year’ — because at this university, males outnumbered females years ago, but since the 1980s, females have outnumbered males by an ever-increasing factor. On the other hand, ‘Number HC’ (campus reunions attended) is negatively correlated with ‘Class Year’: older alumni have attended more events (no big surprise), and young alumni are not big on reunions at this institution.

Look at ‘Business Phone Present’ and ‘Employer Present’. Their r value is relatively high (0.376). I would test some variations of those two. You could add them together, so that the new variable could range from 0 to 2. Or you could multiply them, to give you a binary variable that would have a value of 1 only if both of the original variables were 1. You might end up with a predictor that is more highly correlated with ‘Giving’ than either of the original two variables.
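
As a sketch of those two combinations (the column names here are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("alumni.csv")            # hypothetical file: 0/1 columns plus 'lifetime_giving'
y = np.log1p(df["lifetime_giving"])

# Two ways to combine a correlated pair of 0/1 indicators:
df["emp_info_sum"] = df["business_phone_present"] + df["employer_present"]    # 0, 1 or 2
df["emp_info_both"] = df["business_phone_present"] * df["employer_present"]   # 1 only if both are 1

for col in ["business_phone_present", "employer_present", "emp_info_sum", "emp_info_both"]:
    print(col, round(df[col].corr(y), 3))   # does either combination beat the originals?
```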

With non-binary variables such as ‘Class Year’ and ‘Number Events Attended’, the results of combining will be even more varied and interesting. What you do is up to you; there’s no harm in trying. When you’re done playing, just rank all your old and new variables in order by the absolute value of their strength of linear correlation with your predicted value (say, Giving), and see how the new variables fare.

9 June 2010

Why multiple regression?

Filed under: Model building, regression, Statistics — kevinmacdonell @ 6:11 am

Not long ago I wrote about Pearson’s r, also known as Pearson’s Product-Moment Correlation Coefficient. This is a convenient statistical tool available in any stats software program (Excel can calculate it too) that yields a numerical measure of the strength of the correlation (linear dependence) between any two variables, X and Y.

I will show you how you can easily create a predictive score using only Pearson’s r — and why you probably shouldn’t!

Pearson’s r points the way toward weighting your predictor variables, according to how strongly correlated with your predicted value they actually are. If you assume all your predictor variables are worth the same (“1” for positive predictors and “-1” for negative predictors) you are imposing a subjective weighting on your variables. (I wrote about this limitation of the simple score in my most recent post, Beyond the simple score.)

Is Homecoming attendance more or less predictive than presence of an email address? Pearson’s r will tell you. Look at the example in the table above. Ten common predictor variables are listed in order of their strength of correlation with Lifetime Giving (log-transformed). It just so happens that the two highest correlations are negative; even though they are negative, their absolute values are larger than any of the others in the list, so I put them at the top. According to these values, “Email present” is relatively weak compared with “Number of Homecomings attended.”

Do you see where this is going? If you wanted to, you could use these correlations to directly create weighted scores for everyone in your database. Just multiply each variable by its Pearson value, sum up the products, and bingo — there’s your score.
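
Mechanically, that naive Pearson-weighted score could be put together like this (a sketch only, with hypothetical file and column names):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("alumni.csv")                       # hypothetical data file
y = np.log1p(df["lifetime_giving"])
predictors = ["email_present", "homecomings", "class_year", "single",
              "employer_present", "position_present"]   # placeholder names

weights = {p: df[p].corr(y) for p in predictors}     # Pearson's r of each predictor with giving

# Naive weighted score: each variable times its correlation with giving, summed.
df["naive_score"] = sum(df[p] * w for p, w in weights.items())
print(df["naive_score"].describe())
```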

You could do that, but I don’t think you should.

At least two of the predictor variables we want to use in our score are very closely related to each other: Employer present and Position present. They aren’t exactly alike: For some constituents you will have one piece of information and not the other. But on the whole, if one is present in your database for any given constituent, chances are you’ve got the other as well.

In other words, if you include both variables in your score, you’re double-counting the effect of employment information in your model — despite the fact that each is properly weighted by Pearson score. The reason is that Pearson’s r treats only two variables at a time: X and Y. It does NOT account for any interactions between multiple Xs.

Employment variables are only an obvious example. All of your variables will interact with each other to some degree, some strongly, others more subtly. By “interact with each other” I mean “correlate with each other.” In fact, we can use Pearson’s r to show which combinations of predictor variables are strongly correlated. The table below lists three variable pairs, drawn from real data, that exhibit strong interactions — including the employment example we’ve just mentioned.

The Pearson value for the employment variables is very close to 1, which indicates a nearly perfect positive correlation. The other two are more subtle, but make sense as well: Younger alumni will tend to be coded as Single in your database, and if we have a job title for someone, chances are we’ll also have a business phone number.

This overlapping of the explanatory effect of various X’s on Y will interfere with our ability to properly weight our predictors. Pearson’s Product-Moment Correlation Coefficient is important for understanding our variables, but not quite up to the task of directly creating predictive scores. What now?

Well — multiple regression! Only regression will account for interactions among our predictor Xs, recalculating the coefficients (weightings) on the fly each time we add a new predictor variable. Working from the Pearson list at the top of this post, we would add Class year, Single, and Employer present to our regression window one by one. Everything would be fine up to that point; our p-values will be very low for these variables. When we add Position present, however, the p-value will be too high (0.183, which exceeds the rule-of-thumb value of 0.05), and R squared will fail to improve. We would therefore leave Position present out of the regression because it isn’t adding any new predictive information to the model and might interfere with the effectiveness of other variables in subtle and strange ways.
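
In Python, the same add-one-variable-at-a-time process might be sketched like this with statsmodels (hypothetical column names; the specific p-value quoted above comes from the author's own data and won't reproduce here):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("alumni.csv")                  # hypothetical data file
y = np.log1p(df["lifetime_giving"])

kept = []
for var in ["class_year", "single", "employer_present", "position_present"]:
    X = sm.add_constant(df[kept + [var]])
    fit = sm.OLS(y, X).fit()
    p = fit.pvalues[var]
    print(f"{var}: p = {p:.3f}, R-squared = {fit.rsquared:.3f}")
    if p < 0.05:          # rule-of-thumb cutoff mentioned in the post
        kept.append(var)  # keep it; otherwise leave it out of the model
```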

Often when I use the word “regression” on someone, what I see reflected back in their eyes is fear. (I really need to reserve that word for people I don’t like.) I wish, though, that people could see that regression is a bit like an automobile: A complex machine with many moving parts, but familiar and approachable, with a simple and comprehensible purpose, and above all operable by just about anyone.

28 April 2010

Pearson product-moment correlation coefficient

Filed under: Data Desk, Model building, regression, Statistics — kevinmacdonell @ 8:54 am

You can’t have a serious blog post related to statistics without tossing in the name of a dead white guy. How dead? Well, yesterday was the 74th anniversary of Karl Pearson’s death in Surrey, England. How white? Pearson was a fan of eugenics, social Darwinism, and the “struggle of race with race” – with the supposedly best race winning. Charming! Like him or not, his contribution to statistics (and therefore science) was huge.

I’ve written a bit about multiple linear regression, but not a great deal about how to do it. After I’ve got my data file ready and before I open up a regression window in Data Desk, ranking my predictor variables using ol’ Pearson’s measure of correlation is Step One. The object is to rank your predictor variables (a.k.a. independent variables) according to the strength (either positive or negative) of their linear correlation with your predicted value (a.k.a. dependent variable). I do this in order to determine the order in which I will introduce my predictor variables into my regression analysis. (All of the following discussion assumes that your variables are numerical; either continuous variables such as ‘class year’, or 0/1 indicator variables you’ve created from your categorical variables.)

The tool I’ve been taught to use is Pearson’s Product-Moment Correlation Coefficient, also called Pearson’s r. This is a quantitative tool which yields a coefficient describing the strength and direction of the linear relationship between your Y variable (say, ‘giving’) and one of your X’s (say, ‘class year’). A value of 1 denotes a perfect linear correlation in a positive direction, and a value of -1 is a perfect negative correlation. All possible values fall between -1 and 1. Values near zero denote the absence of a linear correlation; the variables might be related in some other way, but not linearly.


A scatterplot of the two variables ‘giving’ and ‘class year’ will reveal the relationship visually: ‘giving’ tends to decrease as ‘class year’ increases. This negative linear relationship is indicated by the downward-sloping best-fit line through the points. The Pearson correlation reveals the same thing, but not visually: it puts an actual number to the strength and direction of that relationship. Why is this important? Because for many of your variables, a scatterplot is just going to look like a mess – the linear relationship is in there somewhere, but it’s not evident from the cloud of points. If you have a calculated value instead, you can easily decide which linear relationships demand priority attention.

It’s easy to do in Data Desk. Just select the icon for ‘giving’ as your Y and also select all of your predictor variables (x), then go to Calc in the menu. Select Correlations, then Pearson Product-Moment. If you have a lot of variables, the resulting table will be impressively large. It will look fearsome or beautiful, depending on how you feel about being faced with a wall of numbers. (I think it’s gorgeous.) To find the value that relates one variable to another, find the intersection of the row and column of the two variables. For example, in the table below, the Pearson correlation value for ‘giving’ and ‘class year’ is -0.460. (The correlation of any variable with itself is, of course, a perfect 1.)

Have a look around this table. Don’t be concerned about the actual values. Just see which values are higher than others. For instance, look at the intersection of “position present” and “employer present”. It’s a very high value: 0.812, which is very close to 1! This tells us that these two predictor variables are going to “interact” with each other when we bring them into the regression analysis. It makes sense: Job title and employer name are likely to be present or absent in tandem, although not perfectly. The practical result is that one of these variables will prove to be a significant predictor, while the other adds little or nothing new, and will be left out.

So how do I decide which variable gets added before the other? It’s simple.

The only part of the whole Pearson table that we’re interested in is the column of values under the heading ‘giving’. Data Desk allows us to copy the table as text and paste it into Excel. When I do this and strip out all the stuff I’m not interested in, the result looks like this. (I’ve re-sorted the variables alphabetically.)

Next, I sort the variables according to their Pearson correlation with Giving. The variables with the highest values will head the list. But notice a small problem: The strongest NEGATIVE variables end up at the very bottom. Really, with its high correlation with ‘giving’, the class year variable should rank first. So I do one extra step, creating a column with an Excel formula for the absolute value of the Pearson coefficient (i.e. without the minus sign), and re-sort on that value.
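
That copy-to-Excel-and-sort-on-absolute-value step can also be done straight from the correlation table in code; a sketch, once more with placeholder names:

```python
import pandas as pd

df = pd.read_csv("alumni.csv")                        # hypothetical data file
corr = df.select_dtypes("number").corr()              # the full Pearson table

giving_col = corr["log_giving"].drop("log_giving")    # keep only the column for the dependent variable
order = giving_col.abs().sort_values(ascending=False).index
print(giving_col.reindex(order).round(3))             # strongest correlations first, sign preserved
```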

This gives me a clear idea of the order in which I should add variables to the regression. For example, ’employer present’ seems to edge out ‘position present’. Due to variable interaction, though, the final roster of which variables will stay and which will go is NOT evident at this point. The proof is in the regression, where all sorts of interesting and unforeseeable interactions may crop up.

You don’t have to take this manual approach to adding your variables – your software probably offers an automated, or partially automated, method called stepwise regression. But after all the work of preparing my predictors, I enjoy watching the way they interact with each other as I work through training the model. The way I see it, the more hands-on you are with your analysis, the more you learn about your data.

Final note: The examples above actually use a transformed value of ‘giving’ – the log of giving. Transforming our dependent variable using a logarithmic function is a perfectly valid way to make the linear relationships among variables much more evident. (Why we transform variables is explained more fully here.) If we used ‘giving’ just as it is, the Pearson values would be very low, which would indicate only a very weak linear relationship. Even ‘class year’ would have a low value, which we know isn’t a good description of reality; the scatterplot above tells a truer story.
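
A quick way to see the effect of the transformation for yourself (hypothetical column names):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("alumni.csv")                   # hypothetical data file
raw = df["lifetime_giving"]
logged = np.log1p(raw)                           # log(1 + giving), so zero gifts stay defined

print("raw giving:   ", round(df["class_year"].corr(raw), 3))
print("log of giving:", round(df["class_year"].corr(logged), 3))
```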
