CoolData blog

14 March 2011

Correlation and you

Filed under: Correlation, Predictor variables, regression, Statistics — kevinmacdonell @ 7:25 am

If you read books and blogs on statistics, eventually your understanding of even the most basic concepts will start to smear. Things we think ought to be well-established by now are matters of controversy. Ranking high on the list of most slippery concepts is correlation. In today’s post, I’m going to make the concept very complicated, and then I’m going to dismiss the complexity and make it all simple again. I’m telling you this in advance so I don’t lose you partway through.

Correlation is the foundation of predictive modeling. The degree to which one variable, x, changes its value in relation to a second variable, y, either positively or negatively, is the very definition of x's usefulness in predicting the value of y. The tool I use to quantify the strength of that relationship is Pearson Product-Moment Correlation, or Pearson's r, which some statistics texts simply call "correlation". (It would help if you read the blog post on Pearson before reading this one.)

Before we go any farther, I need to explain why I want to quantify the strength of relationships. At the start of any modeling project, I have a hundred or so potential predictor variables in my data file. Some are going to be excellent predictors, most will be only so-so, and others will have little or no association with the outcome variable. I want to introduce variables into the regression analysis in an order that makes sense, so that the best predictors are added first. Due to the complexity of interactions among variables, there is no telling in advance what will actually happen as variables are added, so any list of variables ordered by strength of correlation is merely a rough guide.

What I DON’T use Pearson’s r for much is exploring variables. At the exploration stage, before modeling begins, I will look at a variable a number of ways in order to get a sense of how valuable the variable will be, and how I might want to transform it or re-express it to make it better. I will look at how the variable is distributed, and compare average and median giving between groups (for example, Home Phone Present, Y/N). These and other techniques, as described in Peter Wylie’s book “Data Mining for Fundraisers,” are simpler, more direct, and often more helpful than abstract measures of correlation.
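If you like to poke at the data outside of Data Desk, here is a minimal sketch of that group comparison in Python with pandas. The file name and the column names ('home_phone_present', 'lifetime_giving') are hypothetical stand-ins for whatever is in your own extract.

```python
# A minimal sketch of the exploration step described above.
# File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("alumni.csv")  # assumed: one row per alum

# Compare count, average, and median lifetime giving between the two groups
summary = df.groupby("home_phone_present")["lifetime_giving"].agg(
    ["count", "mean", "median"]
)
print(summary)
```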

It’s only after I’ve done the exploration work and tweaked the variables for maximum effect that I’m ready to rank them in order by their correlation values. So, with that out of the way, let’s look at Pearson’s r in more detail.

The textbooks make it abundantly clear: Pearson’s r quantifies the relationship between two continuous variables that are linearly related. Right away, we’ve got a problem: Very few of the variables I work with are continuous, and most of the relationships I see do not meet the definition of “linear”. Yet, I use Pearson’s r exclusively. Does this mean I’ve been misusing and abusing the method?

I don’t see it that way, but it is interesting to read how these things are discussed in the literature.

You can assess whether a relationship between two variables is roughly linear by looking at a scatterplot of the variables. Below is a scatterplot of ‘Giving’ (log-transformed) and ‘Age’, created in Data Desk. It’s a big, messy cloud of points (some 80,000 of them!). A lot of the relationship is hidden by overplotting (the overlapping of points) along the bottom line — that row of points at the zero giving mark represents non-donors, and there are many more of them near the young end of that line than there are near the older end. Still, at least you can see a vague linear relationship in the upward fanning of the data: As age increases, so does lifetime giving. A best-fit line through the data would slope upward from left to right, and therefore the Pearson correlation value is high.
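Data Desk draws this plot for you, but if you want to reproduce something like it in code, here is a rough sketch using matplotlib. Making the points small and mostly transparent is one way to tame the overplotting; the column names are assumptions.

```python
# A sketch of a giving-vs-age scatterplot with transparency to soften
# overplotting. Column names are hypothetical.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("alumni.csv")
log_giving = np.log1p(df["lifetime_giving"])  # log(1 + x): zero-dollar records stay at zero

plt.scatter(df["age"], log_giving, s=4, alpha=0.05)  # small, mostly transparent points
plt.xlabel("Age")
plt.ylabel("Lifetime giving (log scale)")
plt.title("Giving vs. age")
plt.show()
```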

We’ve got two continuous variables, and a linear relationship, so we get the Statistician’s Seal of Approval: It’s okay to use Pearson’s r to measure strength of correlation between these two variables.

Unfortunately, as I said, most of the variables we use in predictive modeling are not continuous, and they don’t look like much of anything in a scatterplot. Here’s a scatterplot of a Likert-scale survey response. The survey question asked alumni how likely they are to donate to alma mater, and the scale runs from 1 to 5, with 5 being “very likely.” This is not a continuous variable, because there are no possible intermediate values among the five levels. It’s ordinal. The plot with Lifetime Giving is difficult to interpret, but it sure isn’t linear:

Yes, the line of points for the highest response, 5, does extend higher than any other line in terms of lifetime giving. But due to overplotting, there is no way to tell how many nondonors lurk in the single dots that appear at the foot of every line and which indicate zero dollars given. This in no way resembles a cloud of points through which one can imagine a best-fit line being drawn. I can TELL you that a positive response to this question is strongly associated with high levels of lifetime giving, and it is, but you could be forgiven for remaining unconvinced by this “evidence”.

Even worse are the most common predictor variables: indicator variables, in binary form (0/1). For example, let’s say I express the condition of being ‘Married’ as a binary variable. A scatterplot of ‘Giving’ and ‘Married’ is even less useful than the one for the survey question:

Yuck! We’ve got 80,000 data points all jammed up in two solid lines at zero (not married) and 1 (married). We can’t tell from this plot, but it just so happens that the not-married line contains a lot of points sitting at lifetime giving of zero, far more than the married line. Being married is associated with giving, but who could tell from this? There’s no way this relationship is linear.

One of the tests for the appropriateness of using Pearson’s r is whether a scatterplot of the variable looks like a “straight enough” line. Another test is that both variables are quantitative and continuous — not categorical (or ordinal) and discrete. The fact that I can take a categorical variable (Married) and re-express it as a number (0/1) makes no difference. Turning it into a number makes it possible to calculate Pearson’s r, but that doesn’t make it okay to do so.

So the textbooks tell us. What else do they have to say? Read on.

There are methods other than Pearson’s r for measuring the degree of association between two variables, methods which do not require a linear relationship. One of these is Spearman’s Rho, which is the correlation between the ranks of two variables. Rho replaces the data values themselves with their ranks within each variable — so the lowest value in each variable becomes ‘1’, the next lowest becomes ‘2’, and so on — and then calculates to what degree those ranks are related between the two variables, either negatively or positively.

Spearman’s Rho is sometimes called Spearman Rank Correlation, which muddies the waters a bit as it implies that it’s a measure of correlation. It is, but the correlation is between the ranks, not the data values themselves. The bottom line is that a statistician would tell us that Spearman is the appropriate calculation for putting a value to the strength of association between two variables that are not linearly related but which may show a consistently increasing or decreasing trend. Unlike Pearson’s r, it is a nonparametric method — it is free of any requirement that the distribution of the variables look a certain way.

Spearman’s Rho has some special properties which I won’t get into, but overall it looks a lot like Pearson’s r. It can take on values between -1 and 1 (a perfect negative relationship to a perfect positive relationship, both called monotone relationships); values near zero indicate the absence of a relationship — just like Pearson’s r. And your stats software makes it equally easy to compute.
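For the curious, here is a minimal sketch that computes both measures on the Likert-type survey variable described above, using scipy. The column names are hypothetical.

```python
# A sketch comparing Pearson's r and Spearman's rho on an ordinal (Likert)
# variable against lifetime giving. Column names are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("alumni_survey.csv").dropna(subset=["likelihood_to_give", "lifetime_giving"])

r, _ = stats.pearsonr(df["likelihood_to_give"], df["lifetime_giving"])
rho, _ = stats.spearmanr(df["likelihood_to_give"], df["lifetime_giving"])
print(f"Pearson's r:    {r:.3f}")
print(f"Spearman's rho: {rho:.3f}")
```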

So, great. We’ve got Spearman’s Rho, which we are told is just the thing for analyzing the Likert scale variable I showed you earlier. What about the indicator variable (for ‘Married’)? Well, no, we’re told: You can’t use Pearson’s or Spearman’s to calculate correlation for categorical variables. For that, you need the point-biserial correlation coefficient.

Huh?

That’s right, another measure of correlation. In fact, there are many types of measures out there. I’ve got a sheet in front of me right now that lists more than eight different measures, the choice of which depends on the combination of variables you’re analyzing (two continuous variables, one continuous and one ordinal variable, two binary variables, one binary and one ordinal, etc. etc.). And that list is not exhaustive.
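If you want to see one of these in action, here is a small sketch comparing the point-biserial coefficient with plain Pearson's r on a 0/1 indicator. The column names are hypothetical. (It also hints at where I'm going next: on a variable coded 0/1, point-biserial is just Pearson's r under another name.)

```python
# A sketch showing the point-biserial coefficient next to plain Pearson's r
# on a 0/1 indicator ('married') against lifetime giving. Column names are
# hypothetical. The two numbers come out the same, because point-biserial is
# mathematically Pearson's r with one variable coded 0/1.
import pandas as pd
from scipy import stats

df = pd.read_csv("alumni.csv").dropna(subset=["married", "lifetime_giving"])

pb, _ = stats.pointbiserialr(df["married"], df["lifetime_giving"])
r, _ = stats.pearsonr(df["married"], df["lifetime_giving"])
print(f"Point-biserial: {pb:.3f}")
print(f"Pearson's r:    {r:.3f}")
```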

Another wrinkle is that some texts don’t call these relationships “correlation” at all. By a strict definition, if a relationship between variables isn’t appropriate for Pearson’s r, then it ain’t correlation. We are supposed to call it by the more vague term “association.”

Hey, I’m cool with that. I’ll call it whatever you want. But what are the practical implications of all this?

As near as I can tell, ZERO.

Uh-huh, I’ve just made you read more than a thousand words on how complicated correlation is. Now I’m going to dismiss all of that with an imperial wave of my hand. I need you to ignore everything I’ve just said, for two reasons which I will elaborate on in a moment:

  1. As I stated at the outset, the only reason I calculate correlations is to explore which variables are most likely to figure prominently in a regression analysis. For a rough ranking of variables, Pearson’s r is a “good enough” tool.
  2. The correlation r is the basis of linear regression, which is our end-goal. Not Spearman, not point-biserial, nor any other measure.

Regarding point number one: We are not concerned about the precise value of the calculated association, only the approximate ranking of our variables. All we want to know is: which variables are probably most valuable for our model and should be added to the regression first? As it turns out, rankings using other measures of correlation (sorry — association) hardly vary from a Pearson’s r ranking. It’s extra bother for nothing.
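Here is a sketch of that comparison in pandas: rank the candidate predictors by the absolute value of their Pearson correlations with the outcome, do the same with Spearman, and put the two rankings side by side. The column name 'log_giving' and the assumption that everything in the file is numeric are mine.

```python
# A sketch of the ranking comparison: Pearson vs. Spearman, ranked by
# absolute correlation with the outcome. Column names are hypothetical.
import pandas as pd

df = pd.read_csv("alumni.csv").select_dtypes("number")  # numeric columns only

pearson = df.corr(method="pearson")["log_giving"].drop("log_giving")
spearman = df.corr(method="spearman")["log_giving"].drop("log_giving")

ranking = pd.DataFrame({
    "pearson_rank": pearson.abs().rank(ascending=False),
    "spearman_rank": spearman.abs().rank(ascending=False),
})
print(ranking.sort_values("pearson_rank"))
```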

And to point number two: There’s a real disconnect between what the textbooks say about correlation analysis and what they say about regression. It seems to me that if Pearson’s r is inappropriate for all but continuous, linearly-related variables, then we would also be told that only continuous, linearly-related variables can be used in regression. That’s not the case: Social-science researchers and modelers toss ordinal and binary variables into regressions with wild abandon. If we didn’t we’d have almost nothing left to work with.

The disconnect is bridged with an explanation I found in one university stats textbook. It’s touched on only briefly, and towards the end of the book. The gist is this: For a 0/1 indicator variable added to a linear regression, the regression coefficient is not the slope of a line, as we are always told to understand it. The indicator acts to vertically shift the line, so that instead of one regression line, we have two parallel lines: the unshifted line when the indicator variable is equal to zero, and another line shifted vertically up or down (depending on the sign of the coefficient) when the value is 1. This seems like essential information, but it hardly rates any discussion at all. That’s stats for you.
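A tiny sketch makes the point concrete. It regresses a (hypothetical) log-giving variable on age plus a 0/1 'married' flag using statsmodels; the flag's coefficient shifts the intercept, giving two parallel lines.

```python
# A sketch of the "vertical shift" interpretation of a 0/1 indicator in a
# linear regression. Column names are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("alumni.csv")
X = sm.add_constant(df[["age", "married"]])
fit = sm.OLS(df["log_giving"], X, missing="drop").fit()

b0 = fit.params["const"]
b_age = fit.params["age"]
b_married = fit.params["married"]  # shifts the line up or down, same slope
print(f"Not married: predicted log giving = {b0:.2f} + {b_age:.3f} * age")
print(f"Married:     predicted log giving = {b0 + b_married:.2f} + {b_age:.3f} * age")
```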

To summarize:

  1. Don’t be sidetracked by warnings regarding measures of association/correlation that have specific uses but do not relate to your end goal: Building a regression model for the pragmatic purpose of making predictions.
  2. Most of the time, scatterplots can’t tell you what you need to know, because most of our data is categorical.
  3. Indicator variables (and ordinal variables) are materially different from the pretty, linearly-related variables, and they are absolutely OK for use in regression.
  4. SAY “association”, but DO “correlation”.
  5. Don’t feel bad if you’re having difficulty learning predictive modeling from a stats textbook. I can’t see how anyone could.

11 January 2011

Blackjack, and the predictor that will bust your hand

Filed under: Pitfalls, Predictor variables, regression — kevinmacdonell @ 6:29 am

This hand looks great, but not for this game.

When a new predictor variable emerges that seems too good to be true, it probably is. Fortunately, these examples of fool’s gold are fairly easy to detect and remove — if you’re careful.

I take a methodical approach to model-building, following the methods I’ve been taught. I choose not to use a semi-automated process that tosses all the predictor variables into the analysis at once and picks the best for you (stepwise regression). Rather, I add variables to the regression table one by one; I observe their effects on the p-values of previously-added variables, their effects on adjusted R-squared, and whether an effect is different than what I had expected (a negative coefficient instead of a positive one, or vice-versa).

It’s a bit like playing blackjack. You can stand, or you can ask to be hit with another variable. Unlike blackjack, if you do go over 21, you can always tell the dealer to take the last variable back and stop there.
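Here is a rough sketch of that hand-played game using statsmodels: add predictors one at a time in your chosen order, refit, and watch adjusted R-squared and the p-values of everything already on the table. The variable names and the order are hypothetical.

```python
# A sketch of the manual, one-variable-at-a-time approach to building a
# regression. Variable names and their order are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("alumni.csv")
ordered_predictors = ["class_year", "events_attended", "married", "business_phone_present"]

in_model = []
for var in ordered_predictors:
    in_model.append(var)
    X = sm.add_constant(df[in_model])
    fit = sm.OLS(df["log_giving"], X, missing="drop").fit()
    print(f"\nAfter adding {var}: adjusted R-squared = {fit.rsquared_adj:.3f}")
    print(fit.pvalues.round(4))
    # If the new variable adds little, or wrecks the p-values of earlier
    # ones, take the "card" back:
    # in_model.pop()
```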

My guide in choosing which variable to add next is its correlation coefficient (Pearson Product-Moment), a numerical value that indicates strength of linear correlation with my dependent variable. So I’m working with a deck of cards I’ve stacked in advance: All of my 80 or 90 predictors, sorted in descending order by strength of correlation (positive OR negative).

This deck is a suggestion only: There’s no telling in advance how a variable will interact with others already in the regression. A variable strongly correlated with Lifetime Giving may be left out of the model because it happens to also be highly correlated with other predictor variables added previously. It may add insufficient additional predictive value to the model to keep, and its inclusion might introduce unwanted noise in the scores I will output later.

In other words, returning to blackjack, every variable is an ace: Could be counted as 11, could be counted as only 1.

It’s always interesting, though, to see what variables are shuffled to the top of the deck, particularly if they’re new to you and you don’t know much about them. So it was with a variable I encountered recently: “Society Name”, from which I created a 0/1 indicator variable, “Society Name Present.” Its Pearson correlation with Lifetime Giving (log transformed) was 0.582, which is quite high, especially compared with the next value down, Number of Student Activities, with a value of 0.474.

I didn’t know what a “society name” was, but I went ahead anyway and entered it into a regression with LT Giving as the dependent variable. Right off the bat, my R squared shot up to well over 33%. In other words, this one variable was accounting for more than a third of the variability in the value of LT Giving.

Good news? NO! I busted!

For this sort of data set, the whole alumni population — donors AND non-donors, and all ages — the most powerful predictor is normally Age, or a proxy of age, such as Class Year, or some transformation of either of these. For a relatively obscure variable to suddenly claim top spot in a modeling scenario that I’m already so familiar with made it highly suspect.

I should have stopped there, but I continued developing the model. My final value for adjusted R squared was 57.5%, the highest R-squared value I’ve ever encountered.

What’s the harm, you ask? Well, it finally dawned on me, rather late, that this suspect variable was probably a proxy for “donor.” If you are trying to predict likelihood of becoming a donor, out of a pool that includes donors and non-donors, leaving this powerful, non-independent variable in the model will cause it to “predict” that your donors will become donors, and your non-donors won’t. This is not adding any intelligence to your prospecting.

Here’s where it pays to know your data. In this case, I didn’t. An inquiry led to the truth, that Society Name actually referred to a Gift Society name and was tied up with the characteristic of being either a donor or a prospective donor. But even without knowing this, I could have determined the variable’s unsuitability by asking myself two related questions:

  1. Do donors represent a majority of the cases that have one of the two values (0 or 1) of the variable?
  2. Are a majority of donors coded either 0 or 1 for this variable?

If the answer to either question is “yes,” then you may have a variable that is not sufficiently independent for a model intended to predict likelihood of being a donor. In this case, only 31% of alumni with a Society Name present are donors, so that’s OK. But more than 97% of donors have a Society Name. That’s suspicious.
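Both questions are one-liners if your data is in a table with 0/1 flags. A sketch, assuming hypothetical 'society_name_present' and 'is_donor' columns:

```python
# A sketch of the two leakage checks described above. Column names are
# hypothetical 0/1 flags.
import pandas as pd

df = pd.read_csv("alumni.csv")

# 1. Of the records with the flag set, what share are donors?
share_donors_given_flag = df.loc[df["society_name_present"] == 1, "is_donor"].mean()

# 2. Of the donors, what share have the flag set?
share_flag_given_donor = df.loc[df["is_donor"] == 1, "society_name_present"].mean()

print(f"Donors among flagged records: {share_donors_given_flag:.1%}")  # ~31% in the example
print(f"Flagged records among donors: {share_flag_given_donor:.1%}")   # ~97% -- suspicious
```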

As a rule of thumb, then, treat any Pearson correlation value of 0.5 and above with caution, ask the two related questions above, and talk to whoever controls the data entry for the field in question to find out how that data gets in there and what it’s for. Leaving the variable out will drag your R-squared down, but you’ll end up with a much better hand against the dealer.

3 September 2010

More on making hay from variables that interact

A short while ago I wrote about pairs of predictor variables that are highly correlated with each other, i.e. that have strong interactions in regression analysis. (Making hay when predictor variables interact.) The example I used was Position Present and Employer Name Present. Instead of using one and throwing the other away as redundant, you can combine them to form a new variable with more predictive power than either of the original two on their own.

In this post, I’ll show you how to identify other likely pairs of variables from which you can try to make similar “combination variables.”

When independent variables are strongly correlated with one another in a regression, it’s called multicollinearity. (You can read a good discussion of multicollinearity on the Stats Make Me Cry blog: Top Ten Confusing Stats Terms Explained in Plain English, #9: Multicollinearity.) Position Present and Employer Name Present are an obvious pair, but all kinds of subtle combinations are possible and difficult to foresee. We need some help in detecting these relationships. That help is provided by Pearson’s Product-Moment Correlation, also known as Pearson’s r. I’ve written about Pearson’s r before.

In a nutshell, Pearson’s r calculates a number that describes the strength of linear correlation between any two variables. Your stats software makes this easy. In DataDesk, I select all the icons of the variables I want to calculate correlations for, then find Pearson’s r in the menu. The result is a new window containing a table full of values. If you include many variables at once, this table will be massive. The Pearson table below is based on some real data from a university. The table works exactly like those distance tables you find on old tourist highway maps (they don’t seem to make those anymore — wonder why); to find the distance from, say, Albuquerque to Santa Fe, you’d find the number at the intersection of the Albuquerque column with the Santa Fe row, and that would be the number of miles to travel. In the table below, the cities are variables and the mileage is Pearson’s r.
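If your data lives in a flat file rather than Data Desk, the same table is one line of pandas. A sketch (assuming the numeric columns are already prepared):

```python
# A sketch of a full Pearson correlation matrix: read it like the distance
# table described above, row variable by column variable.
import pandas as pd

df = pd.read_csv("alumni.csv").select_dtypes("number")
corr = df.corr(method="pearson")
print(corr.round(3))
```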

Don’t be intimidated by all the numbers! Just let your eye wander over them. Notice that some are positive, some negative. The negative sign simply means that the correlation between the two variables is negative. Notice also that most of the numbers are small, less than 0.1 (or minus 0.1). As far as multicollinearity is concerned, we’re most interested in the bigger values, i.e. values that are furthest from zero (no correlation) and closest to 1 or -1 (perfect linear correlation).

I realize some of the variable names will be a bit mysterious, but you might be able to guess that “Number deg” is Number of Degrees and that “Grad Y” means Graduated. Their Pearson’s r value (0.26) is one of the higher correlations, which makes sense, right?

Noticing certain correlations can teach you things about the data. ‘Female’ is correlated with ‘Class (Year)’ — because at this university, males outnumbered females years ago, but since the 1980s, females have outnumbered males by an ever-increasing factor. On the other hand, ‘Number HC’ (campus reunions attended) is negatively correlated with ‘Class Year’ — older alumni have attended more events (no big surprise), but also young alumni are not big on reunions at this institution.

Look at ‘Business Phone Present’ and ‘Employer Present’. Their r value is relatively high (0.376). I would test some variations of those two. You could add them together, so that the new variable could range from 0 to 2. Or you could multiply them, to give you a binary variable that would have a value of 1 only if both of the original variables were 1. You might end up with a predictor that is more highly correlated with ‘Giving’ than either of the original two variables.
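Here is a sketch of both combinations, plus a quick check of how each one correlates with a (hypothetical) log-transformed giving variable:

```python
# A sketch of the two "combination variable" ideas described above.
# Column names are hypothetical.
import pandas as pd

df = pd.read_csv("alumni.csv")

# Sum: 0, 1, or 2 pieces of business contact information present
df["business_contact_score"] = df["business_phone_present"] + df["employer_present"]

# Product: 1 only when BOTH flags are 1
df["business_phone_and_employer"] = df["business_phone_present"] * df["employer_present"]

cols = ["business_phone_present", "employer_present",
        "business_contact_score", "business_phone_and_employer"]
print(df[cols + ["log_giving"]].corr()["log_giving"].drop("log_giving").round(3))
```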

With non-binary variables such as ‘Class Year’ and ‘Number Events Attended’, the results of combining will be even more varied and interesting. What you do is up to you; there’s no harm in trying. When you’re done playing, just rank all your old and new variables in order by the absolute value of their strength of linear correlation with your predicted value (say, Giving), and see how the new variables fare.

28 April 2010

Pearson product-moment correlation coefficient

Filed under: Data Desk, Model building, regression, Statistics — kevinmacdonell @ 8:54 am

You can’t have a serious blog post related to statistics without tossing in the name of a dead white guy. How dead? Well, yesterday was the 74th anniversary of Karl Pearson‘s death in Surrey, England. How white? Pearson was a fan of eugenics, social Darwinism, and the “struggle of race with race” – with the supposedly best race winning. Charming! Like him or not, his contribution to statistics (and therefore science) was huge.

I’ve written a bit about multiple linear regression, but not a great deal about how to do it. After I’ve got my data file ready and before I open up a regression window in Data Desk, ranking my predictor variables using ol’ Pearson’s measure of correlation is Step One. The object is to rank your predictor variables (a.k.a. independent variables) according to the strength (either positive or negative) of their linear correlation with your predicted value (a.k.a. dependent variable). I do this in order to determine the order in which I will introduce my predictor variables into my regression analysis. (All of the following discussion assumes that your variables are numerical; either continuous variables such as ‘class year’, or 0/1 indicator variables you’ve created from your categorical variables.)

The tool I’ve been taught to use is Pearson’s Product-Moment Correlation Coefficient, also called Pearson’s r. This is a quantitative tool which yields a coefficient describing the strength and direction of the linear relationship between your Y variable (say, ‘giving’) and one of your X’s (say, ‘class year’). (Strictly speaking, r is not the slope of the best-fit line; it equals that slope only when both variables are standardized.) A value of 1 denotes a perfectly linear correlation in a positive direction, and a value of -1 is a perfect negative correlation. All possible values fall between -1 and 1. Values near zero denote the absence of a linear correlation; the variables might be related in some other way, but not linearly.


A scatterplot of the two variables ‘giving’ and ‘class year’, shown here, will reveal a relationship visually: ‘giving’ tends to decrease as ‘class year’ increases. This negative linear relationship is indicated by the downward-sloping line. The Pearson correlation reveals the same thing, but not visually: it puts an actual number to the strength and direction of that relationship. Why is this important? Because for many of your variables, a scatterplot is just going to look like a mess – the linear relationship is in there somewhere, but it’s not evident from the cloud of points. If you have a calculated value instead, you can easily decide which linear relationships demand priority attention.

It’s easy to do in Data Desk. Just select the icon for ‘giving’ as your Y and also select all of your predictor variables (x), then go to Calc in the menu. Select Correlations, then Pearson Product-Moment. If you have a lot of variables, the resulting table will be impressively large. It will look fearsome or beautiful, depending on how you feel about being faced with a wall of numbers. (I think it’s gorgeous.) To find the value that relates one variable to another, find the intersection of the row and column of the two variables. For example, in the table below, the Pearson correlation value for ‘giving’ and ‘class year’ is -0.460. (The correlation of any variable with itself is, of course, a perfect 1.)

Have a look around this table. Don’t be concerned about the actual values. Just see which values are higher than others. For instance, look at the intersection of “position present” and “employer present“. It’s a very high value: 0.812, which is very close to 1! This tells us that these two predictor variables are going to “interact” with each other when we bring them into the regression analysis. It makes sense: Job title and employer name are likely to be present or absent in tandem, although not perfectly. The practical result is that one of these variables will prove to be a significant predictor, while the other adds little or nothing new, and will be left out.

So how do I decide which variable gets added before the other? It’s simple.

The only part of the whole Pearson table that we’re interested in is the column of values under the heading ‘giving’. Data Desk allows us to copy the table as text and paste it into Excel. When I do this and strip out all the stuff I’m not interested in, the result looks like this. (I’ve re-sorted the variables alphabetically.)

Next, I sort the variables according to their Pearson correlation with Giving. The variables with the highest values will head the list. But notice a small problem: The strongest NEGATIVE predictors end up at the very bottom. Really, with its strong (negative) correlation with ‘giving’, the class year variable should rank first. So I do one extra step, creating a column with an Excel formula for the absolute value of the Pearson coefficient (i.e. without the minus sign), and re-sort on that value.
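The same ranking step can be done in pandas instead of Excel. A sketch, again with hypothetical column names:

```python
# A sketch of the ranking step: correlations with the outcome, ordered by
# absolute value so strong negative predictors such as class year rise to
# the top. Column names are hypothetical.
import pandas as pd

df = pd.read_csv("alumni.csv").select_dtypes("number")
corr_with_giving = df.corr()["log_giving"].drop("log_giving")
ranked = corr_with_giving.reindex(corr_with_giving.abs().sort_values(ascending=False).index)
print(ranked.round(3))
```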

This gives me a clear idea of the order in which I should add variables to the regression. For example, ’employer present’ seems to edge out ‘position present’. Due to variable interaction, though, the final roster of which variables will stay and which will go is NOT evident at this point. The proof is in the regression, where all sorts of interesting and unforeseeable interactions may crop up.

You don’t have to take this manual approach to adding your variables – your software probably offers an automated, or partially automated, method called stepwise regression. But after all the work of preparing my predictors, I enjoy watching the way they interact with each other as I work through training the model. The way I see it, the more hands-on you are with your analysis, the more you learn about your data.

Final note: The examples above actually use a transformed value of ‘giving’ – the log of giving. Transforming our dependent variable using a logarithmic function is a perfectly valid way to make the linear relationships among variables much more evident. (Why we transform variables is explained more fully here.) If we used ‘giving’ just as it is, the Pearson values would be very low, which would indicate only a very weak linear relationship. Even ‘class year’ would have a low value, and we know that isn’t a good description of reality, which is better represented by the scatterplot above.
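A sketch of that transformation and its effect, with hypothetical column names (np.log1p, the log of 1 + x, keeps the zero-dollar records at zero):

```python
# A sketch of log-transforming the dependent variable before computing
# correlations. Column names are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("alumni.csv").dropna(subset=["class_year", "lifetime_giving"])
df["log_giving"] = np.log1p(df["lifetime_giving"])

r_raw, _ = stats.pearsonr(df["class_year"], df["lifetime_giving"])
r_log, _ = stats.pearsonr(df["class_year"], df["log_giving"])
print(f"r with raw giving: {r_raw:.3f}")
print(f"r with log giving: {r_log:.3f}")
```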
