CoolData blog

22 September 2014

What predictor variables should you avoid? Depends on who you ask

People who build predictive models will tell you that there are certain variables you should avoid using as predictors. I am one of those people. However, we disagree on WHICH variables one should avoid, and increasingly this conflicting advice is confusing those trying to learn predictive modeling.

The differences involve two points in particular. Assuming charitable giving is the behaviour we’re modelling for, those two things are:

  1. Whether we should use past giving to predict future giving, and
  2. Whether attributes such as marital status are really predictors of giving.

I will offer my opinions on both points. Note that they are opinions, not definitive answers.

1. Past giving as a predictor

I have always stressed that if you are trying to predict “giving” using a multiple linear regression model, you must avoid using “giving” as a predictor among your independent variables. That includes anything that is a proxy for “giving,” such as attendance at a donor-thanking event. This is how I’ve been taught and that is what I’ve adhered to in practice.

Examples that violate this practice keep popping up, however. I have an email from Atsuko Umeki, IT Coordinator in the Development Office of the University of Victoria in Victoria, British Columbia*. She poses this question about a post I wrote in July 2013:

“In this post you said, ‘In predictive models, giving and variables related to the activity of giving are usually excluded as variables (if ‘giving’ is what we are trying to predict). Using any aspect of the target variable as an input is bad practice in predictive modelling and is carefully avoided.’  However, in many articles and classes I read and took I was advised or instructed to include past giving history such as RFA*, Average gift, Past 3 or 5 year total giving, last gift etc. Theoretically I understand what you say because past giving is related to the target variable (giving likelihood); therefore, it will be biased. But in practice most practitioners include past giving as variables and especially RFA seems to be a good variable to include.”

(* RFA is a variation of the more familiar RFM score, based on giving history — Recency, Frequency, and Monetary value.)

So modellers-in-training are being told to go ahead and use ‘giving’ to predict ‘giving’, but that’s not all: Certain analytics vendors also routinely include variables based on past giving as predictors of future giving. Not long ago I sat in on a webinar hosted by a consultant, which referenced the work of one well-known analytics vendor (no need to name the vendor here) in which it seemed that giving behaviour was present on both sides of the regression equation. Not surprisingly, this vendor “achieved” a fantastic R-squared value of 86%. (Fantastic as in “like a fantasy,” perhaps?)

This is not as arcane or technical as it sounds. When you use giving to predict giving, you are essentially saying, “The people who will make big gifts in the future are the ones who have made big gifts in the past.” This is actually true! The thing is, you don’t need a predictive model to produce such a prospect list; all you need is a list of your top donors.

Now, this might be reassuring to whoever is paying a vendor big bucks to create the model. That person sees names they recognize and thinks, ah, good — we are not too far off the mark. And if you’re trying to convince your boss of the value of predictive modelling, he or she might like to see the upper ranks filled with familiar names.

I don’t find any of that “reassuring.” I find it a waste of time and effort — a fancy and expensive way to produce a list of the usual suspects.

If you want to know who has given you a lot of money, you make a list of everyone in your database and sort it in descending order by total amount given. If you want to predict who in your database is most likely to give you a lot of money in the future, build a predictive model using predictors that are associated with having given large amounts of money. Here is the key point … if you include “predictors” that mean the same thing as “has given a lot of money,” then the result of your model is not going to look like a list of future givers — it’s going to look more like your historical list of past givers.

Does that mean you should ignore giving history? No! Ideally you’d like to identify the donors who have made four-figure gifts who really have the capacity and affinity to make six-figure gifts. You won’t find them using past giving as a predictor, because your model will be blinded by the stars. The variables that represent giving history will cause all other affinity-related variables to pale in comparison. Many will be rejected from the model for being non-significant or for adding nothing to the model’s ability to explain the variance in the outcome variable.

To sum up, here are the two big problems with using past giving to predict future giving:

  1. The resulting insights are sensible but not very interesting: People who gave before tend to give again. Or, stated another way: “Donors will be donors.” Fundraisers don’t need data scientists to tell them that.
  2. Giving-related independent variables will be so highly correlated with giving-related dependent variables that they will eclipse more subtle affinity-related variables. Weaker predictors will end up getting kicked out of our regression analysis because they can’t move the needle on R-squared, or because they don’t register as significant. Yet, it’s these weaker variables that we need in order to identify new prospects.

Let’s try a thought experiment. What if I told you that I had a secret predictor that, once introduced into a regression analysis, could explain 100% of the variance in the dependent variable ‘Lifetime Giving’? That’s right — the highest value for R-squared possible, all with a single predictor. Would you pay me a lot of money for that? What is this magic variable that perfectly models the variance in ‘Lifetime Giving’? Why, it is none other than ‘Lifetime Giving’ itself! Any variable is perfectly correlated with itself, so why look any farther?

This is an extreme example. In a real predictive model, a predictor based on giving history would be restricted to giving from the past, while the outcome variable would be calculated from a more recent period — the last year or whatever. There should be no overlap. R-squared would not be 100%, but it would be very high.
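To make the no-overlap rule concrete, here is a minimal sketch in Python (pandas assumed), using an invented gift-transaction extract with hypothetical column names. Everything dated before the cutoff feeds the predictor; everything on or after it becomes the outcome:

    import pandas as pd

    # Hypothetical gift-transaction extract: one row per gift.
    gifts = pd.DataFrame({
        "donor_id": [1, 1, 2, 3, 3, 3],
        "gift_date": pd.to_datetime(["2010-05-01", "2013-11-15", "2012-03-02",
                                     "2009-01-10", "2013-06-30", "2014-02-14"]),
        "amount": [100, 250, 50, 500, 1000, 750],
    })

    cutoff = pd.Timestamp("2013-01-01")

    # Predictor: giving strictly before the cutoff.
    past = (gifts[gifts.gift_date < cutoff]
            .groupby("donor_id").amount.sum().rename("past_giving"))

    # Outcome: giving on or after the cutoff; no overlap with the predictor.
    future = (gifts[gifts.gift_date >= cutoff]
              .groupby("donor_id").amount.sum().rename("future_giving"))

    model_data = pd.concat([past, future], axis=1).fillna(0)
    print(model_data)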

The R-squared statistic is useful for guiding you as you add variables to a regression analysis, or for comparing similar models in terms of fit with the data. It is not terribly useful for deciding whether any one model is good or bad. A model with an R-squared of 15% may be highly valuable, while one with R-squared of 75% may be garbage. If a vendor is trying to sell you on a model they built based on a high R-squared alone, they are misleading you.

The goal of predictive modeling for major gifts is not to maximize R-squared. It’s to identify new prospects.

2. Using “attributes” as predictors

Another thing about that webinar bugged me. The same vendor advised us to “select variables with caution, avoiding ‘descriptors’ and focusing on potential predictors.” Specifically, we were warned that a marital status of ‘married’ will emerge as correlated with giving. Don’t be fooled! That’s not a predictor, they said.

So let me get this straight. We carry out an analysis that reveals that married people are more likely to give large gifts, that donors with more than one degree are more likely to give large gifts, that donors who have email addresses and business phone numbers in the database are more likely to give large gifts … but we are supposed to ignore all that?

The problem might not be the use of “descriptors”; the problem might be with the terminology. Maybe we need to stop using the word “predictor”. One experienced practitioner, Alexander Oftelie, briefly touched on this nuance in a recent blog post. I quote (emphasis added by me):

“Data that on its own may seem unimportant — the channel someone donates, declining to receive the mug or calendar, preferring email to direct mail, or making ‘white mail’ or unsolicited gifts beyond their sustaining-gift donation — can be very powerful when they are brought together to paint a picture of engagement and interaction. Knowing who someone is isn’t by itself predictive (at best it may be correlated). Knowing how constituents choose to engage or not engage with your organization are the most powerful ingredients we have, and its already in our own garden.”

I don’t intend to critique Alexander’s post, which isn’t even on this particular topic. (It’s a good one – please read it.) But since he’s written this, permit me to scratch my head about it a bit.

In fact, I think I agree with him that there is a distinction between a behaviour and a descriptor/attribute. A behaviour, an action taken at a specific point in time (e.g., attending an event), can be classified as a predictor. An attribute (“who someone is,” e.g., whether they are married or single) is better described as a correlate. I would also be willing to bet that if we carefully compared behavioural variables to attribute variables, the behaviours would outperform, as Alexander says.

In practice, however, we don’t need to make that distinction. If we are using regression to build our models, we are concerned solely and completely with correlation. To say “at best it may be correlated” suggests that predictive modellers have something better at their disposal that they should be using instead of correlation. What is it? I don’t know, and Alexander doesn’t say.

If in a given data set, we can demonstrate that being married is associated with likelihood to make a donation, then it only makes sense to use that variable in our model. Choosing to exclude it based on our assumption that it’s an attribute and not a behaviour doesn’t make business sense. We are looking for practical results, after all, not chasing some notion of purity. And let’s not fool ourselves, or clients, that we are getting down to causation. We aren’t.

Consider that at least some “attributes” can be stated in terms of a behaviour. People get married — that’s a behaviour, although not related to our institution. People get married and also tell us about it (or allow it to be public knowledge so that we can record it) — that’s also a behaviour, and potentially an interaction with us. And on the other side of the coin, behaviours or interactions can be stated as attributes — a person can be an event attendee, a donor, a taker of surveys.

If my analysis informs me that widowed female alumni over the age of 60 are extremely good candidates for a conversation about Planned Giving, then are you really going to tell me I’m wrong to act on that information, just because sex, age and being widowed are not “behaviours” that a person voluntarily carries out? Mmmm — sorry!

Call it quibbling over semantics if you like, but don’t assume it’s so easy to draw a circle around true predictors. There is only one way to surface predictors, which is to take a snapshot of all potentially relevant variables at a point in time, then gather data on the outcome you wish to predict (e.g., giving) after that point in time, and then assess each variable in terms of the strength of association with that outcome. The tools we use to make that assessment are nothing other than correlation and significance. Again, if there are other tools in common usage, then I don’t know about them.
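For illustration only, here is a rough sketch of that procedure in Python (scipy assumed), with simulated data and invented variable names. The shape of the workflow is the point: snapshot first, outcome later, then a correlation and significance check on each candidate:

    import numpy as np
    import pandas as pd
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 1000

    # Snapshot of candidate variables, taken as of a cutoff date.
    snapshot = pd.DataFrame({
        "married": rng.integers(0, 2, n),
        "event_attendee": rng.integers(0, 2, n),
        "eye_colour_brown": rng.integers(0, 2, n),  # expected to be noise
    })

    # Simulated outcome observed AFTER the snapshot date.
    gave = (0.3 * snapshot.married + 0.5 * snapshot.event_attendee
            + rng.normal(0, 1, n)) > 0.8
    outcome = gave.astype(float)

    # Assess each candidate purely on strength of association with the outcome.
    for col in snapshot.columns:
        r, p = stats.pearsonr(snapshot[col], outcome)
        print(f"{col:18s} r = {r:+.3f}  p = {p:.4f}")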

Caveats and concessions

I don’t maintain that this or that practice is “wrong” in all cases, nor do I insist on rules that apply universally. There’s a lot of art in this science, after all.

Using giving history as a predictor:

  • One may use some aspects of giving to predict outcomes that are not precisely the same as ‘Giving’, for example, likelihood to enter into a Planned Giving arrangement. The required degree of difference between predictors and outcome is a matter of judgement. I usually err on the side of scrupulously avoiding ANY leakage of the outcome side of the equation into the predictor side — but sure, rules can be bent.
  • I’ve explored the use of very early giving (the existence and size of gifts made by donors before age 30) to predict significant giving late in life. (See Mine your donor data with this baseball-inspired analysis.) But even then, I don’t use that as a variable in a model; it’s more of a flag used to help select prospects, in addition to modeling.

Using descriptors/attributes as predictors:

  • Some variables of this sort will appear to have subtly predictive effects in-model, effects that disappear when the model is deployed and new data starts coming in. That’s regrettable, but it’s something you can learn from — not a reason to toss all such variables into the trash, untested. The association between marital status and giving might be just a spurious correlation — or it might not be.
  • Business knowledge mixed with common sense will help keep you out of trouble. A bit of reflection should lead you to consider using ‘Married’ or ‘Number of Degrees’, while ignoring ‘Birth Month’ or ‘Eye Colour’. (Or astrological sign!)

There are many approaches one can take with predictive modeling, and naturally one may feel that one’s chosen method is “best”. The only sure way to proceed is to take the time to define exactly what you want to predict, try more than one approach, and then evaluate the performance of the scores when you have actual results available — which could be a year after deployment. We can listen to what experts are telling us, but it’s more important to listen to what the data is telling us.

//////////

Note: When I originally posted this, I referred to Atsuko Umeki as “he”. I apologize for this careless error and for whatever erroneous assumption that must have prompted it.

16 January 2012

Address updates and affinity: Consider the source

Filed under: Correlation, Predictor variables, skeptics — kevinmacdonell @ 1:03 pm

Some of the best predictors in my models are related to the presence or absence of phone numbers and addresses. For example, the presence of a business phone is usually a highly significant predictor of giving. As well, a count of either phone or address updates present in the database is also highly correlated with giving.

Some people have difficulty accepting this as useful information. The most common objection I hear is that such updates can easily come from research and data appends, and are therefore not signals of affinity at all. And that would be true: Any data that exists solely because you bought it or looked it up doesn’t tell you how someone feels about your institution. (Aside from the fact that you had to go looking for them in the first place — which I’ve observed is negatively correlated with giving.)

Sometimes this objection comes from someone who is just learning data mining. Then I know I’m dealing with someone who’s perceptive. They obviously get it, to some degree — they understand there’s potentially a problem.

I’m less impressed when I hear it from knowledgeable people, who say they avoid contact information in their variable selection altogether. I think that’s a shame, and a signal that they aren’t willing to put in the work to a) understand the data they’re working with, or b) take steps to counteract the perceived taint in the data.

If you took the trouble to understand your data (and why wouldn’t you?), you’d find out soon enough whether the variables are usable:

  • If the majority of phone numbers or business addresses or what-have-you are present in the database only because they came off donors’ cheques, then you’re right not to use that information to predict giving. It’s not independent of giving and will harm your model. The telltale sign might be a correlation with the target variable that exceeds the correlations for all your other variables.
  • If the information could have come to you any number of ways (with gift transactions being only one of them), then use with caution. That is, be alert if the correlation looks too good to be true. This is the most likely scenario, which I will discuss in detail shortly.
  • If the information could only have come from data appends or research, then you’ve got nothing much to worry about: The correlation with giving will be so weak that the variable probably won’t make it into your model at all. Or it may be a negative predictor, highlighting the people who allowed themselves to become lost in the first place. An exception to the “don’t worry” policy would be if research is conducted mainly to find past donors who have become lost — then there might be a strong correlation that will lead you astray.

An in-house predictive modeler will simply know what the case is, or will take the trouble to find out. A vendor hired to do the work may or may not bother — I don’t know. As far as my own models are concerned, I know that addresses and phone numbers come to us via a mix of voluntary and involuntary means: Via Phonathon, forms on the website, records research, and so on.

I’ve found that a simple count of all historical address updates for each alum is positively correlated with giving. But a line plot of the relationship between number of address updates and average lifetime giving suggests there’s more going on under the surface. Average lifetime giving goes up sharply for the first half-dozen or so updates, and then falls away just as sharply. This might indicate a couple of opposing forces: Alumni who keep us informed of their locations are more likely to be donors, but alumni who are perpetually lost and need to be found via research are less likely to be donors.
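Here is a small sketch of how such a line plot can be produced (pandas and matplotlib assumed). The numbers are invented purely to echo the shape just described; on a real file you would group alumni by update count and take mean lifetime giving:

    import matplotlib.pyplot as plt
    import pandas as pd

    # Invented averages: rising for the first half-dozen updates, then falling.
    curve = pd.DataFrame({
        "n_addr_updates": [0, 1, 2, 3, 4, 5, 6, 8, 10, 12],
        "avg_lt_giving": [50, 120, 300, 480, 520, 600, 550, 200, 80, 20],
    }).set_index("n_addr_updates").avg_lt_giving

    curve.plot(marker="o")
    plt.xlabel("Number of address updates")
    plt.ylabel("Average lifetime giving ($)")
    plt.show()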

If you’re lucky, your database not only has a field in which to record the source of updates, but your records office is making good use of it. Our database happens to have almost 40 different codes for the source, applied to some 300,000 changes of address and/or phone number. Not surprisingly, some of these are not in regular use — some account for fewer than one-tenth of one percent of updates, and will have no significance in a model on their own.

For the most common source types, though, an analysis of their association with giving is very interesting. Some codes are positively correlated with giving, some negatively. In most cases, a variable is positive or negative depending on whether the update was triggered by the alum (positive), or by the institution (negative). On the other hand, address updates that come to us via Phonathon are negatively correlated with giving, possibly because by-mail donors tend not to need a phone call — if ‘giving’ were restricted to phone solicitation only, perhaps the association might flip toward the positive. Other variables that I thought should be positive were actually flat. But it’s all interesting stuff.

For every source code, a line plot of average LT giving and number of updates is useful, because the relationship is rarely linear. The relationship might be positive up to a point, then drop off sharply, or maybe the reverse will be true. Knowing this will suggest ways to re-express the variable. I’ve found that alumni who have a single update based on the National Change of Address database have given more than alumni who have no NCOA updates. However, average giving plummets for every additional NCOA update. If we have to keep going out there to find you, it probably means you don’t want to be found!
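One hedged sketch of such a re-expression, with an invented NCOA update count: split the single count into a flag for “exactly one update” and a count of the updates beyond the first, so each piece can carry its own sign in the model:

    import pandas as pd

    # Hypothetical count of NCOA-sourced address updates per alum.
    ncoa = pd.Series([0, 1, 1, 2, 3, 0, 5, 1], name="n_ncoa")

    re_expressed = pd.DataFrame({
        "ncoa_exactly_one": (ncoa == 1).astype(int),     # higher giving
        "ncoa_extra_updates": (ncoa - 1).clip(lower=0),  # each extra: lower giving
    })
    print(re_expressed)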

Classifying contact updates by source is more work, of course, and it won’t always pay off. But it’s worth exploring if your goal is to produce better, more accurate models.

24 March 2011

Does your astrological sign predict whether you’ll give?

Filed under: Alumni, Correlation, Peter Wylie, Predictor variables, Statistics — kevinmacdonell @ 7:20 am

Last weekend, with so many other pressing things I could have been doing, I got it in my head to analyze people’s astrological signs for potential association with propensity to give. I don’t know what came over me; perhaps it was the Supermoon. But when you’ve got a data set in front of you that contains giving history and good birth dates for nearly 85,000 alumni, why not?

Let me say first that I put no stock in astrology, but I know a few people who think being a Libra or a Gemini makes some sort of difference. I imagine there are many more who are into Chinese astrology, who think the same about being a Rat or a Monkey. And even I have to admit that an irrational aspect of me embraces my Taurus/Rooster nature.

If one’s sign implies anything about personality or fortune, I should think it would be reflected in one’s generosity. Ever in pursuit of the truth, I spent a rather tedious hour parsing 85,000 birth dates into the signs of the zodiac and the animal signs of Chinese astrology. As you will see, there are in fact some interesting patterns associated with birth date, on the surface at least.
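If you would rather not spend the tedious hour, something like this sketch does the zodiac parsing (pandas assumed). Boundary dates vary slightly from one published table to another, so treat the cutoffs below as an assumption; they match the two boundaries quoted in the next paragraph:

    import pandas as pd

    # (month, day) on which each sign begins, per commonly published tables.
    SIGN_STARTS = [
        ((1, 20), "Aquarius"), ((2, 19), "Pisces"), ((3, 21), "Aries"),
        ((4, 20), "Taurus"), ((5, 21), "Gemini"), ((6, 21), "Cancer"),
        ((7, 23), "Leo"), ((8, 23), "Virgo"), ((9, 23), "Libra"),
        ((10, 23), "Scorpio"), ((11, 22), "Sagittarius"), ((12, 22), "Capricorn"),
    ]

    def zodiac(date):
        md = (date.month, date.day)
        sign = "Capricorn"  # the sign that wraps around the new year
        for start, name in SIGN_STARTS:
            if md >= start:
                sign = name
        return sign

    birth_dates = pd.to_datetime(["1965-07-01", "1980-01-05", "1972-04-25"])
    print([zodiac(d) for d in birth_dates])  # Cancer, Capricorn, Taurus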

Because human beings mate at any time of year, the alumni in the sample are roughly equally distributed among the 12 signs of the zodiac. There seem to be slightly more births in the warmer months than in the period of December to February: Cancer (June 21 to July 22) represents 8.9% of the sample while at the lower end, Capricorn (Dec 22 to Jan 20) represents 7.6% of the sample — a spread of less than two percentage points.

What we want to know is if any one sign is particularly likely to give to alma mater. I coded anyone who had any giving in their lifetime as ‘1’ and all never-donors as ‘0’. At the high end, Taurus natives have a donor rate of 30.7% and at the low end, Aries natives have a donor rate of 29.0%. All the other signs fall between those two rates, a range of a little more than one and a half percentage points.

That’s a very narrow range of variance. If I were seriously evaluating the variable ‘Astrological sign’ as a predictor, I would probably stop right there, seeing nothing exciting enough to make me continue.

But have a look at this bar chart. I’ve arranged the signs in their calendar order, which immediately suggests that there’s a pattern in the data: A peak at Taurus, gradually falling to Scorpio, peaking again at Sagittarius, then falling again until Taurus comes around once more.

The problem with the bar chart is that the differences in giving rates are exaggerated visually, because the range of variance is so limited. What appears to be a pattern may be nothing of the sort.

In fact, the next chart tells a conflicting tale. The Tauruses may have the highest participation rate, but among donors they and three other signs have the lowest median level of lifetime giving ($150), and Aries have the highest median ($172.50). The calendar-order effect we saw above has vanished.

These two charts fail to tell the same tale, which indicates to me that although we may observe some variance in giving between astrological signs, the variance might well be due to mere chance. Is there a way to demonstrate this statistically? I was discussing this recently with Peter Wylie, who helped me sort this out. Peter told me that the supposed pattern in the first chart reminded him of the opening of Malcolm Gladwell’s book, Outliers, in which the author examines why a hugely disproportionate number of professional hockey and soccer players are born in January, February and March. (I won’t go farther than that — read the book for that discussion.)

In the case of professional hockey players, birth date and a player’s development (and career progress) are definitely associated. It’s not due to a random effect. In the case of birth date and giving, however, there is room for doubt. Peter took me through the use of chi-square, a statistic I hadn’t encountered since high school. I’m not going into detail about chi-square — there is plenty out there online to read — but briefly, chi-square is used to determine if a distribution of observed frequencies of a value for a categorical or ordinal variable differs from the theoretical expected frequencies for that variable, and from there, if the discrepancy is statistically significant.

Figuring out the statistical significance part used to involve looking up the calculated value for chi-square in a table based on something called degrees of freedom, but nowadays your stats software will automatically provide you with a statistic telling you whether the result is significant or not: the p statistic, which will be familiar to you if you’ve used linear regression. The rule of thumb for significance is a p-value of 0.05 or less.

As it turns out, the observed differences in the frequency of donors for each astrological sign have a significance value of p = 0.3715. This is way above the 0.05 threshold, and therefore we cannot rule out the possibility that these variations are due to mere chance. So astrology is a bust for fundraisers.
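For anyone who wants to replicate the test, here is a minimal chi-square sketch using scipy. The donor and non-donor counts are invented stand-ins for the real tallies by sign:

    import numpy as np
    from scipy import stats

    # Observed donor / non-donor counts for the 12 zodiac signs (illustrative).
    donors = np.array([2100, 2174, 2095, 2120, 2088, 2110,
                       2130, 2067, 2150, 2105, 2140, 2090])
    non_donors = np.array([4900, 4826, 4905, 4880, 4912, 4890,
                           4870, 4933, 4850, 4895, 4860, 4910])

    # Chi-square test of independence: does donor status vary by sign?
    chi2, p, dof, expected = stats.chi2_contingency(np.vstack([donors, non_donors]))
    print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
    # A p-value above 0.05 means we cannot rule out chance variation.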

Now for something completely different. We haven’t looked at the Chinese animal signs yet. Here is a table showing a breakdown by Chinese astrological sign by the percentage of alumni with at least some giving, and median lifetime giving. The table is sorted by donor participation rate, lowest to highest.

Hmm, it would seem that being a Horse is associated with a higher level of generosity than the norm. And here’s the biggest surprise: A Chi-square test reveals the differences in donor frequencies between animals to be significant! (p-value < 0.0001).

What’s going on here? Shall we conclude that the Chinese astrologers have it all figured out?

Let’s go back to the data. First of all, how were alumni assigned an animal sign in the first place? You may be familiar with the paper placemats in Chinese restaurants that list birth years and their corresponding animal signs. Anyone born in the years 1900, 1912, 1924, 1936, 1948, 1960, 1972, 1984, 1996 or 2008 is a Rat. Anyone born in 1901, 1913, 1925, etc. etc. is an Ox, and so on, until all the years are accounted for. Because the alumni in each animal category are drawn from birth years with an equal span of years between them, we might assume that each sign has roughly the same average age. This is key, because if the signs differ on average age, then age might be an underlying cause of variations in giving.
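The placemat rule is easy to formalize. A minimal sketch, using the 1900-is-a-Rat anchor given above (note this ignores the lunar new year, so January and February births may really belong to the previous year’s animal):

    # 1900 was a Rat year, 1901 an Ox year, and the cycle repeats every 12 years.
    ANIMALS = ["Rat", "Ox", "Tiger", "Rabbit", "Dragon", "Snake",
               "Horse", "Goat", "Monkey", "Rooster", "Dog", "Pig"]

    def animal_sign(birth_year: int) -> str:
        return ANIMALS[(birth_year - 1900) % 12]

    print(animal_sign(1954))  # Horse
    print(animal_sign(1962))  # Tiger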

My data set does not include anyone born before 1930, and goes up to 1993 (a single precocious alum who graduated at a very young age). Tigers, with the lowest participation rate, are drawn from the birth years 1938, 1950, 1962, 1974 and 1986. Horses, with the highest participation rate, are drawn from the birth years 1930, 1942, 1954, 1966 and 1978, plus only a handful of young alumni from 1990. For Tigers, 77% were born in 1974 or earlier, but for Horses, 99% of alumni were born in 1978 or earlier.

The bottom line is that the Horses in my data set are older than the Tigers, as a group. The Horses have a median age of 45, while the Tigers have a median age of 37. And we all know by now that older alumni are more likely to be donors.

Again, my conversation with Peter Wylie helped me figure this out statistically. The short answer is: After you’ve accounted for the age of alumni, variations in giving by animal sign are no longer significant.

(The longer answer is: If you perform a linear regression of Lifetime Giving (log-transformed) on Age and compute the residuals, then run an Analysis of Variance (ANOVA) for the residuals and Animal Sign, the variance is NOT significant, p = 0.1118. The residuals can be thought of as Lifetime Giving with the explanatory effect of Age “washed out,” leaving only the unexplained variance. Animal Sign fails to account for any significant amount of the remaining variance in LT Giving, which is an indication that Animal Sign is just a proxy for Age.)
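Here is a sketch of that two-step check (statsmodels assumed), run on simulated data in which animal sign really is nothing but a proxy for age, so the ANOVA should come back non-significant:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    rng = np.random.default_rng(1)
    n = 2000

    # Simulated data: giving depends only on age; sign is derived from birth year.
    age = rng.integers(21, 85, n)
    df = pd.DataFrame({
        "log_giving": 0.05 * age + rng.normal(0, 1, n),
        "age": age,
        "sign": (2011 - age - 1900) % 12,  # animal-sign index from birth year
    })

    # Step 1: regress (log) lifetime giving on age; keep the residuals.
    step1 = ols("log_giving ~ age", data=df).fit()
    df["resid"] = step1.resid

    # Step 2: ANOVA of the residuals against animal sign. A proxy for age
    # should explain none of the leftover variance.
    step2 = ols("resid ~ C(sign)", data=df).fit()
    print(sm.stats.anova_lm(step2))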

Does any of this matter? Mostly no. First of all, a little common sense can keep you out of trouble. Sure, some significant predictors will be non-intuitive, but it doesn’t hurt to be skeptical. Second, if you do happen to prepare some predictors based on astrological sign, their non-significance will be evident as soon as you add them to your regression analysis, particularly if you’ve already added Age or Class Year as a predictor in the case of the Chinese signs. Altogether, then, the risk that your models will be harmed by such meaningless variables is very low.

14 March 2011

Correlation and you

Filed under: Correlation, Predictor variables, regression, Statistics — kevinmacdonell @ 7:25 am

If you read books and blogs on statistics, eventually your understanding of even the most basic concepts will start to smear. Things we think ought to be well-established by now are matters of controversy. Ranking high on the list of most slippery concepts is correlation. In today’s post, I’m going to make the concept very complicated, and then I’m going to dismiss the complexity and make it all simple again. I’m telling you this in advance so I don’t lose you partway through.

Correlation is the foundation of predictive modeling. The degree to which one variable, x, changes its value in relation to a second variable, y, either positively or negatively, is the very definition of x’s usefulness in predicting the value of y. The tool I use to quantify the strength of that relationship is Pearson Product-Moment Correlation, or Pearson’s r, which some statistics texts simply call “correlation”. (It would help if you read the blog post on Pearson before reading this one.)

Before we go any farther, I need to explain why I want to quantify the strength of relationships. At the start of any modeling project, I have a hundred or so potential predictor variables in my data file. Some are going to be excellent predictors, most will be only so-so, and others will have little or no association with the outcome variable. I want to introduce variables into the regression analysis in an order that makes sense, so that the best predictors are added first. Due to the complexity of interactions among variables, there is no telling in advance what will actually happen as variables are added, so any list of variables ordered by strength of correlation is merely a rough guide.
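A minimal sketch of that ranking step, using a toy modeling file with invented variable names: compute Pearson’s r of every candidate against the outcome, and sort by absolute value:

    import pandas as pd

    # Toy modeling file: 'giving' is the outcome, the rest are candidates.
    df = pd.DataFrame({
        "giving": [0, 50, 0, 200, 25, 0, 500, 10],
        "married": [0, 1, 0, 1, 1, 0, 1, 0],
        "n_degrees": [1, 2, 1, 2, 1, 1, 3, 1],
        "bus_phone_present": [0, 1, 0, 1, 0, 0, 1, 1],
    })

    # A rough guide to the order in which to feed variables to the regression.
    ranking = df.corr()["giving"].drop("giving").abs().sort_values(ascending=False)
    print(ranking)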

What I DON’T use Pearson’s r for much is exploring variables. At the exploration stage, before modeling begins, I will look at a variable a number of ways in order to get a sense of how valuable the variable will be, and how I might want to transform it or re-express it to make it better. I will look at how the variable is distributed, and compare average and median giving between groups (for example, Home Phone Present, Y/N). These and other techniques, as described in Peter Wylie’s book “Data Mining for Fundraisers,” are simpler, more direct, and often more helpful than abstract measures of correlation.

It’s only after I’ve done the exploration work and tweaked the variables for maximum effect that I’m ready to rank them in order by their correlation values. So, with that out of the way, let’s look at Pearson’s r in more detail.

The textbooks make it abundantly clear: Pearson’s r quantifies the relationship between two continuous variables that are linearly related. Right away, we’ve got a problem: Very few of the variables I work with are continuous, and most of the relationships I see do not meet the definition of “linear”. Yet, I use Pearson’s r exclusively. Does this mean I’ve been misusing and abusing the method?

I don’t see it that way, but it is interesting to read how these things are discussed in the literature.

You can assess whether a relationship between two variables is roughly linear by looking at a scatterplot of the variables. Below is a scatterplot of ‘Giving’ (log-transformed) and ‘Age’, created in Data Desk. It’s a big, messy cloud of points (some 80,000 of them!). A lot of the relationship is hidden by overplotting (the overlapping of points) along the bottom line — that row of points at the zero giving mark represents non-donors, and there are many more of them near the young end of that line than there are near the older end. Still, at least you can see a vague linear relationship in the upward fanning of the data: As age increases, so does lifetime giving. A best-fit line through the data would slope upward from left to right, and therefore the Pearson correlation value is high.

We’ve got two continuous variables, and a linear relationship, so we get the Statistician’s Seal of Approval: It’s okay to use Pearson’s r to measure strength of correlation between these two variables.

Unfortunately, as I said, most of the variables we use in predictive modeling are not continuous, and they don’t look like much of anything in a scatterplot. Here’s a scatterplot of a Likert-scale survey response. The survey question asked alumni how likely they are to donate to alma mater, and the scale runs from 1 to 5, with 5 being “very likely.” This is not a continuous variable, because there are no possible intermediate values among the five levels. It’s ordinal. The plot with Lifetime Giving is difficult to interpret, but it sure isn’t linear:

Yes, the line of points for the highest response, 5, does extend higher than any other line in terms of lifetime giving. But due to overplotting, there is no way to tell how many nondonors lurk in the single dots that appear at the foot of every line and which indicate zero dollars given. This in no way resembles a cloud of points through which one can imagine a best-fit line being drawn. I can TELL you that a positive response to this question is strongly associated with high levels of lifetime giving, and it is, but you could be forgiven for remaining unconvinced by this “evidence”.

Even worse are the most common predictor variables: indicator variables, in binary form (0/1). For example, let’s say I express the condition of being ‘Married’ as a binary variable. A scatterplot of ‘Giving’ and ‘Married’ is even less useful than the one for the survey question:

Yuck! We’ve got 80,000 data points all jammed up in two solid lines at 0 (not married) and 1 (married). We can’t tell from this plot, but it just so happens that the not-married line contains a lot of points sitting at lifetime giving of zero, far more than the married line. Being married is associated with giving, but who could tell from this? There’s no way this relationship is linear.

One of the tests for the appropriateness of using Pearson’s r is whether a scatterplot of the variable looks like a “straight enough” line. Another test is that both variables are quantitative and continuous — not categorical (or ordinal) and discrete. The fact that I can take a categorical variable (Married) and re-express it as a number (0/1) makes no difference. Turning it into a number makes it possible to calculate Pearson’s r, but that doesn’t make it okay to do so.

So the textbooks tell us. What else do they have to say? Read on.

There are methods other than Pearson’s r which we can use to measure the degree of association between two variables, which do not require the presence of a linear relationship. One of these is Spearman’s Rho, which is the correlation between the ranks of two variables. Rho replaces the data values themselves with their ranks within each variable — so the lowest value in each variable becomes ‘1’, the next lowest becomes ‘2’, and so on — and then calculates to what degree those values are related between the two variables, either negatively or positively.

Spearman’s Rho is sometimes called Spearman Rank Correlation, which muddies the waters a bit as it implies that it’s a measure of correlation. It is, but the correlation is between the ranks, not the data values themselves. The bottom line is that a statistician would tell us that Spearman is the appropriate calculation for putting a value to the strength of association between two variables that are not linearly related but which may show a consistently increasing or decreasing trend. Unlike Pearson’s r, it is a nonparametric method — it is free of any requirement that the distribution of the variables look a certain way.

Spearman’s Rho has some special properties which I won’t get into, but overall it looks a lot like Pearson’s r. It can take on values between -1 and 1 (a perfect negative relationship to a perfect positive relationship, both called monotone relationships); values near zero indicate the absence of a relationship — just like Pearson’s r. And your stats software makes it equally easy to compute.

So, great. We’ve got Spearman’s Rho, which we are told is just the thing for analyzing the Likert-scale variable I showed you earlier. What about the indicator variable (for ‘Married’)? Well, no, we’re told: You can’t use Pearson’s or Spearman’s to calculate correlation for categorical variables. For that, you need the point-biserial correlation coefficient.

Huh?

That’s right, another measure of correlation. In fact, there are many types of measures out there. I’ve got a sheet in front of me right now that lists more than eight different measures, the choice of which depends on the combination of variables you’re analyzing (two continuous variables, one continuous and one ordinal variable, two binary variables, one binary and one ordinal, etc. etc.). And that list is not exhaustive.

Another wrinkle is that some texts don’t call these relationships “correlation” at all. By a strict definition, if a relationship between variables isn’t appropriate for Pearson’s r, then it ain’t correlation. We are supposed to call it by the more vague term “association.”

Hey, I’m cool with that. I’ll call it whatever you want. But what are the practical implications of all this?

As near as I can tell, ZERO.

Uh-huh, I’ve just made you read more than a thousand words on how complicated correlation is. Now I’m going to dismiss all of that with an imperial wave of my hand. I need you to ignore everything I’ve just said, for two reasons which I will elaborate on in a moment:

  1. As I stated at the outset, the only reason I calculate correlations is to explore which variables are most likely to figure prominently in a regression analysis. For a rough ranking of variables, Pearson’s r is a “good enough” tool.
  2. The correlation r is the basis of linear regression, which is our end-goal. Not Spearman, not point-biserial, nor any other measure.

Regarding point number one: We are not concerned about the precise value of the calculated association, only the approximate ranking of our variables. All we want to know is: which variables are probably most valuable for our model and should be added to the regression first? As it turns out, rankings using other measures of correlation (sorry — association) hardly vary from a Pearson’s r ranking. It’s extra bother for nothing.
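That claim is easy to check on your own data. Here is a sketch comparing the two rankings on simulated variables of assorted types; recall, too, that for a 0/1 variable the point-biserial coefficient is numerically identical to Pearson’s r anyway:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    n = 5000

    # Simulated candidates: binary, ordinal, and pure noise.
    df = pd.DataFrame({
        "event_attendee": rng.integers(0, 2, n),
        "survey_likert": rng.integers(1, 6, n),
        "random_noise": rng.normal(0, 1, n),
    })
    df["giving"] = (20 * df.event_attendee + 5 * df.survey_likert
                    + rng.normal(0, 10, n))

    # Rank the candidates both ways; the orderings rarely disagree.
    pearson = df.corr(method="pearson")["giving"].drop("giving").abs()
    spearman = df.corr(method="spearman")["giving"].drop("giving").abs()
    print(pd.DataFrame({"pearson_rank": pearson.rank(ascending=False),
                        "spearman_rank": spearman.rank(ascending=False)}))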

And to point number two: There’s a real disconnect between what the textbooks say about correlation analysis and what they say about regression. It seems to me that if Pearson’s r is inappropriate for all but continuous, linearly-related variables, then we would also be told that only continuous, linearly-related variables can be used in regression. That’s not the case: Social-science researchers and modelers toss ordinal and binary variables into regressions with wild abandon. If we didn’t we’d have almost nothing left to work with.

The disconnect is bridged with an explanation I found in one university stats textbook. It’s touched on only briefly, and towards the end of the book. The gist is this: For a 0/1 indicator variable added to a linear regression, the coefficient is not the slope of a line, as we are always told to understand it. The indicator acts to vertically shift the line, so that instead of one regression line, we have two: the unshifted line when the indicator variable is equal to zero, and another line shifted vertically up or down (depending on the sign of the coefficient) when the value is 1. This seems like essential information, but hardly rates any discussion at all. That’s stats for you.
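A quick sketch makes the textbook’s point visible (statsmodels assumed, simulated data): fit a regression with one continuous predictor and one 0/1 indicator, then read off two parallel lines:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 500

    age = rng.uniform(20, 80, n)
    married = rng.integers(0, 2, n)
    giving = 10 + 2.0 * age + 15.0 * married + rng.normal(0, 10, n)

    # One continuous predictor plus one 0/1 indicator.
    X = sm.add_constant(np.column_stack([age, married]))
    b0, b_age, b_married = sm.OLS(giving, X).fit().params

    # The indicator shifts the intercept, not the slope.
    print(f"not married: predicted giving = {b0:.1f} + {b_age:.2f} * age")
    print(f"married:     predicted giving = {b0 + b_married:.1f} + {b_age:.2f} * age")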

To summarize:

  1. Don’t be sidetracked by warnings regarding measures of association/correlation that have specific uses but do not relate to your end goal: Building a regression model for the pragmatic purpose of making predictions.
  2. Most of the time, scatterplots can’t tell you what you need to know, because most of our data is categorical.
  3. Indicator variables (and ordinal variables) are materially different from the pretty, linearly-related variables, and they are absolutely OK for use in regression.
  4. SAY “association”, but DO “correlation”.
  5. Don’t feel bad if you’re having difficulty learning predictive modeling from a stats textbook. I can’t see how anyone could.

3 September 2010

More on making hay from variables that interact

A short while ago I wrote about pairs of predictor variables that are highly correlated with each other, i.e. that have strong interactions in regression analysis. (Making hay when predictor variables interact.) The example I used was Position Present and Employer Name Present. Instead of using one and throwing the other away as redundant, you can combine them to form a new variable with more predictive power than either of the original two on their own.

In this post, I’ll show you how to identify other likely pairs of variables from which you can try to make similar “combination variables.”

When independent variables are strongly correlated with one another in a regression, it’s called multicollinearity. (You can read a good discussion of multicollinearity on the Stats Make Me Cry blog: Top Ten Confusing Stats Terms Explained in Plain English, #9: Multicollinearity.) Position Present and Employer Name Present is an obvious example, but all kinds of subtle combinations are possible and difficult to foresee. We need to call on some help in detecting them. That help is provided by Pearson’s Product-Moment Correlation, also known as Pearson’s r. I’ve written about Pearson’s r before.

In a nutshell, Pearson’s r calculates a number that describes the strength of linear correlation between any two variables. Your stats software makes this easy. In DataDesk, I select all the icons of the variables I want to calculate correlations for, then find Pearson’s r in the menu. The result is a new window containing a table full of values. If you include many variables at once, this table will be massive. Below is a Pearson table based on some real data from a university. The table works exactly like those distance tables you find on old tourist highway maps (they don’t seem to make those anymore — wonder why): to find the distance from, say, Albuquerque to Santa Fe, you’d find the number at the intersection of the Albuquerque column with the Santa Fe row, and that would be the number of miles to travel. In the table below, the cities are variables and the mileage is Pearson’s r.

Don’t be intimidated by all the numbers! Just let your eye wander over them. Notice that some are positive, some negative. The negative sign simply means that the correlation between the two variables is negative. Notice also that most of the numbers are small, less than 0.1 (or minus 0.1). As far as multicollinearity is concerned, we’re most interested in these bigger values, i.e. values that are furthest from zero (no correlation) and closest to 1 (perfect linear correlation).

I realize some of the variable names will be a bit mysterious, but you might be able to guess that “Number deg” is Number of Degrees and that “Grad Y” means Graduated. Their Pearson’s r value (0.26) is one of the higher correlations, which makes sense, right?

Noticing certain correlations can teach you things about the data. ‘Female’ is correlated with ‘Class (Year)’ — because at this university, males outnumbered females years ago, but since the 1980s, females have outnumbered males by an ever-increasing factor. On the other hand, ‘Number HC’ (campus reunions attended) is negatively correlated with ‘Class Year’ — older alumni have attended more events (no big surprise), but also young alumni are not big on reunions at this institution.

Look at ‘Business Phone Present’ and ‘Employer Present’. Their r value is relatively high (0.376). I would test some variations of those two. You could add them together, so that the new variable ranges from 0 to 2. Or you could multiply them, to give you a binary variable that has a value of 1 only when both of the original variables are 1. You might end up with a predictor that is more highly correlated with ‘Giving’ than either of the original two variables.
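A sketch of both combinations, using invented values for the two flags:

    import pandas as pd

    # Hypothetical 0/1 variables that are highly correlated with each other.
    df = pd.DataFrame({
        "bus_phone_present": [1, 1, 0, 1, 0, 0, 1, 0],
        "employer_present": [1, 0, 0, 1, 0, 1, 1, 0],
        "giving": [500, 100, 0, 800, 10, 50, 650, 0],
    })

    # Two ways to combine the pair into one new variable:
    df["combo_sum"] = df.bus_phone_present + df.employer_present   # 0, 1 or 2
    df["combo_both"] = df.bus_phone_present * df.employer_present  # 1 only if both

    # Compare each candidate's correlation with giving.
    print(df.corr()["giving"].drop("giving").round(3))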

With non-binary variables such as ‘Class Year’ and ‘Number Events Attended’, the results of combining will be even more varied and interesting. What you do is up to you; there’s no harm in trying. When you’re done playing, just rank all your old and new variables in order by the absolute value of their strength of linear correlation with your predicted value (say, Giving), and see how the new variables fare.
