# CoolData blog

## 11 January 2011

### Blackjack, and the predictor that will bust your hand

Filed under: Pitfalls, Predictor variables, regression | kevinmacdonell @ 6:29 am

This hand looks great, but not for this game. (Image used via Creative Commons license.)

When a new predictor variable emerges that seems too good to be true, it probably is. Fortunately, these examples of fool’s gold are fairly easy to detect and remove — if you’re careful.

I take a methodical approach to model-building, following the methods I’ve been taught. I choose not to use a semi-automated process that tosses all the predictor variables into the analysis at once and picks the best for you (stepwise regression). Rather, I add variables to the regression table one by one; I observe their effects on the p-values of previously added variables, their effects on adjusted R-squared, and whether an effect is different from what I expected (a negative coefficient instead of a positive one, or vice-versa).
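One of the checks described above, watching adjusted R-squared as each variable goes in, can be sketched in a few lines. This is only an illustration using the standard adjusted R-squared formula for a model with an intercept; the function names are made up for this example, and it says nothing about the p-value and coefficient-sign checks, which still need a human eye.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Standard adjusted R-squared for a model that includes an intercept."""
    if n_cases - n_predictors - 1 <= 0:
        raise ValueError("need more cases than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)


def worth_keeping(r2_before, r2_after, n_cases, k_before):
    """True if adding one more predictor raised adjusted R-squared.

    r2_before / r2_after are the plain R-squared values without and with
    the new variable; k_before is the predictor count before adding it.
    """
    before = adjusted_r_squared(r2_before, n_cases, k_before)
    after = adjusted_r_squared(r2_after, n_cases, k_before + 1)
    return after > before
```

With a small data set, a variable that nudges R-squared from 0.3000 to 0.3001 would fail this check, because the penalty for the extra predictor outweighs the tiny gain. With tens of thousands of alumni records the penalty is negligible, which is one reason the other checks matter.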

It’s a bit like playing blackjack. You can stand, or you can ask to be hit with another variable. Unlike blackjack, if you do go over 21, you can always tell the dealer to take the last variable back and stop there.

My guide in choosing which variable to add next is its correlation coefficient (Pearson Product-Moment), a numerical value that indicates strength of linear correlation with my dependent variable. So I’m working with a deck of cards I’ve stacked in advance: All of my 80 or 90 predictors, sorted in descending order by strength of correlation (positive OR negative).
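Stacking the deck this way amounts to computing Pearson's r for each predictor against the dependent variable and sorting by absolute value. Here is a rough sketch in plain Python; the function names and the dictionary-of-columns layout are assumptions made for illustration, not how any particular stats package does it.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("columns must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / sqrt(var_x * var_y)


def stack_the_deck(predictors, target):
    """Return (name, r) pairs sorted by strength of correlation, strongest first.

    predictors: dict mapping a variable name to its column of values.
    Sorting uses abs(r), so strong negative correlations rank high too.
    """
    scored = [(name, pearson(values, target)) for name, values in predictors.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)
```

Calling `stack_the_deck` on a dictionary holding all 80 or 90 predictor columns would give the order in which to try them.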

This deck is a suggestion only: There’s no telling in advance how a variable will interact with others already in the regression. A variable strongly correlated with Lifetime Giving may be left out of the model because it happens to also be highly correlated with other predictor variables added previously. It may add insufficient additional predictive value to the model to keep, and its inclusion might introduce unwanted noise in the scores I will output later.
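The two-predictor case shows why a strongly correlated variable can still add nothing. For a regression of $y$ on two predictors, R-squared can be written in terms of the three pairwise correlations. The numbers below are invented purely for illustration and do not come from my data set.

$$
R^2 = \frac{r_{y1}^2 + r_{y2}^2 - 2\, r_{y1}\, r_{y2}\, r_{12}}{1 - r_{12}^2}
$$

Suppose the first variable has $r_{y1} = 0.50$, the second has $r_{y2} = 0.45$, and the two predictors are correlated with each other at $r_{12} = 0.90$:

$$
R^2 = \frac{0.25 + 0.2025 - 2(0.50)(0.45)(0.90)}{1 - 0.81} = \frac{0.0475}{0.19} = 0.25
$$

That is exactly $r_{y1}^2$, the R-squared from the first variable alone, so the second variable contributes nothing despite its respectable correlation of 0.45. If the two predictors were uncorrelated ($r_{12} = 0$), the same pair would give $R^2 = 0.4525$.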

In other words, returning to blackjack, every variable is an ace: Could be counted as 11, could be counted as only 1.

It’s always interesting, though, to see what variables are shuffled to the top of the deck, particularly if they’re new to you and you don’t know much about them. So it was with a variable I encountered recently: “Society Name”, from which I created a 0/1 indicator variable, “Society Name Present.” Its Pearson correlation with Lifetime Giving (log transformed) was 0.582, which is quite high, especially compared with the next value down, Number of Student Activities, with a value of 0.474.

I didn’t know what a “society name” was, but I went ahead anyway and entered it into a regression with LT Giving as the dependent variable. Right off the bat, my R squared shot up to well over 33%. In other words, this one variable was accounting for more than a third of the variability in the value of LT Giving.
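The jump is no mystery. In a regression with a single predictor and an intercept, R-squared is simply the square of the Pearson correlation. Assuming the regression used the same log-transformed LT Giving that produced the correlation of 0.582:

$$
R^2 = r^2 = 0.582^2 \approx 0.339
$$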

Good news? NO! I busted!

For this sort of data set, the whole alumni population — donors AND non-donors, and all ages — the most powerful predictor is normally Age, or a proxy of age, such as Class Year, or some transformation of either of these. For a relatively obscure variable to suddenly claim top spot in a modeling scenario that I’m already so familiar with made it highly suspect.

I should have stopped there, but I continued developing the model. My final value for adjusted R squared was 57.5%, the highest R-squared value I’ve ever encountered.

What’s the harm, you ask? Well, it finally dawned on me, rather late, that this suspect variable was probably a proxy for “donor.” If you are trying to predict likelihood of becoming a donor, out of a pool that includes donors and non-donors, leaving this powerful, non-independent variable in the model will cause it to “predict” that your donors will become donors, and your non-donors won’t. This is not adding any intelligence to your prospecting.

Here’s where it pays to know your data. In this case, I didn’t. An inquiry led to the truth, that Society Name actually referred to a Gift Society name and was tied up with the characteristic of being either a donor or a prospective donor. But even without knowing this, I could have determined the variable’s unsuitability by asking myself two related questions:

1. Do donors represent a majority of the cases that have one of the two values (0 or 1) of the variable?
2. Are a large majority of donors coded with the same value (either 0 or 1) for this variable?

If the answer to either question is “yes,” then you may have a variable that is not sufficiently independent for a model intended to predict likelihood of being a donor. In this case, only 31% of alumni with a Society Name present are donors, so that’s OK. But more than 97% of donors have a Society Name. That’s suspicious.
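Both questions come down to a pair of proportions for each value of the indicator. The sketch below computes them from two 0/1 columns. The function names and the 0.9 cutoff are my own assumptions for illustration: the questions as posed say "majority", and where to draw the line is a judgment call.

```python
def donor_overlap(flag, is_donor):
    """For each value of a 0/1 flag, report how it overlaps with donor status.

    donors_among_cases: share of cases with that value who are donors (question 1).
    cases_among_donors: share of all donors who carry that value (question 2).
    """
    if len(flag) != len(is_donor):
        raise ValueError("flag and is_donor must be the same length")
    total_donors = sum(1 for d in is_donor if d == 1)
    report = {}
    for value in (0, 1):
        cases = sum(1 for f in flag if f == value)
        donors = sum(1 for f, d in zip(flag, is_donor) if f == value and d == 1)
        report[value] = {
            "donors_among_cases": donors / cases if cases else None,
            "cases_among_donors": donors / total_donors if total_donors else None,
        }
    return report


def is_suspect(report, cutoff=0.9):
    """True if any share reaches the cutoff (0.9 is an arbitrary assumption)."""
    for shares in report.values():
        for share in shares.values():
            if share is not None and share >= cutoff:
                return True
    return False
```

For Society Name Present, the value-1 row would show roughly 0.31 for `donors_among_cases` and more than 0.97 for `cases_among_donors`, and the second figure is the one that trips the alarm.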

As a rule of thumb, then, treat any Pearson correlation value of 0.5 and above with caution, ask the two related questions above, and talk to whoever controls the data entry for the field in question to find out how that data gets in there and what it’s for. Leaving the variable out will drag your R-squared down, but you’ll end up with a much better hand against the dealer.
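The first step of that rule of thumb is easy to automate. This small sketch takes correlations that have already been computed and returns the variable names worth a second look. Applying the 0.5 threshold to the absolute value is my assumption, in keeping with sorting the deck by strength of correlation whether positive or negative; the function name is made up.

```python
def needs_a_second_look(correlations, threshold=0.5):
    """Return names whose correlation with the target is at or above the threshold.

    correlations: dict mapping variable name to its Pearson r with the target.
    The comparison uses abs(r), so strong negative correlations are flagged too.
    """
    return [name for name, r in correlations.items() if abs(r) >= threshold]


# The two values quoted in this post:
deck = {
    "Society Name Present": 0.582,
    "Number of Student Activities": 0.474,
}
flagged = needs_a_second_look(deck)  # ["Society Name Present"]
```

A flag here is only a prompt to ask questions, not a verdict. Age or Class Year may well clear 0.5 in some data sets and still belong in the model.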

1. “Correlation does not mean causation…”

Comment by David Robertson — 11 January 2011 @ 9:21 am

2. “…but it sure is a hint!” — Edward Tufte

JJ

Comment by JJ — 11 January 2011 @ 11:50 am

3. Hi Kevin,

Very informative and well-written article. Just wondering what tools you’re using for modeling – Statistica?

Thanks!

Comment by Ted Kaiser — 11 January 2011 @ 12:45 pm

• Thanks – I use Data Desk. I don’t have enough experience with other stats/modeling software to make comparisons, but the drag-and-drop nature of the interface in Data Desk encourages the one-at-a-time-and-see approach to doing regression analysis.

Comment by kevinmacdonell — 11 January 2011 @ 12:50 pm

4. I often wonder what is the best way to choose independent variables. I want the best of the collection of IV’s that are correlated to each other, like life giving and donor vs. nondonor, and I want the easiest programmable variables to land in my final query or equation. I often use cluster analysis and decision trees to try to walk around what you’re pondering here. But your article reminds me of good data preparation and exploration techniques.

Comment by Marianne Pelletier — 11 January 2011 @ 2:18 pm

5. I sometimes use my CHAID model as an example of co-linearity, the one where I added Life Giving as a variable against the dependent, donor/nondonor. The CHAID had a 100% prediction rate and used only Life Giving as its independent variable set, lol.

Comment by Marianne Pelletier — 18 April 2012 @ 10:36 am