# CoolData blog

## 11 January 2011

### Blackjack, and the predictor that will bust your hand

Filed under: Pitfalls, Predictor variables, regression — kevinmacdonell @ 6:29 am

This hand looks great, but not for this game. (Image used via Creative Commons license.)

When a new predictor variable emerges that seems too good to be true, it probably is. Fortunately, these examples of fool’s gold are fairly easy to detect and remove — if you’re careful.

I take a methodical approach to model-building, following the methods I’ve been taught. I choose not to use a semi-automated process that tosses all the predictor variables into the analysis at once and picks the best for you (stepwise regression). Rather, I add variables to the regression table one by one, observing each addition’s effect on the p-values of previously added variables and on adjusted R-squared, and watching for any effect that differs from what I expected (a negative coefficient instead of a positive one, or vice versa).

It’s a bit like playing blackjack. You can stand, or you can ask to be hit with another variable. Unlike blackjack, if you do go over 21, you can always tell the dealer to take the last variable back and stop there.

My guide in choosing which variable to add next is its correlation coefficient (Pearson Product-Moment), a numerical value that indicates strength of linear correlation with my dependent variable. So I’m working with a deck of cards I’ve stacked in advance: All of my 80 or 90 predictors, sorted in descending order by strength of correlation (positive OR negative).
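
Stacking that deck is straightforward in, for example, pandas (the tool and column names here are my invention, not the post’s):

```python
# Rank candidate predictors by the absolute value of their Pearson
# correlation with the dependent variable. Variable names are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "homecomings": rng.normal(size=n),
    "class_year": rng.normal(size=n),
    "student_activities": rng.normal(size=n),
})
y = 1.5 * df["homecomings"] - 1.0 * df["class_year"] + rng.normal(size=n)

corrs = df.corrwith(y)   # one Pearson r per predictor
deck = corrs.reindex(corrs.abs().sort_values(ascending=False).index)
print(deck)              # strongest correlations first, sign preserved
```

Sorting by absolute value keeps strong negative correlations near the top of the deck, exactly as described above.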

This deck is a suggestion only: There’s no telling in advance how a variable will interact with others already in the regression. A variable strongly correlated with Lifetime Giving may be left out of the model because it happens to also be highly correlated with other predictor variables added previously. It may add insufficient additional predictive value to the model to keep, and its inclusion might introduce unwanted noise in the scores I will output later.

In other words, returning to blackjack, every variable is an ace: Could be counted as 11, could be counted as only 1.

It’s always interesting, though, to see what variables are shuffled to the top of the deck, particularly if they’re new to you and you don’t know much about them. So it was with a variable I encountered recently: “Society Name”, from which I created a 0/1 indicator variable, “Society Name Present.” Its Pearson correlation with Lifetime Giving (log transformed) was 0.582, which is quite high, especially compared with the next value down, Number of Student Activities, with a value of 0.474.

I didn’t know what a “society name” was, but I went ahead anyway and entered it into a regression with LT Giving as the dependent variable. Right off the bat, my R squared shot up to well over 33%. In other words, this one variable was accounting for more than a third of the variability in the value of LT Giving.

Good news? NO! I busted!

For this sort of data set, the whole alumni population — donors AND non-donors, and all ages — the most powerful predictor is normally Age, or a proxy of age, such as Class Year, or some transformation of either of these. For a relatively obscure variable to suddenly claim top spot in a modeling scenario that I’m already so familiar with made it highly suspect.

I should have stopped there, but I continued developing the model. My final value for adjusted R-squared was 57.5%, the highest I’ve ever encountered.

What’s the harm, you ask? Well, it finally dawned on me, rather late, that this suspect variable was probably a proxy for “donor.” If you are trying to predict likelihood of becoming a donor, out of a pool that includes donors and non-donors, leaving this powerful, non-independent variable in the model will cause it to “predict” that your donors will become donors, and your non-donors won’t. This is not adding any intelligence to your prospecting.

Here’s where it pays to know your data. In this case, I didn’t. An inquiry led to the truth, that Society Name actually referred to a Gift Society name and was tied up with the characteristic of being either a donor or a prospective donor. But even without knowing this, I could have determined the variable’s unsuitability by asking myself two related questions:

1. Do donors represent a majority of the cases that have one of the two values (0 or 1) of the variable?
2. Are a majority of donors coded either 0 or 1 for this variable?

If the answer to either question is “yes,” then you may have a variable that is not sufficiently independent for a model intended to predict likelihood of being a donor. In this case, only 31% of alumni with a Society Name present are donors, so that’s OK. But more than 97% of donors have a Society Name. That’s suspicious.
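
The two checks can be computed directly from the data. Here is a sketch on a ten-row toy table with invented column names, where the flag turns out suspicious on the second question:

```python
# The two leakage checks as quick pandas computations.
import pandas as pd

df = pd.DataFrame({
    "is_donor":             [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "society_name_present": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
})

# Q1: among records with the flag set, what share are donors?
q1 = df.loc[df["society_name_present"] == 1, "is_donor"].mean()

# Q2: among donors, what share have the flag set?
q2 = df.loc[df["is_donor"] == 1, "society_name_present"].mean()

print(round(q1, 2), round(q2, 2))   # 0.44 and 1.0: the second answer is the red flag
```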

As a rule of thumb, then, treat any Pearson correlation value of 0.5 and above with caution, ask the two related questions above, and talk to whoever controls the data entry for the field in question to find out how that data gets in there and what it’s for. Leaving the variable out will drag your R-squared down, but you’ll end up with a much better hand against the dealer.

## 26 August 2010

### What the heck IS “Y”, anyway?

Filed under: Model building, Predictive scores, regression, Statistics — kevinmacdonell @ 8:46 am

At the APRA Conference in Anaheim a month ago, a session attendee was troubled by something he saw during a presentation given by David Robertson of Syracuse. The attendee was focused on the “constant” value in David’s example of a multiple linear regression model for propensity to give. This constant, which I will talk about shortly, was some significant figure, say “50”. Because the Y value (i.e., the dependent variable, or outcome variable, or predicted value) was expressed in dollars (of giving), this seemed to indicate that the “floor” for giving, the minimum value someone could be predicted to give, was $50.

“How do you figure that?” this attendee wanted to know. It’s a reasonable question, and I will try to provide my own answer. (More knowledgeable stats people may wish to weigh in; it would be appreciated.) There are implications here for how we interpret the predicted value of “Y”.

When you do a regression analysis, your software will automatically calculate this “constant,” which is simply the first term (“a”) in the regression equation:

Y = a + b₁X₁ + b₂X₂ + … + bₙXₙ

In other words, if all your predictor variables (X’s) calculate out to zero, then Y will equal ‘a’. The part of this that the attendee found hard to swallow was that the minimum possible amount an alum could donate, as predicted by the model, was something greater than zero dollars. It seemed nonsensical.

Well, yes and no. First of all, the constant is no such floor. If you were to plot a regression line, that straight line has to cross the Y axis somewhere. The value of Y when all the X’s are zero is that crossing point (a.k.a. the Y-intercept). But that doesn’t mean it’s the minimum. Y does equal zero at one point: when the weighted sum of the predictors is negative enough to cancel out the constant — that is, when the regression line passes down and to the left of the Y axis and crosses the X axis (the X-intercept).
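
A toy line makes this concrete. With a constant of 50, predicted Y sits at 50 only when the predictors sum to zero, and happily falls below 50, and even below zero, as they go negative:

```python
# A toy line y = a + b*x with constant a = 50: the constant is the
# Y-intercept, not a floor on predictions.
a, b = 50.0, 10.0

def predict(x):
    return a + b * x

print(predict(0))    # 50.0: Y equals the constant when the predictors sum to zero
print(predict(-3))   # 20.0: already below the supposed "floor"
print(predict(-6))   # -10.0: below zero; the line crosses the X axis at x = -a/b = -5
```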

So, really, you’re not learning much by looking at the constant. It’s a mathematical necessity — it describes an important aspect of what any line looks like when plotted — but that’s all. While the constant is always present in our regression analysis for predictive modeling, we tend to ignore it.

But all this is leading me to an even more fundamental question, the one posed in the headline: What is Y?

In David’s example, the one that so perplexed my fellow conference attendee, Y was expressed in real dollars. This is valid modeling practice. However, I have never looked at Y in real units (i.e., dollars), due to difficulty in interpreting the result. For example, the output of multiple linear regression can be negative: Does that mean the prospect is going to take money from us? As well, when we work with a transformed version of the DV (such as the natural log, which is very common), the output will need to be transformed back in order to make sense.
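
For example, if the dependent variable was modeled as the log of giving (log1p, one common choice since it handles zero-dollar cases), the raw prediction must be back-transformed before it reads as dollars. The 6.9 below is an arbitrary stand-in for a model’s output:

```python
# Back-transform a prediction made on the log(1 + giving) scale.
import numpy as np

raw_prediction = 6.9                 # model output on the log scale
dollars = np.expm1(raw_prediction)   # inverse of np.log1p
print(round(dollars, 2))             # roughly a thousand dollars
```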

I sidestep issues of interpretation by simply assuming that the predicted value is meaningless in itself. What I am primarily interested in is relative probability, and where a value ranks in comparison with the values predicted for other individuals in the sample. In other words, is a prospect in the top 10% of alumni? Or the top 0.5%? Or the bottom 20%? The closer an individual is to the top of the heap, the more likely he or she is to give, and at higher levels.

I rank everyone in the sample by their predicted values, and then chop the sample up into deciles and percentiles. Percentiles, I am careful to explain, are not the same thing as probabilities: Someone in the 99th percentile is not 99% likely to make a gift. They might be 60% likely — it depends. The important thing is that someone in the 98th percentile will be slightly less likely to give, and someone in the 50th percentile will be MUCH less likely to give.
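
Chopping the ranked predictions into percentiles and deciles is nearly a one-liner apiece in pandas (assumed here purely for illustration):

```python
# Rank raw predicted values, then bin into percentiles and deciles.
# 'scores' stands in for the raw output of a regression model.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
scores = pd.Series(rng.normal(size=1000))

percentile = scores.rank(pct=True).mul(100).apply(np.ceil).astype(int)  # 1 (bottom) .. 100 (top)
decile = pd.qcut(scores, 10, labels=range(1, 11))                       # 1 (bottom) .. 10 (top)

print(percentile.max(), decile.value_counts().min())
```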

This highlights an important difference between multiple linear regression, which I’m talking about here, and binary logistic regression. The output of the latter form of regression is “probability”; very useful, and not so difficult to interpret. Not so with multiple linear regression — the output in this case is something different, which we may interpret in various ways but which will not directly give us values for probability.

Fortunately, fundraisers are already very familiar with the idea of ranking prospects in descending order by likelihood (or capacity, or inclination, or preferably some intelligent combination of these). Most people can readily understand what a percentile score means. For us data modelers, though, getting from “raw Y” to a neat score takes a little extra work.

## 9 June 2010

### Why multiple regression?

Filed under: Model building, regression, Statistics — kevinmacdonell @ 6:11 am

Not long ago I wrote about Pearson’s r, also known as Pearson’s Product-Moment Correlation Coefficient. This is a convenient statistical tool available in any stats software program (Excel can calculate it too) that yields a numerical measure of the strength of the correlation (linear dependence) between any two variables, X and Y.

I will show you how you can easily create a predictive score using only Pearson’s r — and why you probably shouldn’t!

Pearson’s r points the way toward weighting your predictor variables, according to how strongly correlated with your predicted value they actually are. If you assume all your predictor variables are worth the same (“1” for positive predictors and “-1” for negative predictors) you are imposing a subjective weighting on your variables. (I wrote about this limitation of the simple score in my most recent post, Beyond the simple score.)

Is Homecoming attendance more or less predictive than presence of an email address? Pearson’s r will tell you. Look at the example in the table above. Ten common predictor variables are listed in order of their strength of correlation with Lifetime Giving (log-transformed). It just so happens that the highest two correlations are negative; even though they are negative, according to their absolute value they are stronger correlations than any of the others in the list, so I put them at the top.  According to these values, “Email present” is relatively weak compared with “Number of Homecomings attended.”

Do you see where this is going? If you wanted to, you could use these correlations to directly create weighted scores for everyone in your database. Just multiply each variable by its Pearson value, sum up the products, and bingo — there’s your score.
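
For concreteness, here is what that naive Pearson-weighted score looks like on simulated data with invented variable names. It runs, and it even correlates with the target, which is exactly why it is tempting:

```python
# The naive Pearson-weighted score: multiply each predictor by its r
# with the target and sum across predictors.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 400
df = pd.DataFrame({
    "homecomings": rng.normal(size=n),
    "email_present": rng.integers(0, 2, n).astype(float),
})
target = 1.2 * df["homecomings"] + 0.4 * df["email_present"] + rng.normal(size=n)

weights = df.corrwith(target)                     # one Pearson r per predictor
naive_score = df.mul(weights, axis=1).sum(axis=1)
print(round(naive_score.corr(target), 2))         # looks respectable...
```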

You could do that, but I don’t think you should.

At least two of the predictor variables we want to use in our score are very closely related to each other: Employer present and Position present. They aren’t exactly alike: For some constituents you will have one piece of information and not the other. But on the whole, if one is present in your database for any given constituent, chances are you’ve got the other as well.

In other words, if you include both variables in your score, you’re double-counting the effect of employment information in your model — despite the fact that each is properly weighted by Pearson score. The reason is that Pearson’s r treats only two variables at a time: X and Y. It does NOT account for any interactions between multiple Xs.

Employment variables are only an obvious example. All of your variables will interact with each other to some degree, some strongly, others more subtly. By “interact with each other” I mean “correlate with each other.” In fact, we can use Pearson’s r to show which combinations of predictor variables are strongly correlated. The table below lists three variable pairs, drawn from real data, that exhibit strong interactions — including the employment example we’ve just mentioned.

The Pearson value for the employment variables is very close to 1, which indicates a nearly perfect positive correlation. The other two are more subtle, but make sense as well: Younger alumni will tend to be coded as Single in your database, and if we have a job title for someone, chances are we’ll also have a business phone number as well.
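
A quick way to surface such pairs is to compute the full correlation matrix among the predictors and sort the off-diagonal values. The data below is simulated so that the two employment flags agree 95% of the time:

```python
# Use Pearson's r between the predictors themselves to flag overlapping pairs.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 500
employer = rng.integers(0, 2, n).astype(float)
position = np.where(rng.random(n) < 0.95, employer, 1 - employer)  # 95% agreement
df = pd.DataFrame({
    "employer_present": employer,
    "position_present": position,
    "homecomings": rng.normal(size=n),
})

corr = df.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # each pair once
pairs = upper.stack().sort_values(key=abs, ascending=False)
print(pairs.head())   # the employment pair should top the list
```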

This overlapping of the explanatory effect of various X’s on Y will interfere with our ability to properly weight our predictors. Pearson’s Product-Moment Correlation Coefficient is important for understanding our variables, but it is not quite up to the task of directly creating predictive scores. What now?

Well — multiple regression! Only regression will account for interactions among our predictor Xs, recalculating the coefficients (weightings) on the fly each time we add a new predictor variable. Working from the Pearson list at the top of this post, we would add Class year, Single, and Employer present to our regression window one by one. Everything would be fine up to that point; our p-values will be very low for these variables. When we add Position present, however, the p-value will be too high (0.183, which exceeds the rule-of-thumb value of 0.05), and R squared will fail to improve. We would therefore leave Position present out of the regression because it isn’t adding any new predictive information to the model and might interfere with the effectiveness of other variables in subtle and strange ways.

Often when I use the word “regression” on someone, what I see reflected back in their eyes is fear. (I really need to reserve that word for people I don’t like.) I wish, though, that people could see that regression is a bit like an automobile: A complex machine with many moving parts, but familiar and approachable, with a simple and comprehensible purpose, and above all operable by just about anyone.