CoolData blog

19 April 2010

How high, R-squared?

Filed under: Model building, Pitfalls, regression, Terminology. Posted by kevinmacdonell @ 10:24 am

A target value for R-squared is not chiseled in stone. (Image: Creative Commons license.)

A couple of years ago there was a discussion on the Prospect-DMM list about the perceived importance of the adjusted R-squared term in building predictive models using multiple regression. What’s the magic number that tells you your model is a good fit for your data?

R-squared is an overall measure of the success of a regression in predicting your dependent variable (DV) from your independent variables (IVs). Adjusted R-squared is the more commonly cited statistic when you’re using multiple predictors, because it accounts for the number of predictors in the equation (it’s usually lower than your result for non-adjusted R-squared). Data Desk expresses R-squared as a percent, so .345 is the same as 34.5%.
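For readers who like to see the arithmetic, here is a minimal sketch of the two statistics in Python with NumPy. It uses the standard textbook formulas, not Data Desk’s internals; the function names and the example figures (1,000 records, 10 predictors) are my own assumptions for illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variance in y accounted for by the predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in the DV
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the equation."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical example: R-squared of .345 from 1,000 records and 10 predictors.
print(adjusted_r_squared(0.345, n_obs=1000, n_predictors=10))
```

The penalty shrinks as the number of records grows relative to the number of predictors, which is why the two figures are often close on a large alumni file.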

R-squared sometimes gives rise to some mistaken ideas and strange claims, in my opinion. One idea is that when it comes to R-squared, you have to shoot for a very high result, the higher the better. And therefore any predictor variable is good to use as long as it increases R-squared. Perhaps as a result, you’ll sometimes encounter claims of R-squared values up around 60 percent (which I understand can happen) or even 80 percent.

Until I see some accounting for these results I’m taking them with a grain of salt. I’m thrilled when I see R-squared rise from 15% to 20%. My Phonathon model reached 25.4%, which I was very happy with. With a general giving model I can almost reach 40%. This tells me that my regression equation is accounting for, or “explaining”, about 40% of the variability in my DV. In this business, we’re making predictions about human behaviour, not the workings of physical systems, so to get to this level of insight from nothing is a big win.

If someone tells me they reached 40% for their model, I say that’s excellent. At 50%, though, I start to get suspicious. Anything beyond 60%, I just don’t buy at all.

What am I suspicious of? I’m suspicious that their independent variables are just stand-ins for their dependent variable. They are using ‘giving’ to predict ‘giving’ – a basic no-no. For example, I said earlier that my Phonathon model had an adjusted R-squared of 25.4%. Let’s say I create a new variable called ‘has giving’, and that I define this as an indicator variable, so that it has a value of 1 if the person has any giving via the Phonathon, and zero if not. When I put that variable into the regression as a predictor, my adjusted R-squared leaps from 25.4% to 93.0%!

Fantastic, right? Wrong! What if you came up with an equation that stated “Y is equal to Y”? Would that be amazing? No. It’s true, but it’s not interesting, and it has no predictive value. It’s like walking into a dark alley at night and finding your way using a mirror instead of a flashlight.
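The ‘has giving’ effect is easy to reproduce on made-up data. The sketch below is not my Phonathon model: the two predictors, the simulated giving outcome and every number in it are assumptions chosen only to show the mechanism. It fits an ordinary least-squares regression twice, once with two legitimate predictors and once with a ‘has giving’ indicator added.

```python
import numpy as np

def ols_r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R-squared."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(42)
n = 5000

# Two hypothetical legitimate predictors (standardized, purely simulated).
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Simulated DV: whether a person gives depends loosely on x1 and x2;
# donors get a log-scale gift amount, non-donors get zero.
p_give = 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * x1 + 0.5 * x2)))
has_giving = (rng.random(n) < p_give).astype(float)
y = has_giving * rng.normal(loc=5.0, scale=1.0, size=n)

r2_clean = ols_r_squared(np.column_stack([x1, x2]), y)
r2_leaky = ols_r_squared(np.column_stack([x1, x2, has_giving]), y)

print(f"without 'has giving': {r2_clean:.3f}")
print(f"with 'has giving':    {r2_leaky:.3f}")
```

The second figure comes out far higher than the first, yet the indicator tells you nothing you could know in advance about a prospect: it is the outcome wearing a different name.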

It can be more subtle than that. An example … We put on a donor-recognition gala every spring, and we hand out awards to people who reach certain milestones for longevity of giving. Both gala attendance and award status are coded in our database. Combined, these codes pertain to only 1.5% of the population – several hundred individuals. Even though we would expect the effect to be small, adding these two variables as predictors boosts R-squared (adjusted) in my Phonathon model by a full percentage point, to 26.4%.

This is quite significant, considering that by this point my model is mature (it’s full as a tick with variables!). But it’s not good news at all. I would never use gala attendance or award status to predict giving, because both variables are merely stand-ins for giving itself. (Peter Wylie refers to them as ‘proxy variables’.) If my DV were something else (a binary outcome for ‘major donor / not major donor’, say), then maybe I’d consider using one or both of them. But not when the DV is ‘giving’ itself.

If I take care to ensure that my independent variables are indeed not stand-ins for my dependent variable, then I’m going to get a lower R-squared as a matter of course. There are all kinds of legitimate ways to obtain a more robust model. Non-anonymous surveying of a broad swath of alumni is one of the best. If you can add all kinds of current, attitude-based data to the historical data already present in the database, I figure you’ll have gone almost as far as possible in modeling this aspect of human behaviour without attaching electrodes to people’s heads. But don’t expect to fit your model to the 60% level; if you do, you’re probably making a big mistake.

You might ask, “Doesn’t this caution about non-independence of predictors apply to a lot of other variables?” For example, it may be that many of the business phone numbers you’ve got in your database are a result of a gift transaction, and therefore the variable is not independent of giving. This is a good point, and there are grounds for debate. I subscribe to the position taken by Josh Birkholz in his book, “Fundraising Analytics.” In his discussion of the issue on page 190, he draws the “use/don’t use” line between variables that exist solely because of the behaviour you’re predicting (e.g. giving), and variables that exist partially because of the behaviour. Where you’ll draw the line might differ from project to project.

Using ‘business phone present’ as an example: Do 100% of the ‘business phone present’ cases have giving? Probably not. Those numbers probably came from various sources over the years, and they hold genuine predictive power.
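One rough way to ask that question of your own data is to compute the share of flagged records that have any giving. This is a minimal sketch, assuming a 0/1 flag and a lifetime giving amount per record; the function name and the toy values are hypothetical.

```python
import numpy as np

def share_with_giving(flag, lifetime_giving):
    """Fraction of records with flag set that have any giving at all."""
    flag = np.asarray(flag, dtype=bool)
    has_giving = np.asarray(lifetime_giving, dtype=float) > 0
    if not flag.any():
        return float("nan")  # no flagged records, so the share is undefined
    return float(has_giving[flag].mean())

# Toy data: five records, three with a business phone on file.
business_phone_present = [1, 1, 0, 1, 0]
lifetime_giving = [50, 0, 10, 20, 0]
print(share_with_giving(business_phone_present, lifetime_giving))
```

A share at or very near 1.0 suggests the variable exists solely because of giving and belongs on the “don’t use” side of the line; anything well short of that is a judgment call.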

So, what’s the magic number that tells you your model is a good fit for your data? I don’t think there’s an answer to that question, because I don’t think you can compare different models using R-squared. My old general models used to reach nearly 40%, but my phonathon models, which reach barely 25%, are FAR superior in their applicability to the task of predicting what we need to predict.

Use R-squared during your model-building to decide when to stop adding IVs to your regression, and then forget about it. If you want assurances about the effectiveness of your model (and you should), then test against a hold-out sample before you deploy your scores. And then after you deploy, mark a day on the calendar in the future when you will analyze how actual results break down by predictive score.
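Both checks can be sketched in a few lines. The code below is one possible approach, not a prescribed procedure: it sets aside a random hold-out sample before model building, and later summarizes actual results by score decile. The function names, the 30% hold-out fraction and the use of deciles are my assumptions.

```python
import numpy as np

def holdout_split(n_rows, holdout_fraction=0.3, seed=0):
    """Return (train_idx, holdout_idx): a random partition of row numbers."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_holdout = int(round(n_rows * holdout_fraction))
    return order[n_holdout:], order[:n_holdout]

def mean_outcome_by_decile(scores, actual, n_bins=10):
    """Average actual outcome within each score bin, lowest scores first.

    Assumes there are at least n_bins records, so that no bin is empty.
    """
    scores = np.asarray(scores, dtype=float)
    actual = np.asarray(actual, dtype=float)
    order = np.argsort(scores, kind="stable")   # rows sorted by score
    bins = np.array_split(order, n_bins)        # roughly equal-sized groups
    return [float(actual[idx].mean()) for idx in bins]

# Build the model on the training rows only, score the hold-out rows, then:
#   mean_outcome_by_decile(holdout_scores, holdout_actual_giving)
train_idx, holdout_idx = holdout_split(1000)
```

If the model is doing its job, average giving should climb more or less steadily from the bottom decile to the top one, both in the hold-out sample and in the real results you check after deployment.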
