A couple of years ago there was a discussion on the Prospect-DMM list about the perceived importance of the adjusted R-squared term in building predictive models using multiple regression. What’s the magic number that tells you your model is a good fit for your data?

R-squared is an overall measure of the success of a regression in predicting your dependent variable from your independent variable(s). Adjusted R-squared is the more commonly-cited statistic when you’re using multiple predictors, because it accounts for the number of predictors in the equation (it’s usually lower than your result for non-adjusted R-squared). Data Desk expresses R-squared as a percent, so .345 is the same as 34.5%.
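The relationship between the two statistics is a simple formula: adjusted R-squared shrinks R-squared in proportion to the number of predictors. Here's a minimal sketch in plain Python (the helper `r_squared_stats` and the toy numbers are my own, for illustration only):

```python
def r_squared_stats(y, y_pred, n_predictors):
    """Return (R-squared, adjusted R-squared) for a fitted regression."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((yi - ypi) ** 2 for yi, ypi in zip(y, y_pred))  # residual sum of squares
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)                 # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # The adjustment penalizes each additional predictor in the equation
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj_r2

# Toy data: four observations, one predictor
r2, adj = r_squared_stats([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8], n_predictors=1)
print(r2, adj)  # adjusted comes out a little lower than the raw value
```

With more predictors and the same sample size, the gap between the two widens, which is exactly why adjusted R-squared is the one to cite for multiple regression.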

R-squared sometimes gives rise to mistaken ideas and strange claims, in my opinion. One idea is that when it comes to R-squared, you have to shoot for a very high result, the higher the better, and that therefore any predictor variable is good to use as long as it increases R-squared. Perhaps as a result, you’ll sometimes encounter claims of R-squared values up around 60 percent (which I understand can happen) or even 80 percent.

Until I see some accounting for these results I’m taking them with a grain of salt. I’m thrilled when I see R-squared rise from 15% to 20%. My Phonathon model reached 25.4%, which I was very happy with. With a general giving model I can almost reach 40%. This tells me that my regression equation is accounting for, or “explaining”, about 40% of the variability in my DV. In this business, we’re making predictions about human behaviour, not the workings of physical systems, so to get to this level of insight from nothing is a big win.

If someone tells me they reached 40% for their model, I say that’s excellent. At 50%, though, I start to get suspicious. Anything beyond 60%, I just don’t buy at all.

What am I suspicious of? I’m suspicious that their independent variables are just stand-ins for their dependent variable. They are using ‘giving’ to predict ‘giving’ – a basic no-no. For example, I said earlier that my Phonathon model had an adjusted R-squared of 25.4%. Let’s say I create a new variable called ‘has giving’, and that I define this as an indicator variable, so that it has a value of 1 if the person has any giving via the Phonathon, and zero if not. When I put that variable into the regression as a predictor, my adjusted R-squared leaps from 25.4% to 93.0%!

Fantastic, right? **Wrong!** What if you came up with an equation that stated “Y is equal to Y”? Would that be amazing? **No.** It’s true, but it’s not interesting, and it has no predictive value. It’s like walking into a dark alley at night and finding your way using a mirror instead of a flashlight.
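The effect is easy to reproduce with made-up data. In this sketch (Python with NumPy; every number is simulated, not from any real database), adding an indicator derived from the dependent variable itself inflates R-squared for free:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Simulated alumni file: one weak legitimate predictor plus lots of noise
years_since_grad = rng.uniform(1, 40, n)
giving = np.maximum(0.0, 5 * years_since_grad + rng.normal(0, 100, n))

def ols_r2(X, y):
    """Fit OLS with an intercept and return in-sample R-squared."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# Honest model: a modest R-squared from a genuinely independent predictor
r2_honest = ols_r2(years_since_grad.reshape(-1, 1), giving)

# Now add 'has giving' -- an indicator derived from the DV itself
has_giving = (giving > 0).astype(float)
r2_proxy = ols_r2(np.column_stack([years_since_grad, has_giving]), giving)

print(f"honest R2: {r2_honest:.3f}  with proxy: {r2_proxy:.3f}")
```

The proxy version always wins on R-squared, and it always deserves to lose: it has simply smuggled the answer into the question.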

It can be more subtle than that. An example … We put on a donor-recognition gala every spring, and we hand out awards to people who reach certain milestones for longevity of giving. Both gala attendance and award status are coded in our database. Combined, these codes pertain to only 1.5% of the population – several hundred individuals. Even though we would expect the effect to be small, adding these two variables as predictors boosts R-squared (adjusted) in my Phonathon model by a full percentage point, to 26.4%.

This is quite significant, considering that by this point my model is mature – it’s full as a tick with variables! But it’s not good news at all. I would never use gala attendance or award status to predict giving, because both variables are merely stand-ins for giving itself. (Peter Wylie refers to them as **‘proxy variables’**.) If my DV were something else – a binary outcome for ‘major donor / not major donor’, say – then maybe I’d consider using one or both of them. But not when the DV is ‘giving’ itself.

If I take care to ensure that my independent variables are indeed not stand-ins for my dependent variable, then I’m going to get a lower R-squared as a matter of course. There are all kinds of legitimate ways to obtain a more robust model. Non-anonymous surveying of a broad swath of alumni is one of the best. If you can add all kinds of current, attitude-based data to the historical data already present in the database, I figure you’ll have gone almost as far as possible in modeling this aspect of human behaviour without attaching electrodes to people’s heads. But don’t expect to fit your model to the 60% level; if you do, you’re probably making a big mistake.

You might ask, “Doesn’t this caution about non-independence of predictors apply to a lot of other variables?” For example, it may be that many of the business phone numbers you’ve got in your database are the result of a gift transaction, and therefore the variable is not independent of giving. This is a good point, and there are grounds for debate. I subscribe to the position taken by Josh Birkholz in his book, “Fundraising Analytics.” In his discussion of the issue on page 190, he draws the “use/don’t use” line between variables that exist **solely** because of the behaviour you’re predicting (e.g. giving) and variables that exist **partially** because of the behaviour. Where you draw the line might differ from project to project.

Using ‘business phone present’ as an example: Do 100% of the ‘business phone present’ cases have giving? Probably not. Those numbers probably came from various sources over the years, and they hold genuine predictive power.

So, what’s the magic number that tells you your model is a good fit for your data? I don’t think there’s an answer to that question, because I don’t think you can compare different models using R-squared. My old general models used to reach nearly 40%, but my phonathon models, which reach barely 25%, are **FAR superior** in their applicability to the task of predicting what we need to predict.

Use R-squared during your model-building to decide when to stop adding IVs to your regression, and then forget about it. If you want assurances about the effectiveness of your model (and you should), then test against a hold-out sample before you deploy your scores. And then after you deploy, mark a day on the calendar in the future when you will analyze how actual results break down by predictive score.
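The hold-out idea can be sketched in a few lines (Python with NumPy; the data here is entirely simulated, and a simple linear probability model stands in for whatever regression you actually use): set some cases aside, fit only on the rest, then check whether the high-scoring held-out cases really do contain more donors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Entirely made-up predictors and a binary 'donor' outcome
X = rng.normal(size=(n, 3))
signal = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1, n)
donor = (signal > 0).astype(float)

# Set aside 20% of cases BEFORE fitting anything
idx = rng.permutation(n)
train, hold = idx[:1600], idx[1600:]

# Fit a simple linear probability model on the training cases only
Xt = np.column_stack([np.ones(len(train)), X[train]])
beta, *_ = np.linalg.lstsq(Xt, donor[train], rcond=None)

# Score the hold-out cases the model never saw
Xh = np.column_stack([np.ones(len(hold)), X[hold]])
scores = Xh @ beta

# A useful model concentrates actual donors among the high scores
top = np.argsort(scores)[-100:]   # 100 highest-scoring hold-out cases
print(f"donor rate, top scores:   {donor[hold][top].mean():.2f}")
print(f"donor rate, all hold-out: {donor[hold].mean():.2f}")
```

If the donor rate among the top-scored hold-out cases isn’t comfortably above the overall rate, no value of R-squared will save the model.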

Back when I was in grad school in the psychology department, one professor had a husband who was a physicist. She said, “If R-squared is over .4, we get suspicious.” He said, “If R-squared is under .9, we get suspicious.”

:-)

It’s field dependent, I think

Comment by Peter Flom — 19 April 2010 @ 10:57 am

That anecdote illustrates the divide nicely. This couple must have had an interesting relationship. The question is, how could they agree on how significant the relationship was? (*groan*)

Comment by kevinmacdonell — 20 April 2010 @ 7:34 am

LOL I haven’t seen such a poor understanding of statistical concepts for years!

The language in this article suggests the author’s statistically clueless. First of all, you don’t ‘shoot’ for a particular R2. Secondly, judging a low R-squared down in the 20-40% range as excellent makes no sense.

If you don’t understand these two statements, you shouldn’t be using statistics. Take the time to understand what R2 actually means and I think you’ll find your article completely ridiculous. Sorry dude.

Comment by Pete — 21 October 2010 @ 8:44 am

Pete,

What’s the discipline that you work in and apply statistics to? I’m guessing it’s not data mining, and it’s DEFINITELY not behavioural prediction if you think you’re going to routinely exceed the 40% level. In predictive modeling we DO look to R-squared for some level of confidence that our predicted values have validity. Not exclusively R-squared, but that’s one of the primary indicators of goodness of fit. More generally, all I can say is that the applications of statistics range as broadly as all human knowledge, and practices and standards vary. Without knowing your area of expertise, I can’t assess how “clueless” I should feel. But thanks for your input!

Comment by kevinmacdonell — 21 October 2010 @ 10:14 am

Hey,

I’m one year late, I guess.

I think it’s the debate between statistically significant and significant with respect to the model studied (which in econ we call economically significant). The huge progress made in econometrics in the past 30 years has focused attention on the problem of endogeneity and the ways to alleviate it. This gave rise to more or less sophisticated techniques that make the interpretation of the R2 largely irrelevant, because you aim at capturing the “pure” effect of a variable, excluding everything else. Since the real world is pretty complicated (especially when you deal with humans), you end up with a very low R2, but at least you can be quite confident that, if the effect is significant, it exists. The problem is then that you can show the effect of a variable on an outcome, have statistical significance, but have an effect that is so small that it is irrelevant, both in terms of practical issues and research relevance.

We shall never forget the lessons learned from the IV, RDD, diff-in-diff and random experiment literature, but I think looking at the R-squared can be interesting if (1) you’re pretty confident in your model, and (2) you’re interested in actually predicting stuff.

I don’t understand the last sentence of the post: “If you want assurances about the effectiveness of your model (and you should), then test against a hold-out sample before you deploy your scores.”

What’s a hold-out sample? (I’m not a native speaker…)

Comment by Yannick — 8 May 2011 @ 6:58 pm

Hi Yannick – I find it hard to believe this post is a year old already. It’s never too late to comment, and thanks for your thoughts. I might not go so far as to say that R squared is irrelevant, but perhaps it is nearly so, because I would accept values in a very wide range. I don’t know how low R squared would have to go before I’d reject a model, or how high (although I would start to get nervous at 50%). But R squared isn’t the standard for judging the worth of a model; it only helps me decide when adding more predictor variables is no longer leading to improvements in fitting the data. If all my predictors are significant, then I probably have little reason to care about the actual value of R squared.

A hold-out sample is simply a subset of the data (which includes examples of your target variable) that is set aside (“held out”) and not used to train the model. For example, in a model for propensity to give in response to a phone solicitation, I might hold out a substantial number of phone donors. Then I would create two or more models, using different methods if possible, and see how the held-out cases were scored. (Even though they are held out, the regression will assign predicted values to the cases.) Whichever model does a better job of assigning scores to the holdout sample will be my choice. Even if you make only one model, however, observing how the holdout sample is scored will provide some confidence in the ability of your model to identify prospects.

Comment by kevinmacdonell — 9 May 2011 @ 8:05 pm

Hi and thanks for a good article.

Can you please provide some references on the topic (that for example, an r squared of 0.5 or 0.4 is good in behavioral sciences)?

I want to use it in my thesis. Thanks

Comment by GG — 3 July 2011 @ 2:55 pm

Dear Kevin,

I have found both the article and the discussion useful. May I second this request for references in behavioural sciences?

Many thanks!

Comment by CJM — 5 October 2011 @ 5:27 am

Well – I was not able to quickly put my hands on a citable source that addresses R square in the behavioural sciences, but I would refer you to the text, “Applied Multiple Regression/Correlation Analysis for the Behavioural Sciences,” (Cohen, Cohen, West, Aiken), in which the authors write about the differences between regression as used in the physical sciences (where causal factors may be limited in number, measured in clear-cut units, and where variables are likely to be independent of each other), and the behavioural sciences, where there is much more complexity facing the analyst — numerous causative or correlated variables, complex interactions among variables, factors that are poorly or vaguely measured, etc. etc. Cohen et al. discuss the issue of complexity from several angles in their introduction. They do not address model fit directly, but these factors directly affect the ability of an analyst to fit a model.

Comment by kevinmacdonell — 5 October 2011 @ 12:09 pm

I have a BIG problem. I am preparing a paper for my Quant class, my model is not working, and I need help! First I had a low R-squared and few significant variables (VERY FEW). I then found a relevant variable and introduced it to the model, and the R-squared jumped to .5 (from .06). Is that even possible? What does this indicate? My t-statistics are low, and thus the individual variables are not significant, but the R2 of the model is very high! Please help.

Comment by maha — 22 November 2011 @ 9:16 pm

Diagnosing your issue is a little dangerous without seeing your model output. I’d be curious about your sample size and the distribution of your dependent variable, first of all. If the distribution of your DV is severely skewed, it might benefit from a log transformation or some other transformation. It’s happened to me that I’ve forgotten to log-transform my DV (which in fundraising is frequently skewed) and wondered why so few variables were significant and R squared was so low. Having R squared leap up to .5 points to another issue altogether — you’ve likely introduced a variable which is not independent of your DV. You say it’s “relevant,” but maybe a little too relevant? Anyway — those are only the most obvious possible issues, and you may have already taken them into account. More than that I can’t say without having a look.
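To show what a log transformation does to a skewed DV, here is a toy example (plain Python; the giving amounts are made up, and `log1p` is used so a zero gift maps to zero):

```python
import math
import statistics

# Hypothetical lifetime-giving amounts: right-skewed, as giving data usually is
giving = [10, 25, 25, 50, 100, 250, 500, 5000, 25000]

# log(1 + x) compresses the long right tail
log_giving = [math.log1p(g) for g in giving]

# Crude skew check: a mean far above the median signals a heavy right tail
print(statistics.mean(giving) / statistics.median(giving))          # raw: badly skewed
print(statistics.mean(log_giving) / statistics.median(log_giving))  # transformed: close to 1
```

After the transformation, regression has a fighting chance: the handful of huge gifts no longer dominates the fit, and more predictors tend to reach significance.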

Comment by kevinmacdonell — 22 November 2011 @ 9:26 pm

Hey, great read. As an undergrad, I had no idea what ‘r-squared’ meant, and reporting the number that excel just popped out was pretty meaningless. Thanks to this, I understand what it actually means!

Thanks a lot.

Comment by asdf — 29 April 2012 @ 3:57 am