CoolData blog

17 February 2010

Is ‘overfitting’ really a problem?

Filed under: Model building, Pitfalls, Planned Giving, Predictor variables — kevinmacdonell @ 8:06 am

Overfitting describes a condition where your model fits your data “too well”. The model describes your sample nearly perfectly, but it is too rigid to fit any other sample. It isn’t loose enough to serve your predictive needs.
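
To make this concrete, here is a minimal sketch of my own (not from the original post) using synthetic data: a ninth-degree polynomial fit to ten training records describes them almost perfectly, while a plain straight line generalizes far better to records the model has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a simple linear trend plus noise.
x = rng.uniform(0, 10, 20)
y = 2 * x + rng.normal(0, 3, size=20)

# Hold out half the records to see how each fit generalizes.
x_train, y_train = x[:10], y[:10]
x_test, y_test = x[10:], y[10:]

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_r2 = r_squared(y_train, np.polyval(coeffs, x_train))
    test_r2 = r_squared(y_test, np.polyval(coeffs, x_test))
    # The degree-9 fit scores near 1.0 on its own sample but collapses
    # on the holdout: it memorized the sample rather than the trend.
    print(f"degree {degree}: train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
```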

Is this something you ought to worry about? My response is a qualified ‘no’.

First, if your sample is very large, in the many thousands of records, and you’re modeling for a behaviour which is not historically rare (giving to the Annual Fund, for example), then overfit just isn’t an issue. Overfit is something to watch for when you’ve got small sample sizes or your data is limited in some way: building a Planned Giving or Major Giving model based on only a handful of existing cases of the desired behaviour, for example.
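
A rough illustration of that point, again with synthetic data of my own: give a regression many noisy predictors and only a few records, and its in-sample fit is flattered while its holdout fit suffers; give it thousands of records and the gap disappears.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

def fit_gap(n_records, n_predictors=20):
    # Only the first predictor carries real signal; the rest are noise.
    X = rng.normal(size=(n_records, n_predictors))
    y = X[:, 0] + rng.normal(size=n_records)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                              random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

for n in (50, 10000):
    train_r2, test_r2 = fit_gap(n)
    print(f"n = {n}: train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")
# With 50 records the in-sample R-squared is inflated and the holdout
# R-squared is poor; with 10,000 records the two are nearly identical.
```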

Overfit has always sounded like a theoretical problem to me, something that bothers analysts working at some rarefied higher level of modeling refinement. My goal has always been to improve on existing segmenting practices; if the bar is set at “throwing darts at the board,” one is going to be happy with the results of a predictive model, even if it’s wearing a too-restrictive corset.

And yet … doubts crept in.

While creating a model for Planned Giving potential, I discovered a characteristic prevalent among our existing expectancies which gave me pause. Many of our existing commitments are from clergy, a number of whom live in retirement on campus. This results from our institution’s history and its traditional association with the Roman Catholic Church. Not surprisingly, a name prefix identifying clergy turned out to be a highly predictive variable. Using the variable in the model would have boosted the fit – but at what cost?

Here’s the problem. Elderly clergy members may be the model for past and current expectancies, but I was not confident that the Planned Giving donors of the future would resemble them. Societal changes resulting in a growing distance between church and university were one of the reasons leading me to think that using this variable would be a mistake – this model needed more leeway than that. It took a while for me to make the connection between this gut feeling and the rather abstract concept of ‘overfit’.

This, then, is my advice: Forget about the theory and use common sense – are any of your predictor variables likely to do a much better job describing the reality of the past than that of the future? Don’t overthink it: If your gut’s mostly okay with it, then don’t worry about it. Otherwise, consider sacrificing a little R-squared to get a better model.
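
For readers who want to see what that sacrifice looks like, here is a hedged sketch (entirely synthetic; the variable names are hypothetical stand-ins, not fields from any real database) comparing a regression that keeps a suspect predictor against one that drops it. The trimmed model scores lower on historical data, which is precisely the trade being recommended.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 1000

# Hypothetical predictors. 'clergy_prefix' stands in for a variable that
# described the donors of the past but may not describe those of the future.
clergy_prefix = rng.binomial(1, 0.05, size=n)
years_of_giving = rng.poisson(8, size=n)
event_attendance = rng.poisson(2, size=n)

# Synthetic outcome in which the clergy flag genuinely carried signal
# in the historical data.
y = (3.0 * clergy_prefix + 0.5 * years_of_giving
     + 0.3 * event_attendance + rng.normal(0, 1, size=n))

X_full = np.column_stack([clergy_prefix, years_of_giving, event_attendance])
X_trimmed = X_full[:, 1:]  # the same data with the suspect variable dropped

for label, X in (("with clergy prefix", X_full),
                 ("without clergy prefix", X_trimmed)):
    r2 = LinearRegression().fit(X, y).score(X, y)
    print(f"{label}: R-squared = {r2:.3f}")

# Giving up a few points of R-squared on the past is the price of a model
# we can trust to describe the donors of the future.
```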

3 Comments »

  1. Hi,

    Interesting post. Do you really think that overfitting is not an issue? I mean it’s your right but it is quite provocative :-)

    With neural networks, for example, one can plot the training/validation error against the number of records used. One can usually see a point where the validation error becomes larger than the training error, which shows the overfitting issue.

    Are you saying that this is not important, or that it doesn’t happen that often?

    Regards.
    Sandro.

    Comment by Sandro — 26 February 2010 @ 5:09 pm

    • Ha ha – I don’t deliberately set out to provoke, but it happens! You’ve brought up a good point which I’ve never addressed: validating the model.

      My defence first: I’m suggesting that if stats novices (of which I am one) play it fairly safe with our models and are not asking the impossible, then we are going to do reasonably well with the segments we recommend to our fundraisers, overfit or not. By the impossible, I mean, for example, creating a major-gifts model based on the characteristics of five (or twenty-five) major gift donors.

      I hope I’m not making any generalizations about whether overfitting is “not important” or “doesn’t happen often.” In fact, I’ve given an example in which I could have risked introducing overfit (by hewing too close to the characteristics of a few donors of the past).

      But my bottom line is that those of us who are not trained statisticians (and in fundraising that’s most of us) would do well to keep our models relatively simple. And all of us need to use validation for any tricky models that involve relatively uncommon events such as Planned Giving or Major Giving. Thanks, Sandro, for your comment – a future post will have to deal with validation!

      Comment by kevinmacdonell — 26 February 2010 @ 5:52 pm

  2. Kevin, perhaps not in fundraising – but in the world of financial services I have seen it become a significant problem. There is an abundance of available predictors, many of which are high-cardinality, and practitioners (with shiny new software tools) are all too keen to throw in that next degree of freedom and show improved (in-sample) lift charts to management. I wholeheartedly agree that applying some common sense / gut intuition is a potential solution. Thanks for your post. – Mark

    Comment by mhookey — 16 July 2010 @ 12:31 am

