Overfitting describes a condition where your model fits your data “too well”. Your model describes your sample nearly perfectly, but is too rigid to fit any other sample. It isn’t loose enough to serve your predictive needs.
Is this something you ought to worry about? My response is a qualified ‘no’.
First, if your sample is very large, in the many thousands of records, and you’re modeling for a behaviour which is not historically rare (giving to the Annual Fund, for example), then overfit just isn’t an issue. Overfit is something to watch for when you’ve got small sample sizes or your data is limited in some way: building a Planned Giving or Major Giving model based on only a handful of existing cases of the desired behaviour, for example.
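The small-sample intuition can be sketched in a few lines of code. This is not from the original piece; it is a hypothetical illustration using numpy, where a flexible model (a high-degree polynomial) is fit to a dozen noisy points and compared against a simple one on fresh data from the same process:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy linear process, sampled twice.
def sample(n):
    x = rng.uniform(0, 10, n)
    y = 2.0 * x + rng.normal(0.0, 3.0, n)  # true signal plus noise
    return x, y

x_train, y_train = sample(12)    # a small training sample
x_test, y_test = sample(200)     # fresh data from the same process

def rmse(coeffs, x, y):
    """Root-mean-square error of a fitted polynomial on (x, y)."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2)))

simple = np.polyfit(x_train, y_train, 1)    # a loose, simple model
flexible = np.polyfit(x_train, y_train, 9)  # enough knobs to chase the noise

# The flexible model wins on the sample it was built from...
assert rmse(flexible, x_train, y_train) < rmse(simple, x_train, y_train)
# ...but it tends to lose badly on data it has never seen,
# which is the practical meaning of overfit.
```

With thousands of records instead of twelve, the flexible model has far less room to chase noise, which is the sense in which large samples make overfit a non-issue.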
Overfit has always sounded like a theoretical problem to me, something that bothers analysts working at some rarefied higher level of modeling refinement. My goal has always been to improve on existing segmenting practices; if the bar is set at “throwing darts at the board,” one is going to be happy with the results of a predictive model, even if it’s wearing a too-restrictive corset.
And yet … doubts crept in.
While creating a model for Planned Giving potential I discovered a characteristic prevalent among our existing expectancies which gave me pause. Many of our existing commitments are from clergy, a number of whom live in retirement on campus. This results from our institution’s history and its traditional association with the Roman Catholic Church. Not surprisingly, a name prefix identifying clergy turned out to be a highly predictive variable. Using the variable in the model would have boosted the fit – but at what cost?
Here’s the problem. Elderly clergy members may be the model for past and current expectancies, but I was not confident that the Planned Giving donors of the future would resemble them. Societal changes, and the growing distance between church and university that comes with them, were among the reasons leading me to think that using this variable would be a mistake – this model needed more leeway than that. It took a while for me to make the connection between this gut feeling and the rather abstract concept of ‘overfit’.
This, then, is my advice: Forget about the theory and use common sense – are any of your predictor variables likely to do a much better job describing the reality of the past than that of the future? Don’t overthink it: If your gut’s mostly okay with a variable, then don’t worry about it. Otherwise, consider sacrificing a little R-squared to get a better model.