CoolData blog

18 April 2012

Stepwise, model-foolish?

Filed under: Model building, Pitfalls, regression, Software, Statistics | kevinmacdonell @ 8:00 am

My approach to building predictive models using multiple linear regression might seem plodding to some. I add predictor variables to the regression one by one, instead of using stepwise methods. Even though the number of predictor variables I use has greatly increased, and the time needed to build a model has lengthened, I am even less likely to use stepwise regression today than I was a few years ago.

Stepwise regression, available in most stats software, tosses all the candidate predictor variables into the analysis at once and picks the best for you. It’s a semi-automated process that can work forwards or backwards, adding or deleting variables one at a time until no further change satisfies a statistical rule of thumb, typically a significance threshold. The software should give you some control over the process, but mostly your computer is making all the big decisions.
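For readers who have never looked under the hood, here is a rough sketch of the forward version in Python with NumPy. It is not what any particular stats package does: I am assuming a simplified stopping rule (stop when the best available gain in R-squared falls below a made-up `min_gain`), whereas real software typically relies on criteria such as p-values, F-tests or AIC. The data and the names `x0`, `x1` and `noise` are synthetic and purely illustrative.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R-squared of an ordinary least squares fit (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def forward_stepwise(X, y, names, min_gain=0.01):
    """Greedy forward selection.

    At each step, add the predictor that raises R-squared the most.
    Stop when the best gain is below min_gain (an assumed, simplified rule).
    """
    selected, remaining = [], list(range(X.shape[1]))
    best_r2 = 0.0
    while remaining:
        # Try each unused predictor alongside those already selected.
        trials = []
        for j in remaining:
            r2 = r_squared(X[:, selected + [j]], y)
            trials.append((r2 - best_r2, j, r2))
        gain, j, r2 = max(trials)
        if gain < min_gain:
            break
        selected.append(j)
        remaining.remove(j)
        best_r2 = r2
    return [names[j] for j in selected], best_r2

# Synthetic data: two real predictors and one column of pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] + 1 * X[:, 1] + rng.normal(size=200)

chosen, r2 = forward_stepwise(X, y, ["x0", "x1", "noise"])
print(chosen, round(r2, 3))
```

Notice that nothing in the loop knows what the variables mean. It only sees the fit statistic, and an in-sample one at that.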

I understand the allure. We’re all looking for ways to save time, and generally anything that automates a repetitive process is a good thing. Given a hundred variables to choose from, I wouldn’t be surprised if my software were able to get a better-fitting model than I could produce on my own.

But in this case, it’s not for me.

Building a decent model isn’t just about getting a good fit in terms of a high R-squared. That statistic tells you how well the model fits the data it was built on, not data the model hasn’t yet seen, which is where the model does its work (or doesn’t). The true worth of the model is revealed only over time, but you’re more likely to succeed if you’ve applied your knowledge and judgement to variable selection. I tend to add variables one by one in order of the strength of their Pearson correlation with the target variable, but I am also aware of groups of variables that are highly correlated with each other and likely to cause issues if they enter the model together. The process is not so repetitive that it can always be automated. Stepwise regression is more apt to select a lot of trivial variables with overlapping effects and ignore a significant predictor that I know will do the job better.
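The bookkeeping behind that manual approach can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the variable names, the toy values and the 0.8 cutoff for "highly correlated" are all made up, and the functions `rank_predictors` and `correlated_pairs` are hypothetical helpers, not part of any library. The output is a to-do list for the modeller, not a model.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (a constant list would divide by zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_predictors(predictors, target):
    """Order predictors by absolute correlation with the target, strongest first."""
    scored = [(name, pearson(vals, target)) for name, vals in predictors.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)

def correlated_pairs(predictors, threshold=0.8):
    """Flag pairs of predictors whose correlation with each other is at or
    above the threshold (0.8 is an arbitrary cutoff chosen for illustration)."""
    names = list(predictors)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(predictors[a], predictors[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged

# Toy data, invented for this example only.
target = [1, 2, 3, 4, 5, 6]
predictors = {
    "years_since_grad": [2, 4, 5, 8, 10, 12],
    "age": [24, 27, 27, 31, 32, 35],
    "has_email": [0, 1, 0, 1, 1, 0],
}

ranked = rank_predictors(predictors, target)
flagged = correlated_pairs(predictors)

for name, r in ranked:
    print(f"{name}: r = {r:.2f}")
for a, b, r in flagged:
    print(f"watch out: {a} and {b} overlap (r = {r:.2f})")
```

The ranking suggests an order in which to try variables, and the flagged pairs are the ones where I would stop and decide for myself which of the two belongs in the model.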

Or so I suspect. My avoidance of stepwise regression has always been due to a vague antipathy rather than anything based on sound technical concerns. This collection of thoughts I came across recently lent some justification to this undefined feeling: Problems with stepwise regression. Some of the authors’ concerns are indeed technical, but the ones that resonated the most for me boiled down to this: Automated variable selection divorces modellers from the process so that they are less likely to learn things about their data. It’s just not as much fun when you’re not making the selections yourself, and you’re not getting a feel for the relationships in your data.

Stepwise regression may hold appeal for beginning modellers, especially those looking for push-button results. I can’t deny that software for predictive analysis is getting better and better at automating some of the most tedious aspects of model-building, particularly in preparing and cleaning the data. But for any modeller, especially one working with unfamiliar data, nothing beats adding and removing variables one at a time, by hand.

