CoolData blog

18 April 2012

Stepwise, model-foolish?

Filed under: Model building, Pitfalls, regression, Software, Statistics — kevinmacdonell @ 8:00 am

My approach to building predictive models using multiple linear regression might seem plodding to some. I add predictor variables to the regression one by one, instead of using stepwise methods. Even though the number of predictor variables I use has greatly increased, and the time needed to build a model has lengthened, I am even less likely to use stepwise regression today than I was a few years ago.

Stepwise regression, available in most stats software, tosses all the predictor variables into the analysis at once and picks the best for you. It’s a semi-automated process that can work forwards or backwards, adding or deleting variables until it’s satisfied a statistical rule of thumb. The software should give you some control over the process, but mostly your computer is making all the big decisions.
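To make the idea concrete, here is a minimal sketch of forward stepwise selection in plain Python. It is not what any particular stats package does: the stopping rule (a hypothetical `min_gain` threshold on the improvement in R squared), the function names and the toy data are all assumptions for illustration. Real software typically stops on p-values or an information criterion instead.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(columns, y):
    """Fit OLS with an intercept on the given predictor columns and return R squared."""
    n = len(y)
    design = [[1.0] * n] + [list(c) for c in columns]  # one inner list per variable
    k = len(design)
    xtx = [[sum(design[i][t] * design[j][t] for t in range(n)) for j in range(k)]
           for i in range(k)]
    xty = [sum(design[i][t] * y[t] for t in range(n)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations
    fitted = [sum(beta[i] * design[i][t] for i in range(k)) for t in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y)
    return 1.0 - ss_res / ss_tot


def forward_stepwise(predictors, y, min_gain=0.01):
    """Greedily add the predictor that raises R squared most; stop when the gain is small.

    predictors maps a variable name to its list of values.
    min_gain is an illustrative stopping rule, not a standard one.
    """
    selected, best_r2 = [], 0.0
    remaining = set(predictors)
    while remaining:
        scores = {
            name: r_squared([predictors[s] for s in selected] + [predictors[name]], y)
            for name in remaining
        }
        name = max(sorted(scores), key=scores.get)  # sorted() makes ties deterministic
        if scores[name] - best_r2 < min_gain:
            break
        selected.append(name)
        best_r2 = scores[name]
        remaining.remove(name)
    return selected, best_r2


# Made-up toy data: y depends on x1 and x3 only; x2 is a near copy of x1.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2, 1, 4, 3, 6, 5, 8, 7],
    "x3": [1, 0, 0, 1, 0, 1, 1, 0],
}
y = [3 * a + 5 * b for a, b in zip(predictors["x1"], predictors["x3"])]

selected, r2 = forward_stepwise(predictors, y)
print(selected, round(r2, 4))
```

On this toy data the loop picks x1, then x3, and stops without adding x2, because x2 contributes nothing once x1 is in the model. Real data is rarely this tidy, which is where the automated choice and the modeller's judgement can part ways.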

I understand the allure. We’re all looking for ways to save time, and generally anything that automates a repetitive process is a good thing. Given a hundred variables to choose from, I wouldn’t be surprised if my software was able to get a better-fitting model than I could produce on my own.

But in this case, it’s not for me.

Building a decent model isn’t just about getting a good fit in terms of high R square. That statistic tells you how well the model fits the data that the model was built on — not data the model hasn’t yet seen, which is where the model does its work (or doesn’t). The true worth of the model is revealed only over time, but you’re more likely to succeed if you’ve applied your knowledge and judgement to variable selection. I tend to add variables one by one in order of their Pearson correlation with the target variable, but I am also aware of groups of variables that are highly correlated with each other and likely to cause issues. The process is not so repetitive that it can always be automated. Stepwise regression is more apt to select a lot of trivial variables with overlapping effects and ignore a significant predictor that I know will do the job better.
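The screening step described above can be sketched in a few lines of plain Python: rank candidate predictors by the absolute value of their Pearson correlation with the target, and flag pairs of predictors that are strongly correlated with each other. The variable names, the toy values and the 0.8 cutoff are all hypothetical, chosen only to illustrate the idea.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def rank_by_correlation(predictors, target):
    """Return predictor names ordered by |correlation with the target|, strongest first."""
    return sorted(
        predictors,
        key=lambda name: abs(pearson(predictors[name], target)),
        reverse=True,
    )


def correlated_pairs(predictors, threshold=0.8):
    """Return pairs of predictors whose |correlation| meets the (illustrative) threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(predictors), 2)
        if abs(pearson(predictors[a], predictors[b])) >= threshold
    ]


# Made-up toy data; event_invites is nearly a copy of events_attended.
predictors = {
    "events_attended": [1, 2, 3, 4, 5, 6, 7, 8],
    "event_invites": [2, 1, 4, 3, 6, 5, 8, 7],
    "has_email": [1, 0, 0, 1, 0, 1, 1, 0],
}
lifetime_giving = [8, 6, 9, 17, 15, 23, 26, 24]

ranking = rank_by_correlation(predictors, lifetime_giving)
flagged = correlated_pairs(predictors)
print("Order to try:", ranking)
print("Watch out for:", flagged)
```

The ranking only suggests an order in which to try variables. The flagged pairs are the ones where I would expect to keep just one of the two, or combine them, before going any further.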

Or so I suspect. My avoidance of stepwise regression has always been due to a vague antipathy rather than anything based on sound technical concerns. A collection of thoughts I came across recently lent some justification to this undefined feeling: Problems with stepwise regression. Some of the authors’ concerns are indeed technical, but the ones that resonated the most for me boiled down to this: Automated variable selection divorces the modeller from the process, so that he or she is less likely to learn things about the data. It’s just not as much fun when you’re not making the selections yourself, and you’re not getting a feel for the relationships in your data.

Stepwise regression may hold appeal for beginning modellers, especially those looking for push-button results. I can’t deny that software for predictive analysis is getting better and better at automating some of the most tedious aspects of model-building, particularly in preparing and cleaning the data. But for any modeller, especially one working with unfamiliar data, nothing beats adding and removing variables one at a time, by hand.



  1. I’ve had experiences with both of these and agree that I like adding them manually as well. I like the control. However, I’ve also had times where adding them manually can get really confusing. I think we all know that adding a variable to an existing set can have an unintended cascading impact, which may trace back to a relationship with a variable you added much earlier in the process. So figuring that out can be extremely time-consuming. But I’ve also had times where my manually built model was more solid than the step-wise. I guess it just depends on what you have time and energy to do. I do think step-wise is nice though when you need to limit the universe of choices.

    Comment by Jason Boley — 18 April 2012 @ 9:03 am

    • Jason – Yes, it has to do with a feeling of control for me as well, along with it being more fun. I don’t deal with hundreds of inputs, so the manual process is manageable. Other commenters here have suggested using stepwise for exploratory work prior to actually building the model, which I think is a great idea. Oh – prior to adding any variables to a regression model, I have a look at a Pearson correlation matrix of all the variables to identify pairs which may be highly correlated and will cause the ‘cascading impact’ you mention. Not perfect, because interactions can be a lot more complex than two-way and cannot be anticipated, but if I do see two highly correlated predictors I will know in advance that I will want to keep only one of them — or combine them somehow.

      Comment by kevinmacdonell — 6 May 2012 @ 6:42 am

  2. Hi Kevin,
    I recently watched a webinar in which cleaning tools were promoted and I’m wondering how effective an automated function could be, for similar reasons to your argument against using stepwise. In my experience, the cleaning process was tedious, laborious, time-consuming, and sometimes frustrating, so the idea is appealing, but it seems too good to be true. What are your experiences with the automated cleaning functions?

    Comment by Kelly Heinrich — 18 April 2012 @ 9:15 am

    • Kelly – I have only seen data cleaning tools demonstrated; I’ve never actually used any. I agree that preparing the data can be incredibly tedious, and some automated help is of great advantage, as long as the software suggests options for you to pick from or gives you the ability to override an automated decision. That doesn’t concern me so much. It’s the actual model-building that needs your control and supervision. (In my opinion.)

      On the other hand, I’ve learned a lot about the peculiarities of our data from having to clean it up manually, and asking questions about what certain codes mean.

      Comment by kevinmacdonell — 6 May 2012 @ 6:45 am

  3. I also like to mix up variables and see what works instead of just trusting the software. That said, I’m likely to use stepwise regression as a way of testing categorical variables in a binary logistic regression, and then I go on to test them in other ways. I don’t use stepwise for my final model.

    Comment by Marianne Pelletier — 18 April 2012 @ 10:32 am

  4. What are you predicting? Likelihood to donate? Amount donated? What kind of regression are you using? OLS? Logistic Regression? Just curious 🙂

    Comment by Matthew Dubins — 18 April 2012 @ 6:56 pm

    • My comments apply to any sort of model response, although most often my DV is a continuous variable that is ‘giving’ (lifetime giving, giving by phone, etc – depending on the purpose of the model). I’m using OLS – multiple linear regression. I will also use logistic, but these comments apply mainly to multiple regression. I almost always create both types of models (if I can formulate a binary DV that makes sense), and test them against each other with a validation data set – multiple regression almost always wins.

      Comment by kevinmacdonell — 6 May 2012 @ 6:48 am

  5. I agree that if you’ve got one dataset and that’s it, stepwise isn’t the way to go. Variables have to be chosen based on likelihood to succeed and logic. But if you’ve got multiple datasets and confirmatory processes are possible, I love a good ol’ stepwise. A surprise may just be in store and you might not find it otherwise.

    Comment by LoveStats — 18 April 2012 @ 8:01 pm

    • Maybe stepwise is worth another look. I am almost always working with a single data set, but I’m always interested in other approaches.

      Comment by kevinmacdonell — 6 May 2012 @ 6:52 am

  6. Oh, and I love the title of this post. 🙂

    Comment by LoveStats — 18 April 2012 @ 8:02 pm
