CoolData blog

18 April 2012

Stepwise, model-foolish?

Filed under: Model building, Pitfalls, regression, Software, Statistics. Posted by kevinmacdonell @ 8:00 am

My approach to building predictive models using multiple linear regression might seem plodding to some. I add predictor variables to the regression one by one, instead of using stepwise methods. Even though the number of predictor variables I use has greatly increased, and the time needed to build a model has lengthened, I am even less likely to use stepwise regression today than I was a few years ago.

Stepwise regression, available in most stats software, tosses all the predictor variables into the analysis at once and picks the best for you. It’s a semi-automated process that can work forwards or backwards, adding or deleting variables until it’s satisfied a statistical rule of thumb. The software should give you some control over the process, but mostly your computer is making all the big decisions.
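For readers who have never looked under the hood, here is a minimal sketch of the forward variant in plain Python. It is an illustration under stated assumptions, not what any particular stats package does: the stopping rule (stop when the best remaining variable adds less than 0.01 to R-squared) stands in for whatever rule of thumb your software applies, and the variable names and synthetic data are made up for the demonstration.

```python
import random

def fit_ols(rows, y):
    """Ordinary least squares via the normal equations; intercept comes first."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(rows, y, beta):
    """Share of the variance in y explained by the fitted model."""
    mean_y = sum(y) / len(y)
    pred = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot

def forward_stepwise(data, y, min_gain=0.01):
    """Repeatedly add the variable that raises R-squared the most.

    min_gain is an assumed stopping rule, chosen only for illustration.
    """
    chosen, best_r2 = [], 0.0
    remaining = list(data)
    while remaining:
        scores = {}
        for name in remaining:
            cols = chosen + [name]
            rows = list(zip(*(data[c] for c in cols)))
            scores[name] = r_squared(rows, y, fit_ols(rows, y))
        winner = max(scores, key=scores.get)
        if scores[winner] - best_r2 < min_gain:
            break  # nothing left adds enough; the software stops here
        chosen.append(winner)
        remaining.remove(winner)
        best_r2 = scores[winner]
    return chosen, best_r2

# Synthetic data (hypothetical): two real predictors and one pure-noise column.
rng = random.Random(42)
n = 200
data = {
    "x1": [rng.gauss(0, 1) for _ in range(n)],
    "x2": [rng.gauss(0, 1) for _ in range(n)],
    "junk": [rng.gauss(0, 1) for _ in range(n)],
}
y = [3 * a + 1.5 * b + rng.gauss(0, 1) for a, b in zip(data["x1"], data["x2"])]

chosen, r2 = forward_stepwise(data, y)
print(chosen, round(r2, 3))
```

Notice that nothing in the loop knows anything about what the variables mean. It only sees the fit statistic.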

I understand the allure. We’re all looking for ways to save time, and generally anything that automates a repetitive process is a good thing. Given a hundred variables to choose from, I wouldn’t be surprised if my software was able to get a better-fitting model than I could produce on my own.

But in this case, it’s not for me.

Building a decent model isn’t just about getting a good fit in terms of a high R-squared. That statistic tells you how well the model fits the data it was built on, not data the model hasn’t yet seen, which is where the model does its work (or doesn’t). The true worth of the model is revealed only over time, but you’re more likely to succeed if you’ve applied your knowledge and judgement to variable selection. I tend to add variables one by one in order of their Pearson correlation with the target variable, but I am also aware of groups of variables that are highly correlated with each other and likely to cause problems if they enter the model together. The process is not so repetitive that it can always be automated. Stepwise regression is more apt to select a lot of trivial variables with overlapping effects and ignore a significant predictor that I know will do the job better.
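The bookkeeping behind that habit can be sketched in a few lines of Python. This is only a rough illustration: the predictor names, the toy values, and the 0.8 cut-off for "highly correlated" are all assumptions invented for the example, and the judgement about what to do with a flagged pair still belongs to the modeller.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(ss_a * ss_b)

# Hypothetical toy data: three candidate predictors and a target.
predictors = {
    "years_as_donor": [1, 2, 3, 4, 5, 6, 7, 8],
    "gifts_made":     [1, 3, 2, 5, 4, 6, 8, 7],
    "has_email":      [0, 1, 0, 1, 1, 0, 1, 1],
}
target = [10, 25, 20, 45, 40, 50, 80, 70]

# Order of entry: strongest absolute correlation with the target first.
ranked = sorted(
    predictors,
    key=lambda name: abs(pearson(predictors[name], target)),
    reverse=True,
)

# Pairs of predictors that overlap heavily (assumed cut-off of 0.8).
flagged = [
    (a, b)
    for a, b in combinations(predictors, 2)
    if abs(pearson(predictors[a], predictors[b])) > 0.8
]

print("Order of entry:", ranked)
print("Watch these pairs:", flagged)
```

In this toy data the two strongest predictors are also strongly correlated with each other, which is exactly the situation where I would want to decide for myself which one goes in.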

Or so I suspect. My avoidance of stepwise regression has always been due to a vague antipathy rather than anything based on sound technical concerns. A collection of thoughts I came across recently lent some justification to this undefined feeling: “Problems with stepwise regression.” Some of the authors’ concerns are indeed technical, but the ones that resonated the most for me boiled down to this: automated variable selection divorces the modeller from the process, so that he or she is less likely to learn things about the data. It’s just not as much fun when you’re not making the selections yourself, and you’re not getting a feel for the relationships in your data.

Stepwise regression may hold appeal for beginning modellers, especially those looking for push-button results. I can’t deny that software for predictive analysis is getting better and better at automating some of the most tedious aspects of model-building, particularly in preparing and cleaning the data. But for any modeller, especially one working with unfamiliar data, nothing beats adding and removing variables one at a time, by hand.


28 September 2010

RapidMiner a powerful, low-budget option for the data savvy

Filed under: Model building, Software. Posted by kevinmacdonell @ 7:43 am

Guest Post by Jason Boley, Associate Director of Prospect Management and Tracking, Purdue University

(This is Jason Boley’s second guest post for CoolData. His first, Exploring your database relationships with NodeXL, remains one of the most popular posts on the site. Reviewing software is a departure for CoolData, but I think there’s a big interest in open-source and free software among people interested in learning about data mining. I’m also personally interested in approaches to modeling that I am not familiar with, such as creating decision trees. Looking forward to seeing more offerings in this vein. — Kevin MacDonell.)

Everyone wants to use their resources wisely. Never has this been more important in the not-for-profit sector than now as budgets are limited (or even shrinking). With that in mind, I began searching for a potential shareware solution to help me create decision trees. That’s when I discovered RapidMiner, which is an open source data mining solution developed by the University of Dortmund in Germany.

The community version of RapidMiner is free and is hosted by Sourceforge. However, there is a company, Rapid-I, that provides professional services based around the open source software program. Rapid-I exists to provide professional support and services for those who need it, but the base software is free. There are also a number of free tutorials and training videos available on the Internet.

One of the strengths of RapidMiner is its graphical interface, which is fairly intuitive. RapidMiner executes a series of processes which are defined by dragging and dropping operator boxes to form a string of commands. Take the following example, a process built from four operators.

In this example, I have created a simple decision tree. In the first operator (Read CSV), I read my CSV file into RapidMiner by defining the directory and data fields. The second operator (Select Attributes) allows me to define my predictive variable. In the third operator (Sample), I have instructed RapidMiner to use only a 2% random sample of my original CSV file in the model. The final operator (Decision Tree) defines the type of model to execute.
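For readers more comfortable with code than with operator boxes, the same chain can be roughly sketched in plain Python. This is not RapidMiner's API or its tree algorithm. Every function name, column name, and data value below is hypothetical, the tree is reduced to a single split (a "decision stump"), and the sample fraction is 50% rather than 2% only because the toy file has twelve rows.

```python
import csv
import io
import random

# Hypothetical stand-in for the CSV file on disk.
CSV_TEXT = """id,age,events_attended,donor
1,34,0,no
2,51,1,no
3,29,2,no
4,62,3,no
5,45,4,no
6,38,5,no
7,57,6,yes
8,41,7,yes
9,66,8,yes
10,33,9,yes
11,49,10,yes
12,70,11,yes
"""

def read_csv(text):
    """Step 1 (Read CSV): parse the file into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def select_attributes(rows, features, label):
    """Step 2 (Select Attributes): keep the chosen inputs and the label."""
    return [({f: float(r[f]) for f in features}, r[label]) for r in rows]

def sample(examples, fraction, seed=0):
    """Step 3 (Sample): draw a random subset without replacement."""
    rng = random.Random(seed)
    k = max(1, round(len(examples) * fraction))
    return rng.sample(examples, k)

def majority(labels):
    """Most common label; sorting first makes ties deterministic."""
    return max(sorted(set(labels)), key=labels.count)

def fit_stump(examples):
    """Step 4 (Decision Tree, simplified): best single threshold split."""
    best = None
    for feature in examples[0][0]:
        for thr in sorted({x[feature] for x, _ in examples}):
            left = [lab for x, lab in examples if x[feature] <= thr]
            right = [lab for x, lab in examples if x[feature] > thr]
            if not left or not right:
                continue  # a split must put rows on both sides
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best["errors"]:
                best = {"errors": errors, "feature": feature,
                        "threshold": thr, "left": l_lab, "right": r_lab}
    return best

rows = read_csv(CSV_TEXT)
examples = select_attributes(rows, ["age", "events_attended"], "donor")
sampled = sample(examples, fraction=0.5, seed=1)
stump = fit_stump(sampled)
print(stump)
```

Each function corresponds to one operator box, and the four lines at the bottom play the role of the connections drawn between them.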

Along the way, you can see the output of each operator, should you choose. For instance, I can look at the metadata view of my raw data (note the additional option of viewing the raw data in ‘Data View’ mode).


As someone who is not a trained statistician, I can appreciate the effort by RapidMiner to make the end-user view as simple as possible. I was able within a few hours to find my way around relatively easily, import data, perform basic data transformations, and create a basic model.

RapidMiner provides an extensive number of algorithms to choose from. Under the ‘Modeling’ tab you have 113 choices, from trees to clustering, correlation to regression. These choices assume a level of knowledge that might scare off people new to data mining. While documentation exists, it mostly focuses on how to use the software. Don’t expect to come to RapidMiner and learn about statistics – it assumes you know your stuff already.

However, at the end of the day RapidMiner provides a powerful alternative for institutions that have access to individuals versed in statistical analysis but have limited software budgets. I plan to explore some of the more advanced offerings of the software, like text mining, in the near future.

You can find out more about RapidMiner and download the software at the RapidMiner project page. Additionally, a number of free videos and tutorials can be found at the RapidMiner Resources page.

(Jason Boley is the Associate Director of Prospect Management and Tracking at Purdue where he manages prospect information and focuses on data analytics. Jason has over ten years of experience working with fundraising databases and has presented nationally on the topics of database reporting and prospect management.)
