CoolData blog

28 September 2010

RapidMiner: a powerful, low-budget option for the data savvy

Filed under: Model building, Software — kevinmacdonell @ 7:43 am

Guest Post by Jason Boley, Associate Director of Prospect Management and Tracking, Purdue University

(This is Jason Boley’s second guest post for CoolData. His first, Exploring your database relationships with NodeXL, remains one of the most popular posts on the site. Reviewing software is a departure for CoolData, but I think there’s a big interest in open-source and free software among people interested in learning about data mining. I’m also personally interested in approaches to modeling that I am not familiar with, such as creating decision trees. Looking forward to seeing more offerings in this vein. — Kevin MacDonell.)

Everyone wants to use their resources wisely. Never has this been more important in the not-for-profit sector than now, as budgets are limited (or even shrinking). With that in mind, I began searching for a free or open-source solution to help me create decision trees. That’s when I discovered RapidMiner, an open source data mining solution originally developed at the University of Dortmund in Germany.

The community version of RapidMiner is free and is hosted on SourceForge. There is also a company, Rapid-I, that provides professional support and services around the open source program for those who need them, but the base software is free. A number of free tutorials and training videos are available on the Internet as well.

One of the strengths of RapidMiner is its graphical interface, which is fairly intuitive. RapidMiner executes a series of processes defined by dragging and dropping operator boxes to form a chain of commands. Take the following example:

In this example, I have created a simple decision tree. In the first operator (Read CSV), I read my CSV file into RapidMiner by defining the directory and data fields. The second operator (Select Attributes) allows me to define my predictive variable. In the third operator (Sample), I have instructed RapidMiner to use only a 2% random sample of my original CSV file in the model. The final operator (Decision Tree) defines the type of model to build.
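For readers who are more comfortable in code than on a drag-and-drop canvas, roughly the same four steps can be sketched in Python with pandas and scikit-learn. This is only an analogy to the GUI process described above, not RapidMiner’s own syntax, and the file name and label column below are placeholders rather than anything from Jason’s data.

    # A rough Python analogue of the four-operator RapidMiner process above.
    # "giving_history.csv" and "made_gift" are placeholder names, not from the post.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Read CSV: pull the raw file into a data frame
    data = pd.read_csv("giving_history.csv")

    # Select Attributes: identify the column to be predicted
    label = "made_gift"

    # Sample: keep only a 2% random sample of the rows
    sample = data.sample(frac=0.02, random_state=1)

    # Decision Tree: fit the model (this sketch assumes the predictors are numeric)
    X = sample.drop(columns=[label])
    y = sample[label]
    model = DecisionTreeClassifier(max_depth=4).fit(X, y)

In RapidMiner itself, each of those commented steps is simply one of the four operator boxes wired together on the canvas.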

Along the way, you can see the output of each operator, should you choose. For instance, here I’m looking at the metadata view of my raw data (note the additional option of viewing the raw data in ‘Data View’ mode):

(Screenshot: the metadata view of the imported data.)
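For what it’s worth, the closest code-side analogues to RapidMiner’s metadata and data views are the usual pandas summary methods. Again, this is just an analogy, reusing the placeholder file name from the sketch above.

    import pandas as pd

    data = pd.read_csv("giving_history.csv")  # placeholder file name, as above

    # Roughly the metadata view: column names, types, and non-null counts
    data.info()

    # Roughly the 'Data View': a peek at the raw rows themselves
    print(data.head(10))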

As someone who is not a trained statistician, I can appreciate RapidMiner’s effort to keep the end-user view as simple as possible. Within a few hours I was able to find my way around, import data, perform basic data transformations, and create a basic model.

RapidMiner provides an extensive selection of algorithms to choose from. Under the ‘Modeling’ tab you have 113 choices, from trees to clustering, correlation to regression. These choices assume a level of knowledge that might scare off people new to data mining. Documentation exists, but it mostly focuses on how to use the software. Don’t expect to come to RapidMiner and learn about statistics; it assumes you know your stuff already.

However, at the end of the day RapidMiner provides a powerful alternative for institutions that have access to individuals versed in statistical analysis but have limited software budgets. I plan to explore some of the more advanced offerings of the software, like text mining, in the near future.

You can find out more about RapidMiner and download the software at the RapidMiner project page (http://rapid-i.com/content/view/181/190/lang,en/). Additionally, a number of free videos and tutorials can be found at the RapidMiner Resources page (http://rapidminerresources.com/).

(Jason Boley is the Associate Director of Prospect Management and Tracking at Purdue, where he manages prospect information and focuses on data analytics. Jason has over ten years of experience working with fundraising databases and has presented nationally on the topics of database reporting and prospect management.)

1 Comment »

  1. While the first version of RapidMiner, under the name YALE (Yet Another Learning Environment), was developed at the Artificial Intelligence and Data Mining Unit of the University of Dortmund in Germany, the later versions have all been developed by Rapid-I, the company behind the RapidMiner open source project, which was founded by the creators of YALE and RapidMiner:
    http://www.rapid-i.com/

    Today, RapidMiner is the most widely used data mining solution among data mining experts:
    http://www.kdnuggets.com/polls/2010/data-mining-analytics-tools.html

    RapidMiner provides more than 600 different modules for all kinds of data mining, text mining, web mining, predictive analytics, time series analysis and forecasting, as well as ETL, data integration, and reporting tasks. RapidMiner also integrates other data mining, machine learning, and statistics solutions such as R and Weka.

    Comment by Frank Xavier — 28 September 2010 @ 7:09 pm

