Guest Post by Jason Boley, Associate Director of Prospect Management and Tracking, Purdue University
(This is Jason Boley’s second guest post for CoolData. His first, Exploring your database relationships with NodeXL, remains one of the most popular posts on the site. Reviewing software is a departure for CoolData, but I think there’s a big interest in open-source and free software among people interested in learning about data mining. I’m also personally interested in approaches to modeling that I am not familiar with, such as creating decision trees. Looking forward to seeing more offerings in this vein. — Kevin MacDonell.)
Everyone wants to use their resources wisely, and this has never been more important in the not-for-profit sector than now, when budgets are limited (or even shrinking). With that in mind, I began searching for a free software solution to help me create decision trees. That’s when I discovered RapidMiner, an open-source data mining package originally developed at the Technical University of Dortmund in Germany.
The community version of RapidMiner is free and is hosted on SourceForge. A company, Rapid-I, offers professional support and services built around the open-source program for those who need them, but the base software itself is free. There are also a number of free tutorials and training videos available on the Internet.
One of the strengths of RapidMiner is its graphical interface, which is fairly intuitive. You build a process by dragging and dropping operator boxes and connecting them into a chain, which RapidMiner then executes in order. Take the following example:
In this example, I have created a simple decision tree. In the first operator (Read CSV), I read my CSV file into RapidMiner by defining the directory and data fields. The second operator (Select Attributes) allows me to define my predictive variable. In the third operator (Sample), I have instructed RapidMiner to use only a 2% random sample of my original CSV file in the model. The last operator (Decision Tree) defines the type of model to execute.
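For readers who think more easily in code, the four-operator chain described above can be sketched roughly in plain Python. This is only an illustration of the same sequence of steps (read, select columns, sample 2%, fit a tree); it is not RapidMiner code and does not reflect how RapidMiner works internally. The column names (`attended_event`, `is_alum`, `donor`), the synthetic data, and all of the function names are assumptions invented for the example, and the tree is a deliberately minimal one that splits categorical columns by Gini impurity.

```python
import csv
import io
import random
from collections import Counter


def read_csv(text):
    """Step 1 (Read CSV): parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def select_attributes(rows, keep):
    """Step 2 (Select Attributes): keep only the named columns."""
    return [{name: row[name] for name in keep} for row in rows]


def sample(rows, fraction, seed=0):
    """Step 3 (Sample): draw a random fraction of rows without replacement."""
    size = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, size)


def gini(rows, label):
    """Gini impurity of the label column (0.0 means every row agrees)."""
    counts = Counter(row[label] for row in rows)
    return 1.0 - sum((n / len(rows)) ** 2 for n in counts.values())


def build_tree(rows, label, features):
    """Step 4 (Decision Tree): recursively split on the categorical feature
    with the lowest weighted Gini impurity."""
    majority = Counter(row[label] for row in rows).most_common(1)[0][0]
    if gini(rows, label) == 0.0 or not features:
        return majority  # leaf: predict the most common label

    def split_cost(feature):
        groups = Counter(row[feature] for row in rows)
        return sum(
            (n / len(rows)) * gini([r for r in rows if r[feature] == value], label)
            for value, n in groups.items()
        )

    best = min(features, key=split_cost)
    if split_cost(best) >= gini(rows, label):
        return majority  # no available split improves purity
    rest = [f for f in features if f != best]
    branches = {
        value: build_tree([r for r in rows if r[best] == value], label, rest)
        for value in {row[best] for row in rows}
    }
    return {"feature": best, "branches": branches, "default": majority}


def predict(tree, row):
    """Follow branches until a leaf (a plain label string) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["feature"]], tree["default"])
    return tree


def demo_csv(n=1000, seed=42):
    """Synthetic stand-in for a real export; all columns are invented."""
    rng = random.Random(seed)
    lines = ["id,attended_event,is_alum,donor"]
    for i in range(n):
        attended = rng.choice(["yes", "no"])
        alum = rng.choice(["yes", "no"])
        # In this toy data, 'donor' simply mirrors 'attended_event'.
        lines.append(f"{i},{attended},{alum},{attended}")
    return "\n".join(lines)


# Chain the four steps in the same order as the operators in the process.
rows = read_csv(demo_csv())
rows = select_attributes(rows, ["attended_event", "is_alum", "donor"])
subset = sample(rows, 0.02)
tree = build_tree(subset, "donor", ["attended_event", "is_alum"])
print(tree)
```

Each function hands its result to the next, which is the same idea as wiring one operator's output port to the next operator's input. A real analysis would read an actual file and handle numeric columns, pruning, and validation, none of which this sketch attempts.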
Along the way, you can see the output of each operator, should you choose. For instance, here I’m looking at the metadata view of my raw data (note the additional option of viewing the raw data in ‘Data View’ mode):
As someone who is not a trained statistician, I appreciate RapidMiner’s effort to make the end-user view as simple as possible. Within a few hours I was able to find my way around, import data, perform basic data transformations, and create a basic model.
RapidMiner provides a large number of algorithms to choose from. Under the ‘Modeling’ tab you have 113 choices, from trees to clustering, correlation to regression. These choices assume a level of knowledge that might scare off people new to data mining. While documentation exists, it mostly focuses on how to use the software. Don’t expect to come to RapidMiner and learn about statistics: it assumes you know your stuff already.
Even so, RapidMiner provides a powerful alternative for institutions that have access to individuals versed in statistical analysis but have limited software budgets. I plan to explore some of the more advanced offerings of the software, like text mining, in the near future.
You can find out more about RapidMiner and download the software at the RapidMiner project page: http://rapid-i.com/content/view/181/190/lang,en/ Additionally, a number of free videos and tutorials can be found at the RapidMiner Resources page: http://rapidminerresources.com/
(Jason Boley is the Associate Director of Prospect Management and Tracking at Purdue where he manages prospect information and focuses on data analytics. Jason has over ten years of experience working with fundraising databases and has presented nationally on the topics of database reporting and prospect management.)