CoolData blog

26 April 2012

For agile data mining, start with the basics

Filed under: Analytics, Pitfalls, Training / Professional Development | kevinmacdonell @ 8:56 am

Lately I’ve been telling people that one of the big hurdles to implementing predictive analytics in higher education advancement is the “project mentality.” We too often think of each data mining initiative as a project, something with a beginning and end. We’d be far better off to think in terms of “process” — something iterative, always improving, and never-ending. We also need to think of it as a process with a fairly tight cycle: Deploy it, let it work for a bit, then quickly evaluate, and tweak, or scrap it completely and start over. The whole cycle works over the course of weeks, not months or years.

Here’s how it sometimes goes wrong, in five steps:

  1. Someone has the bright idea to launch a “major donor predictive modelling project.” Fantastic! A committee is struck. They put their heads together and agree on a list of variables that they believe are most likely to be predictive of major giving.
  2. They submit a request to their information management people, or whoever toils in extracting stuff from the database. Emails and phone calls fly back and forth over what EXACTLY THE HECK the data mining team is looking for.
  3. Finally, a massive Excel file is delivered, a thing the likes of which would never exist in nature — like the unstable, man-made elements on the nether fringes of the Periodic Table. More meetings are held to come to agreement about what to do about multiple duplicate rows in the data, and what to do about empty cells. The committee thinks maybe the IT people need to fix the file. Ummm — no!
  4. Half of the data mining team then spends considerable time in pursuit of a data file that gleams in its cleanliness and perfection. The other half is no longer sure what the goal of the project was.
  5. Somehow, a model is created and the records are scored by the one team member left standing. Unfortunately, a year has passed and the person for whom the model was built has left for a new job in California. Her replacement refers to the model as “astrology.”

Allow me a few observations that follow from these five stages:

  1. Successful models are rarely produced by committee, and variables cannot be pre-selected by popular agreement and intuition — although certainly experience is a valuable source of clues.
  2. Submitting requests to someone else for data, having to define exactly what it is you want, and then waiting for the request to be fulfilled — all of that is DEATH to creative data exploration.
  3. A massive, one-time, all-or-nothing data suction job is probably not the ideal starting point. Neither is handling an Excel file with 200,000 rows and a hundred columns.
  4. Perfect data is not a realistic goal, and is not a prerequisite for fruitful data mining.
  5. A year is too long. The cycle has to be much, much tighter than that.

And finally, here are some concrete steps, based on the observations, again point-for-point:

  1. If you’re interested in data mining, try going it alone. Ask for help when you need it, but you’ll make faster progress if you explore on your own or in a team of no more than two or three like-minded people. Don’t tell anyone you’re launching a “project,” and don’t promise deliverables unless you know what you’re doing.
  2. Learn how to build simple queries to pull data from your database. Get IT to set you up. Figure out how to pull a file of IDs along with the sum of all their hard-credit giving. Then pull that AND something else, anything else: email address, class year, marital status, whatever. Practice until you are comfortable with how your data is stored and how to limit it to what you want.
  3. Look into stats software, and learn some of the most common stats terms. Read up on correlation in particular. Build larger files for analysis in the stats software rather than in Excel. Read, read, read. Play, play, play.
  4. Think in terms of pattern detection, and don’t get hung up on the validity of individual data points.
  5. If you’ve done steps 1 to 4, you have the foundations in place for being an agile data miner.
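
For anyone who wants to see what steps 2 and 3 might look like in practice, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the table names (constituent, gift), the column names, the 'hard' credit flag, and the handful of made-up rows stand in for whatever your own database actually holds. It pulls each ID with the sum of hard-credit giving plus one extra field (whether an email address is on file), then computes a plain Pearson correlation between the two.

```python
import sqlite3
from math import sqrt

# Hypothetical schema and made-up rows, purely for illustration.
# Your real advancement database will look different.
def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE constituent (id INTEGER PRIMARY KEY, class_year INTEGER, email TEXT);
        CREATE TABLE gift (constituent_id INTEGER, amount REAL, credit_type TEXT);
    """)
    conn.executemany("INSERT INTO constituent VALUES (?, ?, ?)", [
        (1, 1985, "a@example.org"),
        (2, 1992, None),
        (3, 2001, "c@example.org"),
        (4, 2008, None),
        (5, 1979, "e@example.org"),
    ])
    conn.executemany("INSERT INTO gift VALUES (?, ?, ?)", [
        (1, 500.0, "hard"), (1, 250.0, "hard"), (1, 100.0, "soft"),
        (3, 50.0, "hard"),
        (5, 1000.0, "hard"), (5, 200.0, "hard"),
    ])
    return conn

# One row per ID: total hard-credit giving (0 for non-donors)
# and a 0/1 flag for "has an email address on file".
QUERY = """
SELECT c.id,
       COALESCE(SUM(g.amount), 0) AS total_hard_credit,
       CASE WHEN c.email IS NOT NULL THEN 1 ELSE 0 END AS has_email
FROM constituent AS c
LEFT JOIN gift AS g
       ON g.constituent_id = c.id AND g.credit_type = 'hard'
GROUP BY c.id, c.email
ORDER BY c.id
"""

def pull_giving(conn):
    return conn.execute(QUERY).fetchall()

def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

if __name__ == "__main__":
    conn = build_demo_db()
    rows = pull_giving(conn)
    for row in rows:
        print(row)
    r = pearson([row[2] for row in rows], [row[1] for row in rows])
    print(f"correlation(has_email, total_hard_credit) = {r:.3f}")
```

The LEFT JOIN is there so that non-donors stay in the file with a giving total of zero; dropping them would hide exactly the contrast you are trying to detect. With five invented rows the correlation means nothing, of course. The point is only the shape of the loop: pull, look, add another field, look again.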

Mind you, it could take considerable time (months, maybe even years) to get really comfortable with the basics, especially if data mining is a sideline to your “real” job. But success and agility do depend on being able to work independently, being able to snag data on a whim, being able to understand a bit of what is going on in your software, having the freedom to play and explore, and losing notions about data that come from the business analysis and reporting side. In other words, the basics.
