CoolData blog

26 April 2012

For agile data mining, start with the basics

Filed under: Analytics, Pitfalls, Training / Professional Development. By kevinmacdonell @ 8:56 am

Lately I’ve been telling people that one of the big hurdles to implementing predictive analytics in higher education advancement is the “project mentality.” We too often think of each data mining initiative as a project, something with a beginning and end. We’d be far better off to think in terms of “process” — something iterative, always improving, and never-ending. We also need to think of it as a process with a fairly tight cycle: Deploy it, let it work for a bit, then quickly evaluate, and tweak, or scrap it completely and start over. The whole cycle works over the course of weeks, not months or years.

Here’s how it sometimes goes wrong, in five steps:

  1. Someone has the bright idea to launch a “major donor predictive modelling project.” Fantastic! A committee is struck. They put their heads together and agree on a list of variables that they believe are most likely to be predictive of major giving.
  2. They submit a request to their information management people, or whoever toils in extracting stuff from the database. Emails and phone calls fly back and forth over what EXACTLY THE HECK the data mining team is looking for.
  3. Finally, a massive Excel file is delivered, a thing the likes of which would never exist in nature — like the unstable, man-made elements on the nether fringes of the Periodic Table. More meetings are held to come to agreement about what to do about multiple duplicate rows in the data, and what to do about empty cells. The committee thinks maybe the IT people need to fix the file. Ummm — no!
  4. Half of the data mining team then spends considerable time in pursuit of a data file that gleams in its cleanliness and perfection. The other half is no longer sure what the goal of the project was.
  5. Somehow, a model is created and the records are scored by the one team member left standing. Unfortunately, a year has passed and the person for whom the model was built has left for a new job in California. Her replacement refers to the model as “astrology.”

Allow me a few observations that follow from these five stages:

  1. Successful models are rarely produced by committee, and variables cannot be pre-selected by popular agreement and intuition — although certainly experience is a valuable source of clues.
  2. Submitting requests to someone else for data, having to define exactly what it is you want, and then waiting for the request to be fulfilled — all of that is DEATH to creative data exploration.
  3. A massive, one-time, all-or-nothing data suction job is probably not the ideal starting point. Neither is handling an Excel file with 200,000 rows and a hundred columns.
  4. Perfect data is not a realistic goal, and is not a prerequisite for fruitful data mining.
  5. A year is too long. The cycle has to be much, much tighter than that.

And finally, here are some concrete steps, based on the observations, again point-for-point:

  1. If you’re interested in data mining, try going it alone. Ask for help when you need it, but you’ll make faster progress if you explore on your own or in a team of no more than two or three like-minded people. Don’t tell anyone you’re launching a “project,” and don’t promise deliverables unless you know what you’re doing.
  2. Learn how to build simple queries to pull data from your database. Get IT to set you up. Figure out how to pull a file of IDs along with sum of all their hard-credit giving. Then, pull that AND something else — anything else. Email address, class year, marital status, whatever. Practice, get comfortable with how your data is stored and how to limit it to what you want.
  3. Look into stats software, and learn some of the most common stats terms. Read up on correlation in particular. Build larger files for analysis in the stats software rather than in Excel. Read, read, read. Play, play, play.
  4. Think in terms of pattern detection, and don’t get hung up on the validity of individual data points.
  5. If you’ve done steps 1 to 4, you have the foundations in place for being an agile data miner.
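Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the table names (`constituent`, `gift`), the column names, the `credit_type` flag, and the five made-up records are all hypothetical, and your own database will be laid out differently. It pulls each ID with the sum of its hard-credit giving plus "something else" (class year and an email flag), then checks how one of those extras correlates with giving.

```python
import sqlite3
from math import sqrt


def build_demo_db():
    """Create an in-memory database with two hypothetical tables and toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE constituent (id INTEGER PRIMARY KEY, class_year INTEGER, email TEXT);
        CREATE TABLE gift (id INTEGER, amount REAL, credit_type TEXT);
    """)
    conn.executemany("INSERT INTO constituent VALUES (?, ?, ?)", [
        (1, 1975, "a@example.org"),
        (2, 1988, None),
        (3, 1992, "c@example.org"),
        (4, 2004, None),
        (5, 1969, "e@example.org"),
    ])
    conn.executemany("INSERT INTO gift VALUES (?, ?, ?)", [
        (1, 500.0, "hard"), (1, 250.0, "hard"), (1, 100.0, "soft"),
        (3, 50.0, "hard"),
        (5, 1000.0, "hard"), (5, 2000.0, "hard"),
    ])
    return conn


# One row per ID: lifetime hard-credit giving AND something else.
# The LEFT JOIN keeps non-donors in the file (with a total of 0),
# which matters because a model needs non-donors to compare against.
QUERY = """
    SELECT c.id,
           COALESCE(SUM(g.amount), 0) AS lifetime_hard_credit,
           c.class_year,
           CASE WHEN c.email IS NULL THEN 0 ELSE 1 END AS has_email
    FROM constituent AS c
    LEFT JOIN gift AS g
           ON g.id = c.id AND g.credit_type = 'hard'
    GROUP BY c.id, c.class_year, c.email
    ORDER BY c.id
"""


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rows = build_demo_db().execute(QUERY).fetchall()
giving = [row[1] for row in rows]
has_email = [row[3] for row in rows]
r_email = pearson(has_email, giving)

for row in rows:
    print(row)
print("correlation of has_email with lifetime giving:", round(r_email, 3))
```

With five invented records the coefficient itself means nothing; the point is the habit. Pull the file yourself, add one candidate variable, look at the pattern, and repeat with the next variable.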

Mind you, it could take considerable time (months, maybe even years) to get really comfortable with the basics, especially if data mining is a sideline to your “real” job. But success and agility do depend on being able to work independently, being able to snag data on a whim, being able to understand a bit of what is going on in your software, having the freedom to play and explore, and losing notions about data that come from the business analysis and reporting side. In other words, the basics.



  1. Compiling data outside of Excel is hugely important. When you have an especially big data set (> 300,000 rows and hundreds of variables), even manipulating data in Excel is bothersome. Use database software as much as you can, and use R for the rest 🙂

    Comment by inkhorn82 — 27 April 2012 @ 12:27 pm

  2. “4. Perfect data is not a realistic goal, and is not a prerequisite for fruitful data mining.”

    I think this is the most important of your observations. You are correct that gift officers are often lone wolves in their visions of which variables should shape their “perfect” list. That said, it was my experience in my previous job that a small group of gift officers and researchers was able to come to rough consensus on 23 variables in our database that created a basic affinity score (from 1 to 6). It wasn’t “perfect”, but linking that score with our capacity rating then gave each gift officer an excellent starting point for pursuing their own “perfect” lists. For example, I could expand beyond the score by looking for alumni over age 50 who live in NY with an executive job title who have an affinity score above 4. Or maybe I’m a planned giving officer looking for alumni over the age of 65, who have made gifts in 7 of 10 years and have an affinity score above 5.

    It is my opinion that larger institutions would benefit from the creation of a “good” baseline score like this that is determined by a thoughtful conversation between their most creative gift officers and their best researchers/programmers. The other side of that equation is that most gift officers, like most human beings, are not effective analysts of data… their preferences change quickly, they draw incorrect conclusions and they do not effectively utilize the data that is presented to them. So the better the baseline score you can provide, the more likely you are to minimize the errors that they will make and the time that they will waste examining piles and piles of Excel lists.

    Comment by Steve — 3 May 2012 @ 5:46 pm

    • Steve – A definite advantage of involving gift officers in the variable selection process is that they will feel that they “own” the model. They understand what went into it, and they trust it, and therefore they will use it. Having a committee produce a model that is actually used is superior to having a model that no one uses. However, did anyone ever go back to evaluate whether those 23 committee-chosen variables were actually valid correlates of giving? It still seems that the process was primarily driven by intuition and opinion … I think fundraiser experience can provide valid ideas for things to test, but final weighting and scoring should be a purely data-driven exercise.

      Comment by kevinmacdonell — 6 May 2012 @ 6:34 am

      • Kevin – absolutely the affinity model was a data-driven conversation between development and data analysts. We repeatedly tested our assumptions and looked at how the model scored cases that we knew were proven good or bad prospects. There was ultimately a good fit in the results and we had a large pool of prospects that were underweighted in the attention that they were receiving.

        I think that there is also a distinction between large institutions (like my former employer) that have hundreds of thousands of alumni and institutions that have smaller pools of alumni or are just beginning their development efforts. For the former, the model isn’t so much about predicting new classes/types of donors, but rather about trying to reduce the background noise and finding the best prospects among people who are already giving, or are engaged.

        Comment by Steve — 7 May 2012 @ 9:52 am

  3. Many great, insightful nuggets in this post! I think a lot of the talk about “how to build an in-house modeling program” is missing this perspective. There’s so much focus on techniques and the hands-on component, and we don’t spend as much time as we should talking about organizational (and political?) pitfalls and how to deal with them. Thanks for taking the conversation in that direction.

    This post is a must-read for anyone trying to do more modeling/mining in a small shop!

    Comment by Mark Egge — 10 May 2012 @ 10:30 am
