CoolData blog

1 April 2015

Mind the data science gap

Filed under: Training / Professional Development. Posted by kevinmacdonell @ 8:10 pm

 

Being a forward-thinking lot, the data-obsessed among us are always pondering the best next step to take in professional development. There are more options every day, from a Data Science track on Coursera to new master’s degree programs in predictive analytics. I hear a lot of talk about acquiring skills in R, machine learning, and advanced modelling techniques.

 

All to the good, in general. What university or large non-profit wouldn’t benefit from having a highly trained, triple-threat chameleon with statistics, programming, and data analytics skills? I think it’s great that people are investing serious time and brain cells pursuing their passion for data analysis.

 

And yet, one has to wonder, are these advanced courses and tools helping drive bottom-line results across the sector? Are they helping people at nonprofits and university advancement offices do a better job of analyzing their data toward some useful end?

 

I have a few doubts. The institutions and causes that employ these enterprising learners may be fortunate to have them, but I would worry about retention. Wouldn’t these rock stars eventually feel constrained in the nonprofit or higher ed world? It’s a great place to apply one’s creativity, but aren’t the problems and applications one can address with data in our field relatively straightforward in comparison with other fields? (Tailoring medical treatment to an individual’s DNA, preventing terrorism or bank fraud, getting an American president elected?) And then there’s the pay.

 

Maybe I’m wrong to think so. Clearly there are talented people working in our sector who are here because they have found the perfect combination of passions. They want to be here.

 

Anyway — rock star retention is not my biggest concern.

 

I’m more concerned about the rest of us: people who want to make better use of data, but aren’t planning to learn way more than we need or are capable of. I’m concerned for a couple of reasons.

 

First, many of the professional development options available are pitched at a level too advanced to be practical for organizations that haven’t hired a full-time predictive analytics specialist. The majority of professionals working in the non-profit and higher-ed sectors are mainly interested in getting better at their jobs, whether that’s increasing dollars raised or boosting engagement among their communities. They don’t need to learn to code. They do need some basic, solid training options. I’m not sure these are easy to spot among all the competing offerings and (let’s be honest) the Big Data hype.

 

These people need support and appropriate training. There’s a place for scripting and machine learning, but let’s ensure we are already up to speed on means/medians, bar charts, basic scoring, correlation, and regression. Sexy? No. But useful, powerful, necessary. Relatively simple and manual techniques that are accessible to a range of advancement professionals (not just the highly technical) offer a high return on investment. It would be a shame if the majority were cowed into thinking that data analysis isn’t for them just because they don’t see what neural networks have to do with their day-to-day work.
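For anyone curious how little machinery these basics require, here is a minimal sketch in Python using only the standard library. The numbers are invented for illustration (eight made-up constituents, with hypothetical `giving` and `events` lists), and the `pearson` helper is written out by hand from the textbook definition rather than taken from any particular package.

```python
from statistics import mean, median

# Hypothetical toy data: lifetime giving (dollars) and events attended
# for eight made-up constituents. None of this comes from a real database.
giving = [0, 0, 25, 50, 100, 250, 1000, 5000]
events = [0, 1, 0, 2, 1, 3, 4, 6]


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


avg_gift = mean(giving)    # pulled upward by the one large donor
mid_gift = median(giving)  # closer to the "typical" constituent
r = pearson(giving, events)

print(f"mean giving:   {avg_gift:.2f}")
print(f"median giving: {mid_gift:.2f}")
print(f"correlation of giving with events attended: {r:.2f}")
```

In skewed data like this, the gap between the mean and the median is itself worth noticing: a single large donor drags the mean far above what most constituents have given. That is the kind of thing a person learns by looking at the data rather than by running it through a black box.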

 

My second concern is that some of the advanced tools of data science are deceptively easy to use. I read an article recently that stated that when it’s done really well, data science looks easy. That’s a problem. A machine-learning algorithm will spit out answers, but are they worth anything? (Maybe.) Does an analyst learn anything about their data by tweaking the knobs on a black box? (Probably not.) Is skipping over the inconvenience of manual data exploration detrimental to gaining valuable insights? (Yes!)

 

Don’t get me wrong — I think R, Python, and other tools are extremely useful for predictive modelling, although not for doing the modelling itself (not in my hands, at least). I use SQL and Python to automate the assembly of large data files to feed into Data Desk — it’s so nice to push a button and have the script merge together data from the database, from our phonathon database, from our broadcast email platform and other sources, as well as automatically create certain indicator variables, pivoting all kinds of categorical variables and handling missing data elegantly. Preparing this file using more manual methods would take days.
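To make the idea of an assembly script concrete, here is a small sketch of the general pattern: merge several sources on a constituent ID, flag missing values, create an indicator variable, and pivot a categorical field into 0/1 columns. It is not the script described above. The three source dictionaries, the field names, and the `assemble` function are all hypothetical stand-ins, and treating "no record" as zero activity while flagging a missing class year is just one assumed way of handling gaps.

```python
# Hypothetical extracts, keyed by constituent ID. In practice these would
# come from SQL queries and exports; all names and values are made up.
database = {
    101: {"class_year": 1998, "marital_status": "Married"},
    102: {"class_year": 2005, "marital_status": None},
    103: {"class_year": None, "marital_status": "Single"},
}
phonathon = {101: {"calls_answered": 3}, 103: {"calls_answered": 0}}
email = {101: {"opens": 12}, 102: {"opens": 4}}


def assemble(database, phonathon, email):
    """Merge the sources into one flat row per constituent."""
    # Collect the categories present so each one gets its own 0/1 column.
    statuses = sorted(
        {r["marital_status"] for r in database.values()
         if r["marital_status"] is not None}
    )
    rows = []
    for cid, rec in sorted(database.items()):
        row = {"id": cid, "class_year": rec["class_year"]}
        # Missing data (assumed approach): keep a flag so the gap is visible.
        row["class_year_missing"] = int(rec["class_year"] is None)
        # Left-join style merge: no record in a source counts as zero activity.
        row["calls_answered"] = phonathon.get(cid, {}).get("calls_answered", 0)
        row["email_opens"] = email.get(cid, {}).get("opens", 0)
        # Indicator variable derived from a count.
        row["answered_phone"] = int(row["calls_answered"] > 0)
        # Pivot the categorical field into one 0/1 column per category.
        for status in statuses:
            row[f"marital_{status.lower()}"] = int(rec["marital_status"] == status)
        rows.append(row)
    return rows


rows = assemble(database, phonathon, email)
for row in rows:
    print(row)
```

The output is one flat row per constituent, which is the shape a stats package expects for pasting or importing. Everything interesting still happens after this step.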

 

But this doesn’t automate exploration of the data, it doesn’t remove the need to be careful about preparing data to answer the business question, and it does absolutely nothing to help define that business question. Rather than let a script grind unsupervised through the data to spit out a result seconds later without any subject-matter expertise being applied, the real work of building a model is still done manually, in Data Desk, and right now I doubt there is a better way.

 

When it comes to professional development, then, all I can say is, “to each their own.” There is no one best route. The important thing is to ensure that motivated professionals are matched to training that is a good fit with their aptitudes and with the real needs of the organization.

 


6 May 2011

Wanted: More ways to learn predictive modeling

Filed under: Peter Wylie, Training / Professional Development. Posted by kevinmacdonell @ 5:11 am

I remember the first time I opened up the statistics software package I now use to build predictive models. I had read Peter Wylie’s book, Data Mining for Fund Raisers, so I had the basic idea in my head, plus a dose of Peter’s larger-than-life enthusiasm. The next step was to download a trial version of Data Desk to see if I could apply what I’d read to some of my own data. But I was a long way off from knowing how to build my first model.

Here’s what I saw:

It was a tabula rasa. Much like my brain. Exciting things may come from these blank-slate moments, but not this time — I had no idea what to do first. I clicked on some of the menus, like the one below, which didn’t help. Even after loading my data, a simple paste operation from Excel, I was missing the “now do this” element.

So I did what many others have done with a stats package they’ve looked at for the first time — I closed and uninstalled it. (I’ve done the same with SPSS, Minitab and other programs.) I could have tinkered with it and made some progress on my own, but I had pressing work to do. Data mining was a personal interest, not a priority. It wasn’t the latest crisis du jour and therefore it wasn’t “work”.

I don’t blame the software. Help files and manuals can be quite good. But most good software is capable of doing a lot more than just the one task one seeks to carry out; the manual will be more general and comprehensive than required. Translating Peter’s straightforward method to precise steps in Data Desk required me to isolate those functions in the software, and I had no luck with that. As well, the manual was full of stats terms I was not familiar with.

Fortunately the story didn’t end there. Peter himself, aware of my interest, worked with me to show how I could get smart about using our data. Thus armed, I was able to convince my manager that we needed to invest in one-on-one training.

What did training accomplish that working on my own could not?

  • One, the training was couched in the language of fundraising, not statistics. Terms from statistics were introduced as needed, and selectively. A comprehensive understanding of stats was not the goal.
  • Two, it was specific to the software that I was actually using. This allowed every step to be as concrete as, “Next, click on the Manip menu and select …”. I was shown how to use the small set of software features that I really needed, and we ignored the rest.
  • Three, it was specific to my own data. I learned through the process of building a model for our own institution, with data pulled from our own database. It was the first time I had seen our alumni and donation data presented this way. If we had never proceeded to full-on data mining, I still would have learned a lot about our constituency.

Analytics is a popular topic of discussion at fundraising conferences, where everyone says the right things about predictive modeling and data-driven decision making. And yet, how many development offices are doing the work? Not as many as could be.

The bad news is, there is a skills shortage. The good news is, filling the shortage does not mean hiring analysts with advanced degrees in statistics (although, three cheers for you if you do). You or others in your office can do the work — but only if the barriers are removed.

What are the barriers? They are the flipside of the three strengths of one-on-one training:

  • One, many of the relevant books and online resources are couched in the language of statistics. Which elements of statistics are necessary to understand and which are optional is not made explicit. As well, there are numerous approaches to modeling, which confuses anyone trying to focus on the approach that works best for their application.
  • Two, the mechanics of modeling differ from software package to software package. A development office staff person looking for the exact set of steps to accomplish one specific task is not likely to find what they’re looking for.
  • Three, the would-be analyst needs to work with data from their own database and learn how to look at it in a whole new way. It helps if the teaching resource you’re using talks about data from an alumni or fundraising perspective, but even within that world, everyone’s data is different.

Any one of the three barriers may be surmountable on its own; it’s the fact that all three occur together that stops people in their tracks. That’s what happened to me in my tabula rasa moment. It’s like someone who’s never been in a kitchen before needing to cook a specific meal for which there is no recipe — because in the analytics kitchen, a recipe is not only specific to the desired dish (the outcome), but to the oven (software) and to the ingredients on hand (data). Any specific recipe would have to be adapted, which is too much to ask of the beginner cook. Conversely, any overall method that attempts to explain more than one dish, more than one brand of oven, and an endless variety of ingredients is too general to be called a recipe.

For these reasons, when people ask me how to get started in predictive modeling, I always steer them toward one-on-one training. Nothing else really works. Conference sessions can inspire, or lead to a new idea or two, but it stops there. Books are great, but there isn’t a single book that contains a step-by-step guide that covers more than a fraction of fundraising modeling situations. The Internet can be a wonderful resource, but much of what you’ll find is highly technical, doesn’t apply directly to our purposes, and is completely lacking a road map for the uninitiated.

Sadly, this blog has to be counted among the resources that don’t make the grade. I think CoolData does some things well: Addressing a gap, I have always used examples drawn from alumni, nonprofit and donor data; I’ve tried to string my ideas together in some kind of order (Guide to CoolData), and I’ve tried to stay focused on one outcome (behaviour prediction for segmentation, essentially) and one modeling technique (regression), instead of straying too often into other areas.

But I have not provided anything like a step-by-step guide that works for a majority of people who are interested in data mining but don’t know how to go about it. Not that I think it’s impossible. One-on-one training is superior to “book learning,” but I believe there ought to be options for other learning styles. A chef must learn the art in the presence of a master, but the rest of us have recipe books. While no one can deny the superiority of the former, the majority of us get by in the kitchen using the latter — and some dine very well thereby.

It would be an interesting challenge to come up with a way to convey how to do predictive modeling to a beginner in a way that balances the specificity of the recipe book with the endless variety of our real-world data kitchens. Such a product (whatever form it takes) might not be a substitute for training, but it could either augment training or at least get one started. Unlike this blog, it would probably not be free.

Well, it’s something to think about.
