CoolData blog

28 June 2015

Data mining in the archives

Filed under: Data, Predictor variables — kevinmacdonell @ 6:24 pm

 

When I was a student, I worked in a university archives to earn a little money. I spent many hours penciling consecutive index numbers onto acid-free paper folders, on the ultra-quiet top floor of the library. It was as dull a job as one can imagine.

 

Today’s post is not about that kind of archive. I’m talking about database archive views, also called snapshots. They’re useful for reporting and business intelligence, but they can also play a role in predictive modelling.

 

What is an archive view?

 

Think of a basic stat such as “number of living alumni”. This number changes constantly as new alumni join the fold and others are identified as deceased. A straightforward query will tell you how many living alumni there are, but that number will be out of date tomorrow. What if someone asks you how many living alumni you had a year ago? Then it’s necessary to take grad dates and death dates into account in order to generate an estimate. Or, you look the number up in previously-reported statistics.

 

A database archive view makes such reporting relatively easy by preserving the exact status of a record at regular points in time. The ideal archive is a materialized view in a data warehouse. On a given schedule (yearly, quarterly, or even monthly), an automated process adds fresh rows to an archive table that keeps getting longer and longer. You’re likely reliant on central IT services to set it up.

 

“Number of living alumni” is an important denominator for such key ratios as the percentage of alumni for whom you have contact information (mail, phone, email) and participation rates (the proportion of alumni who give). Every gift is entered as an individual transaction record with a specific date, which enables reporting on historical giving activity. This tends not to be true of contact information. Even though mailing addresses may be added one after another, without overwriting older addresses, the key piece of information is whether the address is coded ‘valid’ or ‘invalid’. This changes all the time, and your database may not preserve a history of those changes. Contact information records may have “To” and “From” dates associated with them, but your query will need to do a lot of relative-date calculations to determine if someone was both alive and had a valid address for any given point in time in the past.
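
To make that concrete, the point-in-time logic looks roughly like the sketch below: a minimal illustration in pandas with made-up records, where the table and column names (FROM_DATE, TO_DATE, STATUS, DECEASED_DATE) are hypothetical stand-ins for whatever your own system calls them.

import pandas as pd

# Hypothetical extracts: one row per alumnus, one row per address record
alumni = pd.DataFrame({
    "ID": ["A00001", "A00002"],
    "DECEASED_DATE": [pd.NaT, pd.Timestamp("2014-03-15")],
})
addresses = pd.DataFrame({
    "ID": ["A00001", "A00002"],
    "STATUS": ["valid", "valid"],
    "FROM_DATE": pd.to_datetime(["2010-05-01", "2013-09-01"]),
    "TO_DATE": [pd.NaT, pd.NaT],
})

as_of = pd.Timestamp("2014-06-28")   # the point in the past we want to reconstruct

alive = alumni[alumni["DECEASED_DATE"].isna() | (alumni["DECEASED_DATE"] > as_of)]
valid = addresses[(addresses["STATUS"] == "valid")
                  & (addresses["FROM_DATE"] <= as_of)
                  & (addresses["TO_DATE"].isna() | (addresses["TO_DATE"] >= as_of))]

# Living alumni who also had a valid address on the as-of date
addressable = alive[alive["ID"].isin(valid["ID"])]
print(len(alive), len(addressable))

And even this assumes that the ‘valid’ or ‘invalid’ code you see today is the code that applied back then, which, as noted, your database may not be able to tell you.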

 

An archive table obviates the need for this complex logic, and ours looks like the example below. There’s the unique ID of each individual, the archive date, and a series of binary indicator variables — ‘1’ for “yes, this data is present” or ‘0’ for “this data is absent”.

 

[Image: sample rows from the archive table, showing ID, archive date, and 0/1 indicator columns]

 

Here we see three individuals and how their data has changed over three months in 2015. This is sorted by ID rather than by the order in which the records were added to the archive, so that you can see the journey each person has taken in that time:

 

  • A00001 had no valid email in the database in February and March, but we obtained it in time for the April 1 snapshot.
  • A00002 had no contact information at all until just before March, when a phone append supplied us with a new number. The number proved to be invalid, however, and when we coded it as such in the database, the indicator reverted to zero.
  • A00003 appeared in our data in February and March, but that person was coded deceased in the database before April 1, and was excluded from the April snapshot.

 

That last bullet point is important. Once someone has died, continuing to include them as a row in the archive every month would be a waste of resources. In your reporting software, a simple count of records by archive date will give you the number of living alumni. A simple sum of ‘Address Indicator’ will give you the number of alumni with valid addresses. Dividing the number of valid addresses by the number of living alumni (and multiplying by 100) will give you the percentage of living alumni that are addressable for that month. (Reporting software such as Tableau will make very quick work of all this.)
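
If you prefer a script to a reporting tool, the same arithmetic takes only a few lines of pandas. Here is a minimal sketch with made-up rows; the column names (ARCHIVE_DATE, ADDRESS_IND) are hypothetical stand-ins for whatever the archive actually uses.

import pandas as pd

# Hypothetical archive extract: one row per living alumnus per snapshot date
archive = pd.DataFrame({
    "ID":           ["X1", "X2", "X1", "X2", "X1"],
    "ARCHIVE_DATE": pd.to_datetime(["2015-02-01", "2015-02-01",
                                    "2015-03-01", "2015-03-01", "2015-04-01"]),
    "ADDRESS_IND":  [1, 0, 1, 0, 1],
})

monthly = archive.groupby("ARCHIVE_DATE").agg(
    living_alumni=("ID", "count"),            # rows per snapshot = living alumni
    valid_addresses=("ADDRESS_IND", "sum"),   # sum of the 0/1 indicator
)
monthly["pct_addressable"] = 100 * monthly["valid_addresses"] / monthly["living_alumni"]
print(monthly)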

 

Because an archive view preserves changing statuses over time at the level of the individual constituent, it can be used for reporting trends along any slice you choose (age bracket, geography, school, etc.), and can play a role in staff activity/performance reporting and alumni engagement scoring.

 

But enough about archive views themselves. Let’s talk about using them for predictive modelling.

 

In the archive example above, you see a bunch of 0/1 indicator variables. Indicator variables are common in predictive modelling. For example, “Mailing address present” can have one of two states: Present or not present. It’s binary. A frequency breakdown of my data at this point in time looks like this (in Data Desk):

 

[Image: Data Desk frequency breakdown of the Address Indicator variable (0/1)]

 

About 78% of living alumni have a valid address in the database today — the records with an address indicator of ‘1’. As you might expect, alumni with a good address are more likely to have given than alumni without, and they have much higher lifetime giving on average. In the models I build to predict likelihood to give (and give at higher levels), I almost always make use of this association between contact information and giving.

 

But what about using the archive view data instead? The ‘Address Indicator’ variable breakdown above shows me the current situation, but the archive view adds depth by going back in time. Our own archive has been taking monthly snapshots since December of last year — seven distinct points in time. Summing on “Address Indicator” for each ID shows that large numbers of alumni have either never had a valid address during that time (0 out of 7 months), or always did (7 out of 7). The rest had a change of status during the period, and therefore fall between 0 and 7:

 

[Image: frequency breakdown of Address Count, i.e. months with a valid address (0 to 7)]

 

A few hundred alumni (387) had a valid address in one out of seven months, 143 had a valid address in two out of seven — and so on. Our archive is still very young; only about 1% of alumni have a count that is not 0 or 7. A year from now, we can expect to see far more constituents populating the middle ground.
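
Producing that count is simple enough. Continuing the pandas sketch from earlier, it is just the 0/1 indicator summed within each ID, followed by a frequency breakdown of the result:

# archive: the same hypothetical snapshot table as above (ID, ARCHIVE_DATE, ADDRESS_IND)
address_count = archive.groupby("ID")["ADDRESS_IND"].sum().rename("ADDRESS_COUNT")

# How many alumni never had a valid address, always did, or fall in between
print(address_count.value_counts().sort_index())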

 

What is most interesting to me is an apparent relationship between “number of months with valid address” (x-axis) and average lifetime giving (y-axis), even with the relative scarcity of data:

 

[Image: chart of average lifetime giving (y-axis) by number of months with valid address (x-axis)]

 

My real question, of course, is whether these summed, continuous indicators really make much of a difference in a model over simply using the more familiar binary variables. The answer is “not yet — but someday.” As I noted earlier, only about 1% of living alumni have changed status in the past seven months, so even though this relationship seems linear, the numbers aren’t there to influence the strength of correlation. The Pearson correlation for “Address Indicator” (0/1) and “Lifetime Giving” is 0.186, which is identical to the Pearson correlation for “Address Count” (0 to 7) and “Lifetime Giving.” For all other variables except one, the archive counts have only very slightly higher correlations with Lifetime Giving than the straight indicator variables. (Email is slightly lower.)
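
Checking that is a two-line job in most stats packages. In pandas it might look like the following, assuming a per-person export (the file and column names here are hypothetical):

import pandas as pd

# Hypothetical per-person export with lifetime giving, the current 0/1 indicator,
# and the 0-to-7 archive count
df = pd.read_csv("archive_model_file.csv")

print(df["LT_GIVING"].corr(df["ADDRESS_IND"]))     # Pearson is the pandas default
print(df["LT_GIVING"].corr(df["ADDRESS_COUNT"]))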

 

It’s early days yet. All I can say is that there is potential. Have a look at this pair of regression analyses, both using Lifetime Giving (log-transformed) as the dependent variable. (Click on image for larger view.) In the window on the left, all the independent variables are the regular binary indicator variables. On the right, the independent variables are counts from our archive view. The difference in R-squared from one model to the other is very slight, but headed in the right direction: From 12.7% to 13.0%.

 

[Image: side-by-side regression analyses, binary indicators (left) vs. archive counts (right)]
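
For anyone who wants to reproduce this kind of comparison in Python rather than in a desktop stats package, a comparable pair of models can be sketched with statsmodels. The file and column names are hypothetical, and I use log(1 + giving) to handle zero values, which may differ slightly from the transformation used above:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical per-person export with the binary indicators, the archive counts,
# and lifetime giving
df = pd.read_csv("archive_model_file.csv")

y = np.log1p(df["LT_GIVING"])   # log of (1 + lifetime giving), to handle zeros

binary_cols = ["ADDRESS_IND", "EMAIL_IND", "PHONE_IND"]        # 0/1 indicators
count_cols  = ["ADDRESS_COUNT", "EMAIL_COUNT", "PHONE_COUNT"]  # 0-to-7 archive counts

model_binary = sm.OLS(y, sm.add_constant(df[binary_cols])).fit()
model_counts = sm.OLS(y, sm.add_constant(df[count_cols])).fit()
print(model_binary.rsquared, model_counts.rsquared)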

 

Looking back on my student days, I cannot deny that I enjoyed even the quiet, dull hours spent in the university archives. Fortunately, though, and due in no small part to cool data like this, my work since then has been a lot more interesting. Stay tuned for more from our archives.

 

11 May 2015

A new way to look at alumni web survey data

Filed under: Alumni, Surveying, Vendors — kevinmacdonell @ 7:38 pm

Guest post by Peter B. Wylie, with John Sammis

 

Click to download the PDF file of this discussion paper: A New Way to Look at Survey Data

 

Web-based surveys of alumni are useful for all sorts of reasons. If you go to the extra trouble of doing some analysis — or push your survey vendor to supply it — you can derive useful insights that could add huge value to your investment in surveying.

 

This discussion paper by Peter B. Wylie and John Sammis demonstrates a few of the insights that emerge by matching up survey data with some of the plentiful data you have on alums who respond to your survey, as well as those who don’t.

 

Neither alumni survey vendors nor their higher education clients are doing much work in this area. But as Peter writes, “None of us in advancement can do too much of this kind of analysis.”

 

Download: A New Way to Look at Survey Data

 

 

5 May 2015

Predictive modelling for the nonprofit organization

Filed under: Non-university settings, Why predictive modeling? — kevinmacdonell @ 6:15 pm

 

Predictive modelling enables an organization to focus its limited resources of time and money where they will earn the best return, using data. People who work at nonprofits can probably relate to the “limited resources” part of that statement. But is it a given that predictive analytics is possible or necessary for any organization?

 

This week, I’m in Kingston, Ontario to speak at the conference of the Association of Fundraising Professionals, Southeastern Ontario Chapter (AFP SEO). As usual I will be talking about how fundraisers can use data. Given the range of organizations represented at this conference, I’m considering questions that a small nonprofit might need to answer before jumping in. They boil down to two concerns, “when” and “what”:

 

When is the tipping point at which it makes sense to employ predictive modelling? And how is that tipping point defined — dollars raised, number of donors, size of database, or what?

 

What kind of data do we need to collect in order to do predictive modelling? How much should we be willing to spend to gather that data? What type of model should we build?

 

These sound like fundamental questions, yet I’ve rarely had to consider them. In higher education advancement, the questions are answered already.

 

In the first case, most universities are already over the tipping point. Even relatively small institutions have more non-donor alumni than they can solicit all at once via mail and phone — it’s just too expensive and it takes too much time. Prioritization is always necessary. Not all universities are using predictive modelling, but all could certainly benefit from doing so.

 

Regarding the second question — what data to collect — alumni databases are typically rich in the types of data useful for gauging affinity and propensity to give. Knowing everyone’s age is a huge advantage, for example. Even if the Advancement office doesn’t have ages for everyone, it at least has class year, which is usually a good proxy for age. Universities don’t always do a great job of tracking key engagement factors (event attendance, volunteering, and so on), but I’ve been fortunate to have enough of this existing data to build robust models.

 

The situation is different for nonprofits, including small organizations that may not have real databases. (That situation was the topic I wrote about in my previous post: When does a small nonprofit need a database?) One can’t simply assume that predictive modelling is worth the trouble, nor can one assume that the data is available or worth investing in.

 

Fortunately the first question isn’t hard to answer, and I’ve already hinted at it. The tipping point occurs when the size of your constituency is so large that you cannot afford to reach out to all of them simultaneously. Your constituency may consist of any combination of past donors, volunteers, clients of your services, ticket buyers and subscribers, event attendees — anyone who has a reason to be in your database due to some connection with your organization.

 

Here’s an extreme example from the non-alumni charity world. Last year’s ALS Ice-Bucket Challenge already seems like a long time ago (which is the way of any social media-driven frenzy), but the real challenge is now squarely on the shoulders of ALS charities. Their constituency has grown by millions of new donors, but there is no guarantee that this windfall will translate into an elevated level of donor support in the long run. It’s a massive donor-retention problem: Most new donors will not give again, but retaining even a fraction could lead to a sizeable echo of giving. It always makes sense to ask recent donors to give again, but I think it would be incredibly wasteful to attempt reaching out to 2.5 million one-time donors. The organization needs to reach out to the right donors. I have no special insight into what ALS charities are doing, but this scenario screams “predictive modelling” to me. (I’ve written about it here: Your nonprofit’s real ice bucket challenge.)

 

None of us can relate to the ice-bucket thing, because it’s almost unique, but smaller versions of this dilemma abound. Let’s say your theatre company has a database with 20,000 records in it — people who have purchased subscriptions over the years, plus single-ticket buyers, plus all your donors (current and long-lapsed). You plan to run a two-week phone campaign for donations, but there’s no way you can reach everyone with a phone number in that limited time. You need a way to rank your constituents by likelihood to give, in order to maximize your return.

 

(About five years ago, I built a model using data from a symphony orchestra’s database. Among other things, I found that certain combinations of concert series subscriptions were associated with higher levels of giving. So: you don’t need a university alumni database to do this work!)

 

It works with smaller numbers, too. Let’s say your college has 1,000 alumni living in Toronto, and you want to invite them all to an event. Your budget allows a mail piece to be sent to just 250, however. If you have a predictive model for likelihood to attend an event, you can send mail to only the best prospective attendees, and perhaps email the rest.

 

In a reverse scenario, if your charity has 500 donors and you’re fully capable of contacting and visiting them all as often as you like, then there’s no business need for predictive modelling. I would also note that modelling is harder to do with small data sets, entailing problems such as overfitting. But that’s a technical issue; it’s enough to know that modelling is something to consider only at the point when resources won’t cover the need to engage with your whole constituency.

 

Now for the second question: What data do you need?

 

My first suggestion is that you look to the data you already have. Going back to the example of the symphony orchestra: The data I used actually came from two different systems — one for donor management, the other for ticketing and concert series subscriptions. The key was that donors and concert attendees were each identified with a unique ID that spanned both databases. This allowed me to discover that people who favoured the great Classical composers were better donors than those who liked the “pops” concerts — but that people who attended both were the best donors of all! If the orchestra intended to identify a pool of prospects for leadership gifts, this would be one piece of the ranking score that would help them do it.
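
To give a flavour of the mechanics, here is a toy sketch of stitching the two systems together on that shared ID. The records and numbers are made up, and the column names are hypothetical:

import pandas as pd

# Hypothetical extracts from the two systems, linked by the shared constituent ID
donors = pd.DataFrame({
    "ID": [101, 102, 103],
    "LT_GIVING": [250.0, 0.0, 1200.0],
})
subscriptions = pd.DataFrame({
    "ID": [101, 103, 103],
    "SERIES": ["Pops", "Classical", "Pops"],
})

# One 0/1 flag per concert series for each person, then joined back to the donor records
series_flags = pd.crosstab(subscriptions["ID"], subscriptions["SERIES"]).clip(upper=1).reset_index()
merged = donors.merge(series_flags, on="ID", how="left").fillna(0)

# Average lifetime giving by subscription pattern
print(merged.groupby(["Classical", "Pops"])["LT_GIVING"].mean())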

 

So: Explore your existing data. And while you’re doing so, don’t assume that messy, old, or incomplete data is not useable. It’s usually worth a look.

 

What about collecting new data? This can be an expensive proposition, and I think it would be risky to gather data just so you can build predictive models. There is no guarantee that what you’re spending time and money to gather is actually correlated with giving or other behaviours. My suggestion would be to gather data that serves operational purposes as well as analytical ones. A good example might be event attendance. If your organization holds a lot of events, you’ll want to keep statistics on attendance and how effective each event was. If you can find ways to record which individuals were at the event (donors, volunteers, community members), you will get this information, plus you will get a valuable input for your models.

 

Surveying is another way organizations can collect useful data for analysis while also serving other purposes. It’s one way to find out how old donors are — a key piece of information. Just be sure that your surveys are not anonymous! In my experience, people are not turned off by non-anonymous surveys so long as you’re not asking deeply personal questions. Offering a chance to win a prize for completing the survey can help.

 

Data you might gather on individuals falls into two general categories: Behaviours and attributes.

 

Behaviours are any type of action people take that might indicate affinity with your organization. Giving is obviously the big one, but other good examples would be event attendance or volunteering, or any type of interaction with your organization.

 

Attributes are just characteristics that prospects happen to have. This includes gender, where a person lives, age, wealth information, and so on.

 

Of the two types, behavioural factors are always the more powerful. You can never go wrong by looking at what people actually do. As the saying has it, people give of their time, talent, and treasure. Focus on those interactions first.

 

People also give of something else that is increasingly valuable: Their attention. If your organization makes use of a broadcast email platform, find out if it tracks opens and click-throughs — not just at the aggregate level, but at the individual level. Some platforms even assign a score to each email address that indicates the level of engagement with your emails. If you run phone campaigns, keep track of who answers the call. The world is so full of distractions, these periods of time when you have someone’s full attention are themselves gifts — and they are directly associated with likelihood to give financially.

 

Attributes are trickier. They can lead you astray with correlations that look real, but aren’t. Age is always a good thing to have, but gender is only sometimes useful. And I would never purchase external data (census and demographic data, for example) for predictive modelling alone. Aggregate data at the ZIP or postal code level is useful for a lot of things, but is not the strongest candidate for a model input. The correlations with giving to your organization will be weak, especially in comparison with the behavioural data you have on individuals.

 

What type of model does it make sense for a nonprofit to try to build first? Any modelling project starts with a clear statement of the business need. Perhaps you want to identify which ticket buyers will convert to donors, or which long-lapsed donors are most likely to respond positively to a phone call, or who among your past clients is most likely to be interested in becoming a volunteer.

 

Whatever it is, the key thing is that you have plenty of historical examples of the behaviour you want to predict. You want to have a big, fat target to aim for. If you want to predict likelihood to attend an event and your database contains 30,000 addressable records, you can be quite successful if 1,000 of those records have some history of attending events — but your model will be a flop if you’ve only got 50. The reason is that you’re trying to identify the behaviours and characteristics that typify the “event attendee,” and then go looking in your “non-attendee” group for those people who share those behaviours and characteristics. The better they fit the profile, the more likely they are to respond to an event invitation. Fifty people is probably not enough to define what is “typical.”

 

So for your first foray into modelling, I would avoid trying to hit very small targets. Major giving and planned giving propensity tend to fall into that category. I know why people choose to start there — because it implies high return on investment — but you would be wise to resist.

 

At this point, someone who’s done some reading may start to obsess about which highly advanced technique to use. But if you’re new to hands-on work, I strongly suggest using a simple method that requires you to study each variable individually, in relation to the outcome you’re trying to model. The best starting point is to get familiar with comparing groups (attendees vs. non-attendees, donors vs. non-donors, etc.) using means and medians, preferably with the aid of a stats software package. (Peter Wylie’s book, Data Mining for Fundraisers, has this covered.) From there, learn a bit more about exploring associations and correlations between variables by looking at scatterplots and using Pearson product-moment correlation. That will set you up well for learning to do multiple linear regression, if you choose to take it that far.
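
To make that starting point concrete: once your data is in one table, comparing groups on means and medians is a one-liner. A sketch with made-up numbers and hypothetical column names:

import pandas as pd

# Hypothetical extract: one row per constituent
df = pd.DataFrame({
    "DONOR":       [1, 1, 0, 0, 1, 0],   # 1 = has ever given
    "EVENTS":      [3, 1, 0, 0, 2, 1],   # events attended
    "EMAIL_VALID": [1, 1, 0, 1, 1, 0],   # valid email on file
})

# Compare donors and non-donors on each variable
print(df.groupby("DONOR").agg(["mean", "median"]))

# Next step: correlations between the variables (Pearson by default)
print(df.corr())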

 

In sum: Predictive modelling isn’t for everyone, but you don’t need Big Data or a degree in statistics to get some benefit from it. Start small, and build from there.

 

3 May 2015

When does a small nonprofit need a database?

Filed under: Non-university settings — kevinmacdonell @ 9:25 am

 

I had a dream a few nights ago in which I was telling my wife about a job interview I’d just had. A small rural Anglican church serving British expats was hiring a head of fund development. (I have very specific dreams.) I lamented that I had forgotten to ask some key questions: “I don’t even know if they have a database!”

 

Not all of my dreams are that nerdy. The fact is, nonprofit organizations (as opposed to higher education institutions — my usual concern) are on my mind lately, as I am preparing a conference presentation for a group that includes the full range of organizations, many of them small. I’m presenting on predictive modelling, but like that rural church, some organizations may not yet have a proper database.

 

When should an organization acquire some kind of database system or CRM?

 

Any organization, no matter how small, has to track activity and record information for operational purposes. This may be especially true for nonprofits that need to report on the impact they’re having in the community. I usually think in terms of tracking donors, but nonprofits may have an additional need to track clients and services.

 

Alas, the go-to is often the everyday Excel spreadsheet. It’s clear why: Excel is flexible, adaptable, comprehensible, and ubiquitous. Plus, if you’re a whiz, there are advanced features to explore. But while an Excel file can store data, it is NOT a true database. For a growing nonprofit, managing everything in spreadsheets will become an expensive liability. You may have already achieved a painful awareness of that fact. For others who aren’t there yet, here are a few warning signs that spreadsheets have outstayed their welcome in your office.

 

One: Even on a wide screen at 80% zoom, you have to do a lot of horizontal scrolling.

 

At the start, a spreadsheet seems so straightforward … A column each for First Name, Last Name, and some more columns for address information, phone and email. Then one day, you have a client or donor who has a second address — a business or seasonal address — and she wants to get your newsletter at one or the other, depending on the time of year. Both addresses are valid, so you need to add more columns. Hmm, and of course you want to track who attended your last event. If someone attends an event in July and another in December, you’ll need a column to record each event. As each volunteer has a new activity, as each client has a new interaction with your services, you are adding more and more columns until the sideways scrolling gets ridiculous.

 

Two: Your spreadsheet has so many rows that it is unwieldy to find or update individual records.

 

It’s technically true that an Excel file can store a million rows, but you probably wouldn’t want to open such a file on your computer. Files with just a few thousand rows can cause trouble after they’ve been worked over long enough. You can always tell a spreadsheet that’s been used to store data in the place of a true database, especially if more than one person has been mucking around in it. It’s in rough shape. In particular, errors made while sorting rows can lead to lost data and headaches all round.

 

Three: Several spreadsheets are being maintained separately, tracking different types of data on the same people.

 

Given the issues with large files, you’ll soon be tempted to have a separate sheet for each type of data. If you have a number of people on staff, each might be independently tracking the information that is relevant to their own work: One person tracking donors, another volunteers, another event attendees. John Doe might exist as a row in one or more of these separate files. If each file contains contact information, every change of address becomes a big deal, as it has to be applied in multiple places. Inevitably, the files get out of sync. As bad or worse, insights are not being shared across data files. Reporting is cumbersome, and anything like predictive modelling is impossible.

 

If this sounds like your situation, know that you’re not alone. I would be lying if I said rampant Excel use doesn’t occur in the (often) better-resourced world of higher education. Of course it does. Sometimes people don’t have the kind of access to the data they need, sometimes the database doesn’t have a module tailored to their business requirements, and sometimes people can’t be bothered to care about institution-wide data integrity. Shadow databases are a real problem on large campuses, and some of those orphan data stores are in Excel.

 

There’s nothing magic about a true database. It’s all about structure. A database stores data in tables, behind the scenes, and each table is very similar to a spreadsheet: it’s rectangular, and made up of rows and columns. The difference is that a single table usually holds only one type of data: Addresses, for example, or gift transactions. A table may be very long, with millions of rows, but it is typically not very wide, because each table serves only one purpose. As a consequence, a database has to have many tables, one for each thing needing to be stored. A complex enterprise database could have thousands of tables.

 

This sounds like chaos, but every record in a table contains a reference to data in another table. Tables are joined together by these identifiers, or keys. This allows a query of the database to retrieve John Smith from the ‘names’ table, the proper address for John Smith from the ‘addresses’ table, a sum of gifts made by John Smith from the ‘gifts’ table, and a volunteer status code for John Smith from the ‘volunteers’ table. When John Smith moves and provides his new address, that information is added as a new record in the ‘addresses’ table, attached to his unique identifier (i.e., his ID number). The old address is not deleted, but is marked ‘invalid’, so that the information is retained but never appears on a list of valid addresses. One place, one change — and it’s done.
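
For the technically curious, here is a toy version of that structure, using little pandas tables to stand in for database tables (a real CRM does the equivalent behind the scenes with SQL). The tables and values are made up for illustration:

import pandas as pd

# Separate tables, each holding one kind of data, linked by a constituent ID
names = pd.DataFrame({"ID": [1], "NAME": ["John Smith"]})
addresses = pd.DataFrame({
    "ID": [1, 1],
    "ADDRESS": ["12 Old Rd", "34 New St"],
    "STATUS": ["invalid", "valid"],   # the old address is kept, but marked invalid
})
gifts = pd.DataFrame({"ID": [1, 1], "AMOUNT": [50.0, 25.0]})

# One query reassembles the pieces: name, current address, total giving
current_address = addresses[addresses["STATUS"] == "valid"]
total_giving = gifts.groupby("ID", as_index=False)["AMOUNT"].sum()
print(names.merge(current_address, on="ID").merge(total_giving, on="ID"))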

 

That’s a quick and rather inadequate description of what a database is and does. There’s more to a donor management system than just a table structure, and I could say plenty more about user interfaces, reporting, and data integrity and security. But there is no shortage of information and guidance online, so I will leave you with a few places to go for good advice. There are many software solutions out there for organizations big and small.

 

Robert L. Weiner is a nonprofit technology consultant, helping fundraisers choose software tools. Check out his Ten Common Mistakes in Selecting Donor Databases (And How to Avoid Them). As you proceed toward acquiring a system, here is a piece published by AFP that has good, basic advice about how to manage it: Overcoming Database Demons.

 

Andrew Urban is author of a great book that helps guide nonprofits large and small in making wise choices in software and systems investments: The Nonprofit Buyer: Strategies for Success from a Nonprofit Technology Sales Veteran.

 

That’s all from me on this … CoolData’s domain is not systems or databases, but the data itself. A good system is simply a basic requirement for analysis. In my next post, I will address another question a small nonprofit might have: At what point is a nonprofit “big” enough to be able to get benefit from doing predictive modelling?

 

19 April 2015

Planned Giving prospect identification, driven by data

Filed under: Planned Giving, Prospect identification — kevinmacdonell @ 6:28 pm

I’m looking forward to giving two presentations in my home city in connection with this week’s national conference of the Canadian Association of Gift Planners (CAGP). In theory I’ll be talking about data-driven prospect identification for Planned Giving … “in theory” because my primary aim isn’t to provide a how-to for analyzing data.

 

Rather, I will urge fundraisers to seek “data partners” in their organizations — finding that person who is closest to the data — and to pose some good questions. There’s a lot of value hidden in your data, and you can’t realize this value alone: You’ve got to work closely with your colleagues in Advancement Services or with any researcher, analyst, or IT person who can get you what you need. And you have to be able to tell that person what you’re looking for.

 

For a shop that’s done little or no analysis of their data, I would start with these two basic questions:

 

  1. What is the average age of new expectancies, at the time they became known to your organization?
  2. What is the size of your general prospect pool?

 

The answer to the first question might suggest that more active prospect identification is required, of the type more often associated with major-gift fundraising. If the average age is 75 or older, I have to think that earlier identification of bequest intentions would benefit donor and cause alike, by allowing for a longer period for the conversation to mature and for the relationship to develop.
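
For the data partner, that first question is a one-line calculation once the right dates are exported. A quick sketch with made-up records and hypothetical column names:

import pandas as pd

# Hypothetical export of new expectancies, with birth date and the date
# each expectancy became known to the organization
exp = pd.DataFrame({
    "ID": [201, 202, 203],
    "BIRTH_DATE":      pd.to_datetime(["1938-04-02", "1945-10-19", "1951-07-30"]),
    "DATE_IDENTIFIED": pd.to_datetime(["2013-06-01", "2014-02-15", "2015-01-10"]),
})

age_at_identification = (exp["DATE_IDENTIFIED"] - exp["BIRTH_DATE"]).dt.days / 365.25
print(round(age_at_identification.mean(), 1))   # average age when the expectancy became known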

 

The answer to the second question gives an indication of the potential that exists in the database — but also the challenge of zeroing in on the few people (the top 100, say) in that universe of prospects who are most likely to accept a personal visit. Again, I’m talking about high-touch fundraising — more like Major Gifts, less like Annual Fund.

 

As Planned Giving professionals get comfortable asking questions of the data, the quality of the questions should improve. Ideally, the analyses will move from one-off projects to an ongoing process of gathering insights and then applying them. Along those lines, I will be giving attendees at both presentations a taste of how some simple targeting via data mining might work. As Peter Wylie and I wrote in our book, “Score!”, data mining for Planned Giving is primarily about improving the odds of success.

 

I said that I’m giving two presentations. Actually, it’s the same presentation, for two audiences. The first talk will be for a higher ed audience in advance of the conference, and the second will be for a more general nonprofit audience attending the conference proper. I expect the questions and conversations to differ significantly, and I also expect some of my assertions about Planned Giving fundraising to be challenged. Should be interesting!

 

Since you’ve read this far, you might be interested in downloading the handout I’ve prepared for these talks: Data-driven prospect ID for Planned Giving. There’s nothing like being there in person for the conversation we’re going to have, but this discussion paper does cover most of what I’ll be talking about.

 

If you’re visiting Halifax for the conference, welcome! I look forward to meeting with you.

 

1 April 2015

Mind the data science gap

Filed under: Training / Professional Development — kevinmacdonell @ 8:10 pm

 

Being a forward-thinking lot, the data-obsessed among us are always pondering the best next step to take in professional development. There are more options every day, from a Data Science track on Coursera to new masters degree programs in predictive analytics. I hear a lot of talk about acquiring skills in R, machine learning, and advanced modelling techniques.

 

All to the good, in general. What university or large non-profit wouldn’t benefit from having a highly-trained, triple-threat chameleon with statistics, programming, and data analytics skills? I think it’s great that people are investing serious time and brain cells pursuing their passion for data analysis.

 

And yet, one has to wonder, are these advanced courses and tools helping drive bottom-line results across the sector? Are they helping people at nonprofits and university advancement offices do a better job of analyzing their data toward some useful end?

 

I have a few doubts. The institutions and causes that employ these enterprising learners may be fortunate to have them, but I would worry about retention. Wouldn’t these rock stars eventually feel constrained in the nonprofit or higher ed world? It’s a great place to apply one’s creativity, but aren’t the problems and applications one can address with data in our field relatively straightforward in comparison with other fields? (Tailoring medical treatment to an individual’s DNA, preventing terrorism or bank fraud, getting an American president elected?) And then there’s the pay.

 

Maybe I’m wrong to think so. Clearly there are talented people working in our sector who are here because they have found the perfect combination of passions. They want to be here.

 

Anyway — rock star retention is not my biggest concern.

 

I’m more concerned about the rest of us: people who want to make better use of data, but aren’t planning to learn way more than we need or are capable of. I’m concerned for a couple of reasons.

 

First, many of the professional development options available are pitched at a level too advanced to be practical for organizations that haven’t hired a full-time predictive analytics specialist. The majority of professionals working in the non-profit and higher-ed sectors are mainly interested in getting better at their jobs, whether that’s increasing dollars raised or boosting engagement among their communities. They don’t need to learn to code. They do need some basic, solid training options. I’m not sure these are easy to spot among all the competing offerings and (let’s be honest) the Big Data hype.

 

These people need support and appropriate training. There’s a place for scripting and machine learning, but let’s ensure we are already up to speed on means/medians, bar charts, basic scoring, correlation, and regression. Sexy? No. But useful, powerful, necessary. Relatively simple and manual techniques that are accessible to a range of advancement professionals — not just the highly technical — offer a high return on investment. It would be a shame if the majority were cowed into thinking that data analysis isn’t for them just because they don’t see what neural networks have to do with their day to day work.

 

My second concern is that some of the advanced tools of data science are deceptively easy to use. I read an article recently that stated that when it’s done really well, data science looks easy. That’s a problem. A machine-learning algorithm will spit out answers, but are they worth anything? (Maybe.) Does an analyst learn anything about their data by tweaking the knobs on a black box? (Probably not.) Is skipping over the inconvenience of manual data exploration detrimental to gaining valuable insights? (Yes!)

 

Don’t get me wrong — I think R, Python, and other tools are extremely useful for predictive modelling, although not for doing the modelling itself (not in my hands, at least). I use SQL and Python to automate the assembly of large data files to feed into Data Desk — it’s so nice to push a button and have the script merge together data from the database, from our phonathon database, from our broadcast email platform and other sources, as well as automatically create certain indicator variables, pivoting all kinds of categorical variables and handling missing data elegantly. Preparing this file using more manual methods would take days.
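
A stripped-down sketch of what that kind of assembly step might look like follows; the file names and fields are hypothetical stand-ins, not our actual sources:

import pandas as pd

# Hypothetical extracts, each keyed on a constituent ID
alumni    = pd.read_csv("alumni_extract.csv")           # one row per constituent
phonathon = pd.read_csv("phonathon_extract.csv")        # call outcomes
email     = pd.read_csv("email_platform_extract.csv")   # opens and clicks

merged = (alumni
          .merge(phonathon, on="ID", how="left")
          .merge(email, on="ID", how="left"))

# Turn categorical fields into 0/1 indicator columns, and fill gaps left by the joins
merged = pd.get_dummies(merged, columns=["FACULTY", "DEGREE_TYPE"], dummy_na=True)
merged = merged.fillna(0)

merged.to_csv("modelling_file.csv", index=False)   # ready to pull into Data Desk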

 

But this doesn’t automate exploration of the data, it doesn’t remove the need to be careful about preparing data to answer the business question, and it does absolutely nothing to help define that business question. I don’t let a script grind unsupervised through the data and spit out a result seconds later with no subject-matter expertise applied; the real work of building a model is still done manually, in Data Desk, and right now I doubt there is a better way.

 

When it comes to professional development, then, all I can say is, “to each their own.” There is no one best route. The important thing is to ensure that motivated professionals are matched to training that is a good fit with their aptitudes and with the real needs of the organization.

 
