Guest post by Peter B. Wylie, with John Sammis
Peter Wylie and I are each just back home, having presented at the fall conference of the Illinois chapter of the Association of Professional Researchers for Advancement (APRA-IL), hosted at Loyola University Chicago. (See photos, below!) Following an entertaining and fascinating look at the current and future state of predictive analytics presented by Josh Birkholz of Bentz Whaley Flessner, Peter and I gave a live demo of working with real data in Data Desk, with the assistance of Rush University Medical Center. We also drew names to give away a few copies of our book, Score! Data-Driven Success for Your Advancement Team.
We were impressed by the variety and quality of questions from attendees, in particular those having to do with stumbling blocks and barriers to progress. It was nice to be able to reassure people that when it comes to predictive modelling, some things aren’t worth worrying about.
Messy data, for example. Some databases, particularly those maintained by non-higher-ed nonprofits, have data integrity issues such as duplicate records. It would be a shame, we said, if data analysis were pushed to the back burner just because of a lack of purity in the data. Yes, work on improving data integrity — but don’t assume that you cannot derive valuable insights right now from your messy data.
And then the practice of predictive modelling itself … Oh, there is so much advice out there on the net, some of it highly technical and involving a hundred different advanced techniques. Anyone trying to learn on their own can get stymied, endlessly questioning whether what they’re doing is okay.
For them, our advice was this: In our field, you create value by ranking constituents according to their likelihood to engage in a behaviour of interest (giving, usually), which guides the spending of scarce resources where they will do the most good. You can accomplish this without the use of complex algorithms or arcane math. In fact, simpler models are often better models.
The workhorse tool for this task is multiple linear regression. A very good stand-in for regression is building a simple score using the techniques outlined in Peter’s book, Data Mining for Fundraisers. Sticking to the basics will work very well. Fussing over technical issues or striving for a high degree of accuracy are distractions the beginner can safely set aside.
If your shop’s current practice is to pick prospects or other targets by throwing darts, then even the crudest model will be an improvement. In many situations, simply performing better than random will be enough to create value. The bottom line: Just do it. Worry about perfection some other day.
If the decisions are high-stakes, if the model will be relied on to guide the deployment of scarce resources, then insert another step in the process. Go ahead and build the model, but don’t use it. Allow enough time of “business as usual” to elapse. Then, gather fresh examples of people who converted to donors, agreed to a bequest, or made a large gift — whatever the behaviour is you’ve tried to predict — and check their scores.
“Don’t worry, just do it” sounds like motivational advice, but it’s more than that. The fact is, there is only so much model validation you can do at the time you create the model. Sure, you can hold out a generous number of cases as a validation sample to test your scores with. But experience will show you that your scores will always pass the validation test just fine — and yet the model may still be worthless.
A holdout sample of data that is contemporaneous with that used to train the model is not the same as real results in the future. A better way to go might be to just use all your data to train the model (no holdout sample), which will result in a better model anyway, especially if you’re trying to predict something relatively uncommon like Planned Giving potential. Then, sit tight and observe how it does in production, or how it would have done in production if it had been deployed.
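This “wait and see” check can be sketched in a few lines. Everything here is a hypothetical illustration — the function name, the percentile scores, and the donor IDs are all invented — assuming each constituent already carries a model score expressed as a percentile:

```python
from statistics import mean

def out_of_time_check(score_pct, new_converters):
    """Compare the average score percentile of constituents who
    converted AFTER the model was built against the 50th-percentile
    baseline that random selection would produce."""
    converter_scores = [score_pct[cid] for cid in new_converters]
    avg = mean(converter_scores)
    # A model with real predictive power should place converters
    # well above the 50th percentile on average.
    return avg, avg > 50

# Hypothetical percentile scores assigned at model-building time
scores = {"A": 91, "B": 88, "C": 47, "D": 72, "E": 15}
# Constituents who became donors during the "business as usual" period
new_donors = ["A", "B", "D"]
avg_score, beats_random = out_of_time_check(scores, new_donors)
```

If the fresh converters cluster well above the 50th percentile, the model is doing real work in production; if they don’t, no contemporaneous holdout test would have told you so.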
* A heartfelt thank you to APRA-IL and all who made our visit such a pleasure, especially Sabine Schuller (The Rotary Foundation), Katie Ingrao and Viviana Ramirez (Rush University Medical Center), Leigh Peterson Visaya (Loyola University Chicago), Beth Witherspoon (Elmhurst College), and Rodney P. Young, Jr. (DePaul University), who took the photos you see below. (See also: APRA IL Fall Conference Datapalooza.)
Not quite three years ago I blogged about a technique for estimating the age of your database constituents when you don’t have any relevant data such as birth date or class year. It was based on the idea that many first names are typically “young” or “old.” I expanded on the topic in a follow-up post: Putting an age-guessing trick to the test. Until now, I’ve never had a reason to guess someone’s age — alumni data is pretty well supplied in that department. This very month, though, I have not one but two major modeling projects to work on that involve constituents with very little age data present. I’ve worked out a few improvements to the technique, which I will share today.
First, here’s the gist of the basic idea. Picture two women, named Freda and Katelyn. Do you imagine one of them as older than the other? I’m guessing you do. From your own experience, you know that a lot of young women and girls are named Katelyn, and that few if any older women are. Even if you aren’t sure about Freda, you would probably guess she’s older. If you plug these names into babynamewizard.com, you’ll see that Freda was a very popular baby name in the early 1900s, but fell out of the Top 1000 list sometime in the 1980s. On the other hand, Katelyn didn’t enter the Top 1000 until the 1970s and is still popular.
To make use of this information you need to turn it into data. You need to acquire a lot of data on the frequency of first names and how young or old they tend to be. If you work for a university or other school, you’re probably in luck: You might have a lot of birth dates for your alumni or, failing that, you have class years which in most cases will be a good proxy for age. This will be the source you’ll use for guessing the age of everyone else in your database — friends, parents and other person constituents — who don’t have ages. If you have a donor database that contains no age data, you might be able to source age-by-first name data somewhere else.
Back to Freda and Katelyn … when I query our database I find that the average age of constituents named Freda is 69, while the average age for Katelyn is 25. For the purpose of building a model, for anyone named Freda without an age, I will just assume she is 69, and for anyone named Katelyn, 25. It’s as simple as creating a table with two columns (First name and Average age), and matching this to your data file via First Name. My table has more than 13,500 unique first names. Some of these are single initials, and not every person goes by their first name, but that doesn’t necessarily invalidate the average age associated with them.
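The lookup-table step can be sketched in a few lines of Python. This is only an illustration — the names and ages below are invented, not pulled from any real database:

```python
from collections import defaultdict

def build_age_lookup(records):
    """Average the known ages by first name: one entry per name,
    exactly like a two-column (First name, Average age) table."""
    ages = defaultdict(list)
    for first_name, age in records:
        if age is not None:
            ages[first_name].append(age)
    return {name: sum(vals) / len(vals) for name, vals in ages.items()}

def estimate_age(first_name, lookup):
    """Return the average age for the name, or None if the name
    never appears with a known age."""
    return lookup.get(first_name)

# Invented sample of constituents with known ages
known = [("Freda", 71), ("Freda", 67), ("Katelyn", 24), ("Katelyn", 26)]
lookup = build_age_lookup(known)
```

Matching the resulting table back to your modeling file on First Name is then just a left join, whatever tool you use for it.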
I’ve tested this method, and it’s an improvement over plugging missing values with an all-database average or median age. For a data set that has no age data at all, it should provide new information that wasn’t there before — information that is probably correlated with behaviours such as giving.
Now here’s a new wrinkle.
In my first post on this subject, I noted that some of the youngest names in our database are “gender flips.” Some of the more recent popular names used to be associated with the opposite gender decades ago. This seems to be most prevalent with young female names: Ainslie, Isadore, Sydney, Shelly, Brooke. It’s harder to find examples going in the other direction, but there are a few, some of them perhaps having to do with differences in ethnic origin: Kori, Dian, Karen, Shaune, Mina, Marian. In my data I have close to 600 first names that belong to members of both sexes. When I calculate average age by First Name separately for each sex, some names end up with the exact same age for male and female. These names have an androgynous quality to them: Lyndsay, Riley, Jayme, Jesse, Jody. At the other extreme are the names that have definitely flipped gender, which I’ve already given examples of … one of the largest differences being for Ainslie. The average male named Ainslie is 54 years older than the average female of the same name. (In my data, that is.)
These differences suggest an improvement to our age-inferring method: Matching on not just First Name, but Sex as well. Although only 600 of my names are double-gendered, they include many popular names, so that they actually represent almost one-quarter of all constituents.
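Matching on Sex as well as First Name just means keying the lookup on the pair instead of the name alone. A minimal sketch, with invented data chosen to mimic a gender-flipped name:

```python
from collections import defaultdict

def build_age_by_name_sex(records):
    """Average the known ages by (first name, sex) pairs, so that a
    gender-flipped name gets separate male and female estimates."""
    ages = defaultdict(list)
    for first_name, sex, age in records:
        if age is not None:
            ages[(first_name, sex)].append(age)
    return {key: sum(vals) / len(vals) for key, vals in ages.items()}

# Invented data: one name, very different average ages by sex
known = [("Ainslie", "M", 80), ("Ainslie", "M", 76), ("Ainslie", "F", 24)]
by_name_sex = build_age_by_name_sex(known)
```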
Now here’s another wrinkle.
When we’re dealing with constituents who aren’t alumni, we may be missing certain personal information such as Sex. If we plan to match on Sex as well as First Name, we’ve got a problem. If Name Prefix is present, we can infer Sex from whether it’s Mr., Ms., and so on — but unless the person doing the data entry was having an off day, that avenue shouldn’t be open to us: wherever the prefix is known, Sex should already be filled in. (If you know it’s “Mrs.,” then why not put in F for Sex?) For those records without a Sex recorded (or with a Sex of ‘N’), we need to make a guess. To do so, we return to our First Names query and the Sex data we do have.
In my list of 600 first names that are double-gendered, not many are actually androgynous. We have females named John and Peter, and we have males named Mary and Laura, but we all know that given any one person named John, chances are we’re talking about a male person. Mary is probably female. These may be coding errors or they may be genuine, but in any case we can use majority usage to help us decide. We’ll sometimes get it wrong — there are indeed boys named Sue — but if you have 7,000 Johns in your database and only five of them are female, then let’s assume (just for the convenience of data mining*) that all Johns are male.
So: Query your database to retrieve every first name that has a Sex code, and count up the instances of each. The default sex for each first name is decided by the higher count, male or female. To get a single variable for this, I subtract the number of females from the number of males for each first name. Since the result is positive for males and negative for females, I call it a “Maleness Score” — but you can do the reverse and call it a Femaleness Score if you wish! Results of zero are considered ties, or ‘N’.
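The Maleness Score is simple arithmetic, but a sketch makes the tie-handling concrete. The counts below are invented for illustration:

```python
from collections import Counter

def maleness_scores(records):
    """Males minus females for each first name, counted over every
    record that carries a Sex code."""
    score = Counter()
    for first_name, sex in records:
        if sex == "M":
            score[first_name] += 1
        elif sex == "F":
            score[first_name] -= 1
    return dict(score)

def default_sex(first_name, scores):
    """Positive score defaults to 'M', negative to 'F'; zero (or a
    name we have no Sex data for) is a tie, coded 'N'."""
    s = scores.get(first_name, 0)
    return "M" if s > 0 else ("F" if s < 0 else "N")

# Invented counts: mostly-male Johns, an even split of Jodys
coded = [("John", "M")] * 4 + [("John", "F"), ("Jody", "M"), ("Jody", "F")]
scores = maleness_scores(coded)
```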
At this point we’ve introduced a bit of circularity. For any person missing Age and Sex, first we have to guess their sex based on the majority code assigned to that person’s first name, and then go back to the same data to grab the Age that matches up with Name and Sex. Clearly we are going to get it very wrong for a lot of records. You can’t expect these guesses to hold up as well as true age data. Overall, though, there should be some signal in all that noise … if your model believes that “Edgar” is male and 72 years of age, and that “Brittany” is female and 26, well, that’s not unreasonable and it’s probably not far from the truth.
How do we put this all together? I build my models in Data Desk, so I need to get all these elements into my data file as individual variables. You can do this any way that works for you, but I use our database querying software (Hyperion Brio). I import the data into Brio as locally-saved tab-delimited files and join them up as you see below. The left table is my modeling data (or at least the part of it that holds First Name), and the two tables on the right hold the name-specific ages and sexes from all the database records that have this information available. I left-join each of these tables on the First Name field.
When I process the query, I get one row per ID with the fields from the left-hand table, plus the fields I need from the two tables on the right: the so-called Maleness Score, Female Avg Age by FName, Male Avg Age by Fname, and N Avg Age by Fname. I can now paste these as new variables into Data Desk. I still have work to do, though: I do have a small amount of “real” age data that I don’t want to overwrite, and not every First Name has a match in the alumni database. I have to figure out what I have, what I don’t have, and what I’m going to do to get a real or estimated age plugged in for every single record. I write an expression called Age Estimated to choose an age based on a hierarchical set of IF statements. The text of my expression is below — I will explain it in plain English following the expression.
if len('AGE')>0 then 'AGE'
else if textof('SEX')="M" and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="M" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="M" and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="F" and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="F" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="F" and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'>0 and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'<0 and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'=0 and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if len('N avg age by Fname')>0 then 'N avg age by Fname'
else if len('F avg age by Fname')>0 then 'F avg age by Fname'
else if len('M avg age by Fname')>0 then 'M avg age by Fname'
else 49
Okay … here’s what the expression actually does, going block by block through the statements: If a record has a real Age, use it. Otherwise, if Sex is M, use the male average age for that first name; failing that, the ‘N’ (tie) average; failing that, the female average. If Sex is F, apply the same hierarchy in reverse: female average first, then ‘N’, then male. If Sex is N, let the Maleness Score decide which average to try: positive takes the male average, negative the female, and zero the ‘N’ average. If none of those branches finds a value, fall back on whatever average exists for the name (‘N’, then female, then male), and as a last resort plug in a constant age of 49.
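For readers who don’t use Data Desk, the same fallback hierarchy might look like this in Python. The function and field names are my own invention, chosen to mirror the variables in the expression; the records are made up:

```python
def age_estimated(rec):
    """Follow the same hierarchy as the Data Desk expression: real age
    first, then the average for the person's own sex, then the 'N'
    (tie) average, then the opposite sex, then any available average,
    then the constant fallback of 49."""
    if rec.get("age") is not None:
        return rec["age"]
    m, f, n = rec.get("m_avg"), rec.get("f_avg"), rec.get("n_avg")
    sex = rec.get("sex", "N")
    if sex == "N":
        # Let the Maleness Score break the tie
        score = rec.get("maleness", 0)
        sex = "M" if score > 0 else ("F" if score < 0 else "N")
    preferred = {"M": (m, n, f), "F": (f, n, m), "N": (n, f, m)}[sex]
    for candidate in preferred + (n, f, m):
        if candidate is not None:
            return candidate
    return 49  # the expression's constant last resort
```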
When I bin the cases into 20 roughly equal groups by Estimated Age, I see that the percentage of cases that have some giving history starts very low (about 3 percent for the youngest group), rises rapidly to more than 10 percent, and then gradually rises to almost 18 percent for the oldest group. That’s heading in the right direction at least. As well, being in the oldest 5% is also very highly correlated with Lifetime Giving, which is what we would expect from a donor data set containing true ages.
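The binning check itself can be sketched with a simple quantile split. The donor flags below are synthetic, constructed only to show the rising pattern you’d hope to see:

```python
def bin_rates(rows, n_bins=20):
    """Sort (estimated age, is_donor) pairs by age, cut them into
    n_bins roughly equal groups, and return the donor rate in each
    group, youngest bin first."""
    ordered = sorted(rows, key=lambda r: r[0])
    size = len(ordered) / n_bins
    rates = []
    for i in range(n_bins):
        chunk = ordered[int(i * size):int((i + 1) * size)]
        rates.append(sum(flag for _, flag in chunk) / len(chunk))
    return rates

# Invented data: donors concentrated among older constituents
rows = [(age, 1 if age > 60 else 0) for age in range(100)]
rates = bin_rates(rows, n_bins=5)
```

A donor rate that climbs steadily from the youngest bin to the oldest is the signal that Estimated Age is carrying real information.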
This is a bit of work, and probably the gain will be marginal a lot of the time. Data on real interactions that showed evidence of engagement would be superior to age-guessing, but when data is scarce a bit of added lift can’t hurt. If you’re concerned about introducing too much noise, then build models with and without Estimated Age, and evaluate them against each other. If your software offers multiple imputation for missing data as a feature, try checking that out … what I’m doing here is just a very manual form of single imputation — calculating one plausible value for each missing data point based on the values of other variables. Be careful, though: A good predictor of Age happens to be Lifetime Giving, and if your aim is to predict Giving, I should think there’s a risk your model will suffer from feedback.
* One final note …
Earlier on I mentioned assuming someone is male or female “just for the convenience of data mining.” In our databases (and in a conventional, everyday sense too), we group people in various ways — sex, race, creed. But these categories are truly imperfect summaries of reality. (Some more imperfect than others!) A lot of human diversity is not captured in data, including things we formerly thought of as clear-cut. Sex seems conveniently binary, but in reality it is multi-category, or maybe it’s a continuous variable. (Or maybe it’s too complex for a single variable.) In real life I don’t assume that when someone in the Registrar’s Office enters ‘N’ for Sex that the student’s data is merely missing. Because the N category is still such a small slice of the population I might treat it as missing, or reapportion it to either Male or Female as I do here. But that’s strictly for predictive modeling. It’s not a statement about transgendered or differently gendered people nor an opinion about where they “belong.”
Everyone’s talking about predictive analytics, Big Data, yadda yadda. The good news is, many institutions and organizations in our sector are indeed making investments in analytics and inching towards becoming data-driven. I have to wonder, though, how much of current investment is based on hype, and how much is going to fall away when data is no longer a hot thing.
Becoming a data-driven organization is a journey, not a destination. Forward progress is not inevitable, and it is possible for an office, a department or an institution to slip backward on the path, even when it seems they’ve “arrived”. In order for analytics to mature from a cutting-edge “nice-to-have” into a regular part of operations, the enterprise needs to be aware of its returns to the bottom line.
In my view, current investments in analytics are often done for reasons that are well-intentioned but vague: It seems to be the right thing to do these days … we see others doing it, so we feel we need to as well … we have an agenda for innovation and this fits the bill … and so on. I’m glad to see the investment, but not every promising innovation gets to stick around. Demonstrating ability to generate revenue — either through savings or through identifying new sources of revenue — will carry the day in the long run.
As I write this, I hear the jangle of railway bells at the level crossing in the early-morning dark outside my hotel room on the city’s downtown waterfront. I’m in Seattle today to attend the DRIVE 2013 conference, hosted by the University of Washington. I’ll be speaking on this topic — the “analytic” investment — later today. I have to admit to having struggled with making the session relevant for this group. For one, they don’t need convincing that making the investment is worth it. And second, if they think that I and my employer have figured out how to calculate the return on investment for analytics programs, they may be in for a disappointment. We have not.
In fact, when it comes right down to it, I like to spend my day working on cool things, interesting problems that face our department, and not so much on stuff that sounds like accounting (“ROI”). I’m betting many of the attendees of my session feel the same way. So I’ll be asking them to stop thinking about how they can get their managers, directors and vice presidents to understand the language of data and analytics. They’ll be far more successful if they try to speak the language their bosses respond to: Return on investment.
I may be a little short on answers for you, but I do have some pretty good questions.
It’s been a full year since I attended the first DRIVE Conference in Seattle, and I’m pleased to let you know (if you don’t already) that a second one is on the way. DRIVE 2013 takes place February 20-21 at the Bell Harbor International Conference Center in Seattle, Washington, and is hosted by the University of Washington. Registration is now open!
I’ll be making the trip to DRIVE 2013, and I think you should, too. I’m there to speak, but I expect to get a whole lot more out of it than I give.
DRIVE stands for those most awesome and beautiful words “Data, Reporting, Information and Visualization Exchange.” It’s a gathering-place for the growing community of non-profit IT/data people seeking to bring new ideas and efficient processes and systems to their organizations. Whether you’re just joining the non-profit ranks or you’ve been in the sector a while, this is the place to explore the latest ideas in analytics, modeling, data, reporting, information and visualization with people who are of like mind but come from all sorts of different backgrounds.
It’s this diversity that really injects value into the “exchange” part of DRIVE: You’ll meet some fascinating people who will help you see data-driven performance through a whole new lens.
Especially this year … wow. There are fundraisers and report-writers and data miners – all great. But a developmental molecular biologist? And a major-league baseball scout? Yes!
On top of that, there’s an opportunity to sign up for some on-the-spot mentoring (either as a mentor or mentee) which will allow you to have a focused conversation on a topic of interest that goes beyond the merely social aspect of a conference. Check that out on the conference website.
A few speaker highlights:
DR. JOHN J. MEDINA, a developmental molecular biologist, has a lifelong fascination with how the mind reacts to and organizes information. He is the author of the New York Times bestseller “Brain Rules: 12 Principles for Surviving and Thriving at Work, Home, and School” — a provocative book that takes on the way our schools and work environments are designed.
ASHUTOSH NANDESHWAR, Associate Director of Analytics at the University of Michigan, will talk about how we can tackle the three biggest problems in fundraising using data science.
KARL R. HABERL of Principal BI will present on the merits of powerful visualization. His presentation will introduce you to three innovative ‘compound charting techniques’ that provide new levels of insight to analysts and their audiences.
ANDREW PERCIVAL is an advance scout with Major League Baseball’s Seattle Mariners. In his presentation, Andrew will speak about the use of data in the game of baseball. Come hear how an MLB scout turns massive data sets into information that is used by coaches and front-office personnel.
Oh yeah – and me, and a whole lot more. For more information on the other speakers and topics lined up so far, visit the DRIVE 2013 website.