CoolData blog

1 April 2015

Mind the data science gap

Filed under: Training / Professional Development — kevinmacdonell @ 8:10 pm

 

Being a forward-thinking lot, the data-obsessed among us are always pondering the best next step to take in professional development. There are more options every day, from a Data Science track on Coursera to new masters degree programs in predictive analytics. I hear a lot of talk about acquiring skills in R, machine learning, and advanced modelling techniques.

 

All to the good, in general. What university or large non-profit wouldn’t benefit from having a highly-trained, triple-threat chameleon with statistics, programming, and data analytics skills? I think it’s great that people are investing serious time and brain cells pursuing their passion for data analysis.

 

And yet, one has to wonder, are these advanced courses and tools helping drive bottom-line results across the sector? Are they helping people at nonprofits and university advancement offices do a better job of analyzing their data toward some useful end?

 

I have a few doubts. The institutions and causes that employ these enterprising learners may be fortunate to have them, but I would worry about retention. Wouldn’t these rock stars eventually feel constrained in the nonprofit or higher ed world? It’s a great place to apply one’s creativity, but aren’t the problems and applications one can address with data in our field relatively straightforward in comparison with other fields? (Tailoring medical treatment to an individual’s DNA, preventing terrorism or bank fraud, getting an American president elected?) And then there’s the pay.

 

Maybe I’m wrong to think so. Clearly there are talented people working in our sector who are here because they have found the perfect combination of passions. They want to be here.

 

Anyway — rock star retention is not my biggest concern.

 

I’m more concerned about the rest of us: people who want to make better use of data, but who aren’t planning to learn far more than they need or are capable of. I’m concerned for a couple of reasons.

 

First, many of the professional development options available are pitched at a level too advanced to be practical for organizations that haven’t hired a full-time predictive analytics specialist. The majority of professionals working in the non-profit and higher-ed sectors are mainly interested in getting better at their jobs, whether that’s increasing dollars raised or boosting engagement among their communities. They don’t need to learn to code. They do need some basic, solid training options. I’m not sure these are easy to spot among all the competing offerings and (let’s be honest) the Big Data hype.

 

These people need support and appropriate training. There’s a place for scripting and machine learning, but let’s ensure we are already up to speed on means/medians, bar charts, basic scoring, correlation, and regression. Sexy? No. But useful, powerful, necessary. Relatively simple and manual techniques that are accessible to a range of advancement professionals — not just the highly technical — offer a high return on investment. It would be a shame if the majority were cowed into thinking that data analysis isn’t for them just because they don’t see what neural networks have to do with their day to day work.
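To make that concrete, here is a minimal sketch of the kind of basic, manual technique I mean: a simple additive score built from a few 0/1 engagement indicators, using Python and pandas. The column names are made up for illustration, not anyone’s actual data.

    import pandas as pd

    # Hypothetical constituent file with a few simple engagement indicators
    df = pd.DataFrame({
        "id": [101, 102, 103, 104, 105],
        "attended_event": [1, 0, 1, 0, 1],
        "has_email": [1, 1, 0, 1, 1],
        "is_volunteer": [0, 0, 1, 0, 1],
        "gave_last_year": [1, 0, 0, 1, 1],
    })

    # A basic additive score: one point per indicator; no black box required
    flags = ["attended_event", "has_email", "is_volunteer", "gave_last_year"]
    df["simple_score"] = df[flags].sum(axis=1)

    # Rank constituents from most to least engaged
    print(df.sort_values("simple_score", ascending=False))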

 

My second concern is that some of the advanced tools of data science are deceptively easy to use. I read an article recently that stated that when it’s done really well, data science looks easy. That’s a problem. A machine-learning algorithm will spit out answers, but are they worth anything? (Maybe.) Does an analyst learn anything about their data by tweaking the knobs on a black box? (Probably not.) Is skipping over the inconvenience of manual data exploration detrimental to gaining valuable insights? (Yes!)

 

Don’t get me wrong — I think R, Python, and other tools are extremely useful for predictive modelling, although not for doing the modelling itself (not in my hands, at least). I use SQL and Python to automate the assembly of large data files to feed into Data Desk. It’s so nice to push a button and have the script merge data from the database, our phonathon database, our broadcast email platform, and other sources; automatically create certain indicator variables; pivot all kinds of categorical variables; and handle missing data elegantly. Preparing this file using more manual methods would take days.
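For illustration only, here is a rough sketch of what such an assembly script might look like in Python with pandas. The file names and column names are placeholders I have invented, not the actual sources described above.

    import pandas as pd

    # Placeholder extracts from the various source systems
    alumni = pd.read_csv("alumni_extract.csv")          # one row per constituent ID
    phonathon = pd.read_csv("phonathon_extract.csv")    # call outcomes
    email = pd.read_csv("email_platform_extract.csv")   # opens, clicks, opt-outs

    # Merge everything onto the master list of IDs, keeping constituents
    # who have no phonathon or email history (left joins)
    data = (alumni
            .merge(phonathon, on="id", how="left")
            .merge(email, on="id", how="left"))

    # Create a 0/1 indicator variable from a raw field
    data["has_business_phone"] = data["business_phone"].notna().astype(int)

    # Pivot categorical variables into sets of 0/1 dummy columns
    data = pd.get_dummies(data, columns=["degree_level", "faculty"], dummy_na=True)

    # Handle missing numeric data explicitly instead of dropping rows
    data["email_opens"] = data["email_opens"].fillna(0)

    # Write the flat file that gets loaded into Data Desk
    data.to_csv("modelling_file.csv", index=False)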

 

But this doesn’t automate exploration of the data, it doesn’t remove the need to be careful about preparing data to answer the business question, and it does absolutely nothing to help define that business question. Rather than letting a script grind unsupervised through the data and spit out a result seconds later with no subject-matter expertise applied, I still do the real work of building a model manually, in Data Desk, and right now I doubt there is a better way.

 

When it comes to professional development, then, all I can say is, “to each their own.” There is no one best route. The important thing is to ensure that motivated professionals are matched to training that is a good fit with their aptitudes and with the real needs of the organization.

 

18 January 2015

Why blog? Six reasons and six cautions

Filed under: CoolData, Off on a tangent, Training / Professional Development — kevinmacdonell @ 4:12 pm

The two work-related but extracurricular activities I have found the most rewarding, personally and professionally, are giving conference presentations and writing for CoolData. I’ve already written about the benefits of presenting at conferences, explaining why the pain is totally worth it. Today: six reasons why you might want to try blogging, followed by six (optional) pieces of advice.

I’ve been blogging for just over five years, and I can say that the best way to start, and stay started, is to seek out motives that are selfish. The type of motivation I’m thinking of is intrinsic, such as personal satisfaction, as opposed to extrinsic, such as aiming to have a ton of followers and making money. It’s a good selfish.

Three early reasons for getting started with a blog are:

1. Documenting your work: One of my initial reasons for starting was to have a searchable place to keep snippets of knowledge — specific techniques for manipulating data in Excel, for example. I have found myself referring to older published pieces to remind me how I carried out an analysis, or when I need a block of SQL. A blog has the added benefit of being shareable, but if your purpose is personal documentation, it doesn’t matter if you have any audience at all.

2. Developing your thoughts: Few activities bring focus and clarity to your thoughts like writing about them. Some of my ideas on more abstract issues have been shaped and developed this way. Sometimes the office is not the best environment for this sort of reflective work. A blog can be a space for clarity. Again — no need for an audience.

3. Solidifying your learning: One of the best ways to learn something new is by teaching it to someone else. I may have had an uncertain grasp of multiple linear regression, for example, when I launched CoolData, but the exercise of trying to explain data mining concepts and techniques was a great way to get it all straight in my head. If I were to go back today and re-read some of my early posts on the subject, which I rarely do, I would find things I probably would disagree with. But the likelihood of being wrong is not a good enough reason to avoid putting your thoughts out there. Being naive and wrong about things is a stage of learning.

Let’s say that, motivated by these or other reasons, you’ve published a few posts. Suddenly you’ve got something to share with the world. Data analysis lends itself perfectly to discussion via blogs. Not only analysts and data miners, but programmers, prospect researchers, business analysts, and just about anyone engaged in knowledge work can benefit personally while enriching their profession by sharing their thoughts with their peers online.

As you slowly begin to pick up readers, new reasons for blogging will emerge. Three more reasons for blogging are:

4. Making professional connections: As a direct result of writing the blog I have met all kinds of interesting people in the university advancement, non-profit, and data analysis worlds. Many I’ve met only virtually, others I’ve been fortunate to meet in person. It wasn’t very long after I started blogging that people would approach me at conferences to say they had seen one of my posts. Some of them learned a bit from me, or more likely I learned from them. A few have even found time to contribute a guest post.

5. Sharing knowledge: This is the obvious one, so no need to say much more. Many advancement professionals share online already, via various listservs and discussion forums. The fact this sharing goes on all the time makes me wonder why more people don’t try to make their contributions go even farther by taking the extra step of developing them into blog posts that can be referred to anytime.

6. Building toward larger projects: If you keep at it, slowly but surely you will build up a considerable body of work. Blogging can feed into conference presentations, discussion papers, published articles, even a book.

Let me return to the distinction I made earlier between intrinsic and extrinsic motivators — the internal, more personal rewards of blogging versus the external, often monetary, goals some people have. As it happens, the personal reasons for blogging are realistic, with a high probability of success, while the loftier goals are likely to lead to premature disillusionment. A new blog with no audience is a fragile thing; best not burden it with goals you cannot hope to realize in the first few years.

I consider CoolData a success, but not by any external measure. I simply don’t know how many followers a blog about data analysis for higher education advancement ought to have, and I don’t worry about it. I don’t have goals for number of visitors or subscribers, or even number of books sold. (Get your copy of “Score!” here. … OK — couldn’t resist.)

The blog does what I want it to do.

That’s mostly what I have to say, really. I have a few bits of advice, but my strongest advice is to ignore what everybody else thinks you should do, including me. Most expert opinion on posting frequency, optimum length for posts, ideal days and times for publishing, click-bait headlines, search engine optimization and the like is a lot of hot air.

If you’re still with me, here are a few cautions and pieces of advice, take it or leave it:

1. On covering your butt: Some employers take a dim view of their employees publishing blogs and discussing work-related issues on social media. You might want to clear your activity with your supervisor first. When I changed jobs, I disclosed that I intended to keep up my blog. I explained that connecting with counterparts at other universities was a big part of my professional development. There’s never been an issue. Be clear that you’re writing for a small readership of professionals who share your interests, an activity not unlike giving a conference presentation. Any enlightened organization should embrace someone who takes the initiative. (You could blog secretly and anonymously, but what’s the point?)

2. On “permission”: Beyond ensuring that you are not jeopardizing your day job, you do not require anyone’s permission. You don’t have to be an expert; you simply have to be interested in your subject and enthusiastic about sharing your new knowledge with others. Beginners have an advantage over experts when it comes to blogging; an expert will often struggle to relate to beginners, and assume too much about what they know or don’t know. So what if that post from two years ago embarrasses you now? You can always just delete it. If you’re reticent about speaking up, remember that blogging is not about claiming to be an authority on anything. It’s about exploring and sharing. It’s about promoting helpful ideas and approaches. You can’t prevent small minds from interpreting your activity as self-promotion, so just keep writing. In the long run, it’s the people who never take the risk of putting themselves out there who pay the higher price.

3. On writing: The interwebs ooze with advice for writers so I won’t add to the noise. I’ll just say that, although writing well can help, you don’t need to be an exceptional stylist. I read a lot of informative yet sub-par prose every day. The misspellings, mangled English, and infelicities that would be show-stoppers if I were reading a novel just aren’t that important when I’m reading for information that will help me do my job.

4. On email: In the early days of email I thought it rude not to respond. Today things are different: It’s just too easy to bombard people. Don’t get me wrong: I have received many interesting questions from readers (some of which have led to new posts, which I love), as well as great opportunities to connect, participate in projects, and so on. But just because you make yourself available for interaction doesn’t mean you need to answer every email. You can lay out the ground rules on an “About” page. If someone can’t be bothered to consider your guidelines for contact, then an exchange with that person is not going to be worth the trouble. On my “About this Blog” page I make it clear that I don’t review books or software, yet the emails offering me free stuff for review keep coming. I have no problem deleting those emails unanswered. … Then there are emails that I fully intend to respond to, but don’t get the chance. Before long they are buried in my inbox and forgotten. I do regret that a little, but I don’t beat myself up over it. (However — I do hereby apologize.)

5. On protecting your time: Regardless of how large or small your audience, eventually people will ask you to do things. Sometimes this can lead to interesting partnerships that advance the interests of both parties, but choose wisely and say no often. Be especially wary of quid pro quo arrangements that involve free stuff. I rarely read newspaper travel writing because I know so much of it is bought and paid for by tour companies, hotels, restaurants and so on, without disclosure. However, I’m less concerned about high-minded integrity than I am about taking on extra burdens. I’m a busy guy, and also a lazy guy who jealously guards his free time, so I’m careful about being obliged to anyone, either contractually or morally. Make sure your agenda is set exclusively by whatever has your full enthusiasm. You want your blogging to be a free activity, where no one but you calls the shots.

6. On the peanut gallery: Keeping up a positive conversation with people who are receptive to your message is productive. Trying to convince skeptics and critics who are never going to agree with you is not. When you’re pushing back, you’re not pushing forward. Keep writing for yourself and the people who want to hear what you’ve got to say, and ignore the rest. This has nothing to do with being nice or avoiding conflict. I don’t care if you’re nice. It’s about applying your energies in a direction where they are likely to produce results. Focus on being positive and enabling others with solutions and knowledge, not on indulging in opinions, fruitless debates, and pointless persiflage among the trolls in the comments section. I haven’t always followed my own advice, but I try.

Some say “know your audience.” Actually, it’s more important to know yourself. Readers respond to your personality, and they can only get to know you if you are consistent. You can only be consistent if you are genuine. There are 7.125 billion people in the world, and almost half of them have an internet connection (and access to Google Translate). Some of those will become your readers — be true to them by being true to yourself. There is no need to waste your time chasing the crowd.

Your overarching goals are not to convince or convert or market, but to 1) fuel your own growth, and 2) connect with like-minded people. Growth and connection: That’s more than enough payoff for me.

7 January 2015

New finds in old models

 

When you build a predictive model, you can never be sure it’s any good until it’s too late. Deploying a mediocre model isn’t the worst mistake you can make, though. The worst mistake would be to build a second mediocre model because you haven’t learned anything from the failure of the first.

 

Performance against a holdout data set for validation is not a reliable indicator of actual performance after deployment. Validation may help you decide which of two or more competing models to use, or it may provide reassurance that your one model isn’t total junk. It’s not proof of anything, though. Those lovely predictors, highly correlated with the outcome, could be fooling you. There are no guarantees they’re predictive of results over the year to come.

 

In the end, the only real evidence of a model’s worth is how it performs on real results. The problem is, those results happen in the future. So what is one to do?

 

I’ve long been fascinated with Planned Giving likelihood. Making a bequest seems like the ultimate gesture of institutional affinity (ultimate in every sense). On the plus side, that kind of affinity ought to be clearly evidenced in behaviours such as event attendance, giving, volunteering and so on. On the negative side, Planned Giving interest is uncommon enough that comparing expectancies with non-expectancies will sometimes lead to false predictors based on sparse data. For this reason, my goal of building a reliable model for predicting Planned Giving likelihood has been elusive.

 

Given that a validation data set taken from the same time period as the training data can produce misleading correlations, I wondered whether I could do one better: That is, be able to draw my holdout sample not from data of the same time period as that used to build the model, but from the future.

 

As it turned out, yes, I could.

 

Every year I save my regression analyses as Data Desk files. Although I assess the performance of the output scores, I don’t often go back to the model files themselves. However, they’re there as a document of how I approached modelling problems in the past. As a side benefit, each file is also a snapshot of the alumni population at that point in time. These data sets may consist of a hundred or more candidate predictor variables — a well-rounded picture.

 

My thinking went like this: Every old model file represents data from the past. If I pretend that this snapshot is really the present, then in order to have access to knowledge of the future, all I have to do is look at today’s data stored in the database.

 

For example, for this blog post, I reached back two years to a model I created in Data Desk for predicting likelihood to upgrade to the Leadership level in Annual Giving. I wasn’t interested in the model itself. Rather, I wanted to examine the underlying variables I had to work with at the time. This model had been an ambitious undertaking, with some 170 variables prepared for analysis. Many of course were transformations of variables or combinations of interacting variables. Among all those variables was one indicating whether a case was a current Planned Giving expectancy or not, at that point in time.

 

In this snapshot of the database from two years ago, some of the cases that were not expectancies would have become so since then. In other words, I now had the best of both worlds. I had a comprehensive set of potential predictors as they existed two years ago, AND access to the hitherto unknowable future: The identities of the people who had become expectancies after the predictors had been frozen in time.

 

As I said, my old model was not intended to predict Planned Giving inclination. So I built a new model, using “Is an Expectancy” (0/1) as the target variable. I trained the regression model on the two-year-old expectancy data — I didn’t even look at the new expectancies while building the model. No: I used those new expectancies as my validation data set.

 

“Validation” might be too strong a word, given that there were only 80 or so new cases. That’s a lot of bequest intentions, for sure, but in terms of data it’s a drop in the bucket compared with the number of cases being scored. Let’s call it a test data set. I used this test set to help me analyze the model, in a couple of ways.

 

First I looked at how new expectancies were scored by the model I had just built. The chart below shows their distribution by score decile. Slightly more than 50% of new expectancies were in the top decile. This looks pretty good — keeping in mind that this is what actual performance would have looked like had I really built this model two years ago (which I could have):

 

 

[Chart: distribution of new expectancies by score decile]

(Even better, looking at percentiles, most of the expectancies in that top 10% are concentrated nicely in the top few percentiles.)
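For anyone who wants to run the same check on their own model output, a minimal sketch follows, with synthetic data standing in for the real scored file; the column names “score” and “new_expectancy” are mine, invented for the example.

    import numpy as np
    import pandas as pd

    # Synthetic stand-in for the scored file: one row per constituent, with a
    # model score plus a 0/1 flag for cases that became expectancies after
    # the snapshot date
    rng = np.random.default_rng(0)
    scored = pd.DataFrame({"score": rng.random(5000)})
    scored["new_expectancy"] = (rng.random(5000) < scored["score"] * 0.01).astype(int)

    # Assign score deciles (10 = the highest-scoring tenth)
    scored["decile"] = pd.qcut(scored["score"], 10, labels=list(range(1, 11)))

    # What share of the new expectancies fall into each decile?
    dist = (scored.loc[scored["new_expectancy"] == 1, "decile"]
            .value_counts(normalize=True)
            .sort_index())
    print(dist)  # ideally concentrated in deciles 9 and 10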

 

But I didn’t stop there. It is also evident that almost half of new expectancies fell outside the top 10 percent of scores, so clearly there was room for improvement. My next step was to examine the individual predictors I had used in the model. These were of course the predictors most highly correlated with being an expectancy. They were roughly the following:
  • Year person’s personal information in the database was last updated
  • Number of events attended
  • Age
  • Year of first gift
  • Number of alumni activities
  • Indicated “likely to donate” on 2009 alumni survey
  • Total giving in last five years (log transformed)
  • Combined length of name Prefix + Suffix

 

I ranked the correlation of each of these with the 0/1 indicator meaning “new expectancy,” and found that most of the predictors were still fine, although they changed their order in the rank correlation. Donor likelihood (from survey) and recent giving were more important, and alumni activities and how recently a person’s record was updated were less important.

 

This was interesting and useful, but what was even more useful was looking at the correlations between ALL potential predictors and the state of being a new expectancy. A number of predictors that would have been too far down the ranked list to consider using two years ago were suddenly looking much better. In particular, many variables related to participation in alumni surveys bubbled closer to the top as potentially significant.
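For readers not working in Data Desk, here is a sketch of that ranking step in Python. The variables are invented stand-ins for the real candidate predictors, and the synthetic outcome exists only so the code runs on its own.

    import numpy as np
    import pandas as pd

    # Invented stand-in for the old snapshot: candidate predictors (numeric or
    # 0/1 coded) plus a new 0/1 flag for constituents who became expectancies
    # after the snapshot was taken
    rng = np.random.default_rng(1)
    n = 2000
    snapshot = pd.DataFrame({
        "events_attended": rng.poisson(1.5, n),
        "survey_donor_likely": rng.integers(0, 2, n),
        "years_of_giving": rng.poisson(4, n),
        "record_recently_updated": rng.integers(0, 2, n),
    })
    prob = 0.01 + 0.02 * snapshot["survey_donor_likely"] + 0.002 * snapshot["years_of_giving"]
    snapshot["new_expectancy"] = (rng.random(n) < prob).astype(int)

    # Rank every candidate predictor by strength of correlation with the flag
    corrs = (snapshot.corr()["new_expectancy"]
             .drop("new_expectancy")
             .abs()
             .sort_values(ascending=False))
    print(corrs)  # variables near the top deserve a look in the next iteration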

 

This exercise suggests a way to proceed with iterative, yearly improvements to some of your standard models:
  • Dig up an old model from a year or more ago.
  • Query the database for new cases that represent the target variable, and merge them with the old datafile.
  • Assess how your model performed or, if you created more than one model, see which model would have performed best. (You should be doing this anyway.)
  • Go a layer deeper, by studying the variables that went into those models — the data “as it was” — to see which variables had correlations that tricked you into believing they were predictive, and which variables truly held predictive power but may have been overlooked.
  • Apply what you learn to the next iteration of the model. Leave out the variables with spurious correlations, and give special consideration to variables that may have been underestimated before.

13 November 2014

How to measure the rate of increasing giving for major donors

Filed under: John Sammis, Major Giving, Peter Wylie, RFM — kevinmacdonell @ 12:35 pm

Not long ago, this question came up on the Prospect-DMM list, generating some discussion: How do you measure the rate of increasing giving for donors, i.e. their “velocity”? Can this be used to find significant donors who are poised to give more? This question got Peter Wylie thinking, and he came up with a simple way to calculate an index that is a variation on the concept of “recency” — like the ‘R’ in an RFM score, only much better.

This index should let you see that two donors whose lifetime giving is the same can differ markedly in terms of the recency of their giving. That will help you decide how to go after donors who are really on a roll.

You can download a printer-friendly PDF of Peter’s discussion paper here: An Index of Increasing Giving for Major Donors
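Peter’s paper explains his index properly. Purely as an illustration of the general idea (this is my own simplified stand-in, not his formula), you could weight each gift by how recent it is and compare the weighted total with lifetime giving:

    import pandas as pd

    # Hypothetical giving history: one row per donor per fiscal year
    gifts = pd.DataFrame({
        "donor_id": [1, 1, 1, 2, 2, 2],
        "fiscal_year": [2010, 2012, 2014, 2010, 2011, 2012],
        "amount": [100, 500, 2000, 2000, 500, 100],
    })
    current_year = 2014

    # Weight each gift by recency, so recent years count for more. Both donors
    # above have the same lifetime total, but donor 1 is the one on a roll.
    gifts["weight"] = 1 / (current_year - gifts["fiscal_year"] + 1)
    summary = (gifts.assign(weighted=gifts["amount"] * gifts["weight"])
               .groupby("donor_id")[["amount", "weighted"]].sum())
    summary["velocity_index"] = summary["weighted"] / summary["amount"]
    print(summary.sort_values("velocity_index", ascending=False))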

 

6 October 2014

Don’t worry, just do it

People trying to learn how to do predictive modelling on the job often need only one thing to get them to the next stage: Some reassurance that what they are doing is valid.

Peter Wylie and I are each just back home, having presented at the fall conference of the Illinois chapter of the Association of Professional Researchers for Advancement (APRA-IL), hosted at Loyola University Chicago. (See photos, below!) Following an entertaining and fascinating look at the current and future state of predictive analytics presented by Josh Birkholz of Bentz Whaley Flessner, Peter and I gave a live demo of working with real data in Data Desk, with the assistance of Rush University Medical Center. We also drew names to give away a few copies of our book, Score! Data-Driven Success for Your Advancement Team.

We were impressed by the variety and quality of questions from attendees, in particular those having to do with stumbling blocks and barriers to progress. It was nice to be able to reassure people that when it comes to predictive modelling, some things aren’t worth worrying about.

Messy data, for example. Some databases, particularly those maintained by nonprofits outside higher ed, have data integrity issues such as duplicate records. It would be a shame, we said, if data analysis were pushed to the back burner just because of a lack of purity in the data. Yes, work on improving data integrity — but don’t assume that you cannot derive valuable insights right now from your messy data.

And then the practice of predictive modelling itself … Oh, there is so much advice out there on the net, some of it highly technical and involving a hundred different advanced techniques. Anyone trying to learn on their own can get stymied, endlessly questioning whether what they’re doing is okay.

For them, our advice was this: In our field, you create value by ranking constituents according to their likelihood to engage in a behaviour of interest (giving, usually), which guides the spending of scarce resources where they will do the most good. You can accomplish this without the use of complex algorithms or arcane math. In fact, simpler models are often better models.

The workhorse tool for this task is multiple linear regression. A very good stand-in for regression is building a simple score using the techniques outlined in Peter’s book, Data Mining for Fundraisers. Sticking to the basics will work very well. Fussing over technical issues and striving for a high degree of accuracy are distractions that the beginner need not be overly concerned with.
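As a minimal sketch of the workhorse approach (using Python and statsmodels here, though Data Desk or any stats package will do, and with made-up variable names), the fitted values from an ordinary multiple linear regression become the score used to rank constituents:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Made-up modelling file: a few simple predictors and an outcome
    rng = np.random.default_rng(2)
    n = 1000
    df = pd.DataFrame({
        "attended_event": rng.integers(0, 2, n).astype(float),
        "has_email": rng.integers(0, 2, n).astype(float),
        "years_since_grad": rng.integers(1, 40, n).astype(float),
    })
    df["giving_log"] = (0.8 * df["attended_event"] + 0.3 * df["has_email"]
                        + 0.02 * df["years_since_grad"] + rng.normal(0, 1, n))

    # Ordinary multiple linear regression of the outcome on the predictors
    X = sm.add_constant(df[["attended_event", "has_email", "years_since_grad"]])
    model = sm.OLS(df["giving_log"], X).fit()

    # The fitted values are the propensity scores used to rank constituents
    df["score"] = model.fittedvalues
    print(model.params)
    print(df.sort_values("score", ascending=False).head())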

If your shop’s current practice is to pick prospects or other targets by throwing darts, then even the crudest model will be an improvement. In many situations, simply performing better than random will be enough to create value. The bottom line: Just do it. Worry about perfection some other day.

If the decisions are high-stakes, if the model will be relied on to guide the deployment of scarce resources, then insert another step in the process. Go ahead and build the model, but don’t use it. Allow enough time of “business as usual” to elapse. Then, gather fresh examples of people who converted to donors, agreed to a bequest, or made a large gift — whatever the behaviour is you’ve tried to predict — and check their scores:

  • If the chart shows these new stars clustered toward the high end of scores, wonderful. You can go ahead and start using the model.
  • If the result is mixed and sort of random-looking, then examine where it failed. Reexamine each predictor you used in the model. Is the historical data in the predictor correlated with the new behaviour? If it isn’t, then the correlation you observed while building the model may have been spurious and led you astray, and should be excluded. As well, think hard about whether the outcome variable in your model is properly defined: That is, are you targeting for the right behaviour? If you are trying to find good prospects for Planned Giving, for example, your outcome variable should focus on that, and not lifetime giving.

“Don’t worry, just do it” sounds like motivational advice, but it’s more than that. The fact is, there is only so much model validation you can do at the time you create the model. Sure, you can hold out a generous number of cases as a validation sample to test your scores with. But experience will show you that your scores will always pass the validation test just fine — and yet the model may still be worthless.

A holdout sample of data that is contemporaneous with that used to train the model is not the same as real results in the future. A better way to go might be to just use all your data to train the model (no holdout sample), which will result in a better model anyway, especially if you’re trying to predict something relatively uncommon like Planned Giving potential. Then, sit tight and observe how it does in production, or how it would have done in production if it had been deployed.

  1. Observe, learn, tweak, and repeat. Errors are hard to avoid, but they can be discovered.
  2. Trust the process, but verify the results. What you’re doing is probably fine. If it isn’t, you’ll get a chance to find out.
  3. Don’t sweat the small stuff. Make a difference now by sticking to basics and thinking of the big picture. You can continue to delve and explore technical refinements and new methods, if that’s where your interest and aptitude take you. Data analysis and predictive modelling are huge subjects — start where you are, where you can make a difference.

* A heartfelt thank you to APRA-IL and all who made our visit such a pleasure, especially Sabine Schuller (The Rotary Foundation), Katie Ingrao and Viviana Ramirez (Rush University Medical Center), Leigh Peterson Visaya (Loyola University Chicago), Beth Witherspoon (Elmhurst College), and Rodney P. Young, Jr. (DePaul University), who took the photos you see below. (See also: APRA IL Fall Conference Datapalooza.)

[Photos from the APRA-IL fall conference]

22 September 2014

What predictor variables should you avoid? Depends on who you ask

People who build predictive models will tell you that there are certain variables you should avoid using as predictors. I am one of those people. However, we disagree on WHICH variables one should avoid, and increasingly this conflicting advice is confusing those trying to learn predictive modeling.

The differences involve two points in particular. Assuming charitable giving is the behaviour we’re modelling for, those two things are:

  1. Whether we should use past giving to predict future giving, and
  2. Whether attributes such as marital status are really predictors of giving.

I will offer my opinions on both points. Note that they are opinions, not definitive answers.

1. Past giving as a predictor

I have always stressed that if you are trying to predict “giving” using a multiple linear regression model, you must avoid using “giving” as a predictor among your independent variables. That includes anything that is a proxy for “giving,” such as attendance at a donor-thanking event. This is how I’ve been taught and that is what I’ve adhered to in practice.

Examples that violate this practice keep popping up, however. I have an email from Atsuko Umeki, IT Coordinator in the Development Office of the University of Victoria in Victoria, British Columbia*. She poses this question about a post I wrote in July 2013:

“In this post you said, ‘In predictive models, giving and variables related to the activity of giving are usually excluded as variables (if ‘giving’ is what we are trying to predict). Using any aspect of the target variable as an input is bad practice in predictive modelling and is carefully avoided.’  However, in many articles and classes I read and took I was advised or instructed to include past giving history such as RFA*, Average gift, Past 3 or 5 year total giving, last gift etc. Theoretically I understand what you say because past giving is related to the target variable (giving likelihood); therefore, it will be biased. But in practice most practitioners include past giving as variables and especially RFA seems to be a good variable to include.”

(* RFA is a variation of the more familiar RFM score, based on giving history — Recency, Frequency, and Monetary value.)

So modellers-in-training are being told to go ahead and use ‘giving’ to predict ‘giving’, but that’s not all: Certain analytics vendors also routinely include variables based on past giving as predictors of future giving. Not long ago I sat in on a webinar hosted by a consultant, which referenced the work of one well-known analytics vendor (no need to name the vendor here) in which it seemed that giving behaviour was present on both sides of the regression equation. Not surprisingly, this vendor “achieved” a fantastic R-squared value of 86%. (Fantastic as in “like a fantasy,” perhaps?)

This is not as arcane or technical as it sounds. When you use giving to predict giving, you are essentially saying, “The people who will make big gifts in the future are the ones who have made big gifts in the past.” This is actually true! The thing is, you don’t need a predictive model to produce such a prospect list; all you need is a list of your top donors.

Now, this might be reassuring to whoever is paying a vendor big bucks to create the model. That person sees names they recognize, and they think, ah, good — we are not too far off the mark. And if you’re trying to convince your boss of the value of predictive modelling, he or she might like to see the upper ranks filled with familiar names.

I don’t find any of that “reassuring.” I find it a waste of time and effort — a fancy and expensive way to produce a list of the usual suspects.

If you want to know who has given you a lot of money, you make a list of everyone in your database and sort it in descending order by total amount given. If you want to predict who in your database is most likely to give you a lot of money in the future, build a predictive model using predictors that are associated with having given large amounts of money. Here is the key point … if you include “predictors” that mean the same thing as “has given a lot of money,” then the result of your model is not going to look like a list of future givers — it’s going to look more like your historical list of past givers.

Does that mean you should ignore giving history? No! Ideally you’d like to identify the donors who have made four-figure gifts who really have the capacity and affinity to make six-figure gifts. You won’t find them using past giving as a predictor, because your model will be blinded by the stars. The variables that represent giving history will cause all other affinity-related variables to pale in comparison. Many will be rejected from the model for being not significant or for adding nothing additional to the model’s ability to explain the variance in the outcome variable.

To sum up, here are the two big problems with using past giving to predict future giving:

  1. The resulting insights are sensible but not very interesting: People who gave before tend to give again. Or, stated another way: “Donors will be donors.” Fundraisers don’t need data scientists to tell them that.
  2. Giving-related independent variables will be so highly correlated with giving-related dependent variables that they will eclipse more subtle affinity-related variables. Weaker predictors will end up getting kicked out of our regression analysis because they can’t move the needle on R-squared, or because they don’t register as significant. Yet, it’s these weaker variables that we need in order to identify new prospects.

Let’s try a thought experiment. What if I told you that I had a secret predictor that, once introduced into a regression analysis, could explain 100% of the variance in the dependent variable ‘Lifetime Giving’? That’s right — the highest value for R-squared possible, all with a single predictor. Would you pay me a lot of money for that? What is this magic variable that perfectly models the variance in ‘Lifetime Giving’? Why, it is none other than ‘Lifetime Giving’ itself! Any variable is perfectly correlated with itself, so why look any farther?

This is an extreme example. In a real predictive model, a predictor based on giving history would be restricted to giving from the past, while the outcome variable would be calculated from a more recent period — the last year or whatever. There should be no overlap. R-squared would not be 100%, but it would be very high.
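You can see the effect with a few lines of synthetic data (the numbers are invented; only the pattern matters): when past giving makes up most of the outcome, R-squared looks spectacular while telling you nothing new.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 5000

    # Synthetic past giving, plus an affinity signal that drives future giving
    past_giving = rng.exponential(100, n)
    affinity = rng.normal(0, 1, n)
    future_giving = 0.5 * past_giving + 40 * affinity + rng.normal(0, 40, n)
    lifetime_giving = past_giving + future_giving  # the outcome being "predicted"

    # Model 1: "predict" lifetime giving from past giving (leakage)
    m1 = sm.OLS(lifetime_giving, sm.add_constant(past_giving)).fit()

    # Model 2: predict it from the affinity signal alone
    m2 = sm.OLS(lifetime_giving, sm.add_constant(affinity)).fit()

    print(f"R-squared with past giving as a predictor: {m1.rsquared:.2f}")  # very high
    print(f"R-squared with the affinity signal only:   {m2.rsquared:.2f}")  # modest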

The R-squared statistic is useful for guiding you as you add variables to a regression analysis, or for comparing similar models in terms of fit with the data. It is not terribly useful for deciding whether any one model is good or bad. A model with an R-squared of 15% may be highly valuable, while one with R-squared of 75% may be garbage. If a vendor is trying to sell you on a model they built based on a high R-squared alone, they are misleading you.

The goal of predictive modeling for major gifts is not to maximize R-squared. It’s to identify new prospects.

2. Using “attributes” as predictors

Another thing about that webinar bugged me. The same vendor advised us to “select variables with caution, avoiding ‘descriptors’ and focusing on potential predictors.” Specifically, we were warned that a marital status of ‘married’ will emerge as correlated with giving. Don’t be fooled! That’s not a predictor, they said.

So let me get this straight. We carry out an analysis that reveals that married people are more likely to give large gifts, that donors with more than one degree are more likely to give large gifts, that donors who have email addresses and business phone numbers in the database are more likely to give large gifts … but we are supposed to ignore all that?

The problem might not be the use of “descriptors”; the problem might be with the terminology. Maybe we need to stop using the word “predictor”. One experienced practitioner, Alexander Oftelie, briefly touched on this nuance in a recent blog post. I quote (emphasis added by me):

“Data that on its own may seem unimportant — the channel someone donates, declining to receive the mug or calendar, preferring email to direct mail, or making ‘white mail’ or unsolicited gifts beyond their sustaining-gift donation — can be very powerful when they are brought together to paint a picture of engagement and interaction. Knowing who someone is isn’t by itself predictive (at best it may be correlated). Knowing how constituents choose to engage or not engage with your organization are the most powerful ingredients we have, and its already in our own garden.”

I don’t intend to critique Alexander’s post, which isn’t even on this particular topic. (It’s a good one — please read it.) But since he’s written this, permit me to scratch my head about it a bit.

In fact, I think I agree with him that there is a distinction between a behaviour and a descriptor/attribute. A behaviour, an action taken at a specific point in time (e.g., attending an event), can be classified as a predictor. An attribute (“who someone is,” e.g., whether they are married or single) is better described as a correlate. I would also be willing to bet that if we carefully compared behavioural variables to attribute variables, the behaviours would outperform, as Alexander says.

In practice, however, we don’t need to make that distinction. If we are using regression to build our models, we are concerned solely and completely with correlation. To say “at best it may be correlated” suggests that predictive modellers have something better at their disposal that they should be using instead of correlation. What is it? I don’t know, and Alexander doesn’t say.

If in a given data set, we can demonstrate that being married is associated with likelihood to make a donation, then it only makes sense to use that variable in our model. Choosing to exclude it based on our assumption that it’s an attribute and not a behaviour doesn’t make business sense. We are looking for practical results, after all, not chasing some notion of purity. And let’s not fool ourselves, or clients, that we are getting down to causation. We aren’t.

Consider that at least some “attributes” can be stated in terms of a behaviour. People get married — that’s a behaviour, although not related to our institution. People get married and also tell us about it (or allow it to be public knowledge so that we can record it) — that’s also a behaviour, and potentially an interaction with us. And on the other side of the coin, behaviours or interactions can be stated as attributes — a person can be an event attendee, a donor, a taker of surveys.

If my analysis informs me that widowed female alumni over the age of 60 are extremely good candidates for a conversation about Planned Giving, then are you really going to tell me I’m wrong to act on that information, just because sex, age and being widowed are not “behaviours” that a person voluntarily carries out? Mmmm — sorry!

Call it quibbling over semantics if you like, but don’t assume it’s so easy to draw a circle around true predictors. There is only one way to surface predictors: take a snapshot of all potentially relevant variables at a point in time, then gather data on the outcome you wish to predict (e.g., giving) after that point in time, and then assess each variable in terms of the strength of its association with that outcome. The tools we use to make that assessment are nothing other than correlation and significance. Again, if there are other tools in common usage, then I don’t know about them.
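In code, that assessment step amounts to nothing more exotic than this. It is a sketch with invented variables and a synthetic outcome; scipy’s pearsonr supplies both the correlation and its significance.

    import numpy as np
    import pandas as pd
    from scipy.stats import pearsonr

    # Invented stand-in for a snapshot of candidate variables taken at time T,
    # plus an outcome (0/1 gave after time T) gathered later
    rng = np.random.default_rng(4)
    n = 3000
    snap = pd.DataFrame({
        "is_married": rng.integers(0, 2, n),
        "num_degrees": rng.integers(1, 4, n),
        "event_attendee": rng.integers(0, 2, n),
        "birth_month": rng.integers(1, 13, n),
    })
    prob = 0.05 + 0.04 * snap["event_attendee"] + 0.02 * snap["is_married"]
    snap["gave_after_T"] = (rng.random(n) < prob).astype(int)

    # Assess each variable: strength of association and statistical significance
    for col in snap.columns.drop("gave_after_T"):
        r, pval = pearsonr(snap[col], snap["gave_after_T"])
        print(f"{col:15s}  r = {r:+.3f}   p = {pval:.4f}")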

Caveats and concessions

I don’t maintain that this or that practice is “wrong” in all cases, nor do I insist on rules that apply universally. There’s a lot of art in this science, after all.

Using giving history as a predictor:

  • One may use some aspects of giving to predict outcomes that are not precisely the same as ‘Giving’, for example, likelihood to enter into a Planned Giving arrangement. The required degree of difference between predictors and outcome is a matter of judgement. I usually err on the side of scrupulously avoiding ANY leakage of the outcome side of the equation into the predictor side — but sure, rules can be bent.
  • I’ve explored the use of very early giving (the existence and size of gifts made by donors before age 30) to predict significant giving late in life. (See Mine your donor data with this baseball-inspired analysis.) But even then, I don’t use that as a variable in a model; it’s more of a flag used to help select prospects, in addition to modeling.

Using descriptors/attributes as predictors:

  • Some variables of this sort will appear to have subtly predictive effects in-model, effects that disappear when the model is deployed and new data starts coming in. That’s regrettable, but it’s something you can learn from — not a reason to toss all such variables into the trash, untested. The association between marital status and giving might be just a spurious correlation — or it might not be.
  • Business knowledge mixed with common sense will help keep you out of trouble. A bit of reflection should lead you to consider using ‘Married’ or ‘Number of Degrees’, while ignoring ‘Birth Month’ or ‘Eye Colour’. (Or astrological sign!)

There are many approaches one can take with predictive modeling, and naturally one may feel that one’s chosen method is “best”. The only sure way to proceed is to take the time to define exactly what you want to predict, try more than one approach, and then evaluate the performance of the scores when you have actual results available — which could be a year after deployment. We can listen to what experts are telling us, but it’s more important to listen to what the data is telling us.

//////////

Note: When I originally posted this, I referred to Atsuko Umeki as “he”. I apologize for this careless error and for whatever erroneous assumption that must have prompted it.
