CoolData blog

22 September 2014

What predictor variables should you avoid? Depends on who you ask

People who build predictive models will tell you that there are certain variables you should avoid using as predictors. I am one of those people. However, we disagree on WHICH variables one should avoid, and increasingly this conflicting advice is confusing those trying to learn predictive modeling.

The differences involve two points in particular. Assuming charitable giving is the behaviour we’re modelling for, those two things are:

  1. Whether we should use past giving to predict future giving, and
  2. Whether attributes such as marital status are really predictors of giving.

I will offer my opinions on both points. Note that they are opinions, not definitive answers.

1. Past giving as a predictor

I have always stressed that if you are trying to predict “giving” using a multiple linear regression model, you must avoid using “giving” as a predictor among your independent variables. That includes anything that is a proxy for “giving,” such as attendance at a donor-thanking event. This is how I’ve been taught and that is what I’ve adhered to in practice.

Examples that violate this practice keep popping up, however. I have an email from Atsuko Umeki, IT Coordinator in the Development Office of the University of Victoria in Victoria, British Columbia*. She poses this question about a post I wrote in July 2013:

“In this post you said, ‘In predictive models, giving and variables related to the activity of giving are usually excluded as variables (if ‘giving’ is what we are trying to predict). Using any aspect of the target variable as an input is bad practice in predictive modelling and is carefully avoided.’  However, in many articles and classes I read and took I was advised or instructed to include past giving history such as RFA*, Average gift, Past 3 or 5 year total giving, last gift etc. Theoretically I understand what you say because past giving is related to the target variable (giving likelihood); therefore, it will be biased. But in practice most practitioners include past giving as variables and especially RFA seems to be a good variable to include.”

(* RFA is a variation of the more familiar RFM score, based on giving history — Recency, Frequency, and Monetary value.)

So modellers-in-training are being told to go ahead and use ‘giving’ to predict ‘giving’, but that’s not all: Certain analytics vendors also routinely include variables based on past giving as predictors of future giving. Not long ago I sat in on a webinar hosted by a consultant, which referenced the work of one well-known analytics vendor (no need to name the vendor here) in which it seemed that giving behaviour was present on both sides of the regression equation. Not surprisingly, this vendor “achieved” a fantastic R-squared value of 86%. (Fantastic as in “like a fantasy,” perhaps?)

This is not as arcane or technical as it sounds. When you use giving to predict giving, you are essentially saying, “The people who will make big gifts in the future are the ones who have made big gifts in the past.” This is actually true! The thing is, you don’t need a predictive model to produce such a prospect list; all you need is a list of your top donors.

Now, this might be reassuring to whomever is paying a vendor big bucks to create the model. That person sees names they recognize, and they think, ah, good — we are not too far off the mark. And if you’re trying to convince your boss of the value of predictive modelling, he or she might like to see the upper ranks filled with familiar names.

I don’t find any of that “reassuring.” I find it a waste of time and effort — a fancy and expensive way to produce a list of the usual suspects.

If you want to know who has given you a lot of money, you make a list of everyone in your database and sort it in descending order by total amount given. If you want to predict who in your database is most likely to give you a lot of money in the future, build a predictive model using predictors that are associated with having given large amounts of money. Here is the key point … if you include “predictors” that mean the same thing as “has given a lot of money,” then the result of your model is not going to look like a list of future givers — it’s going to look more like your historical list of past givers.

Does that mean you should ignore giving history? No! Ideally you’d like to identify the donors who have made four-figure gifts who really have the capacity and affinity to make six-figure gifts. You won’t find them using past giving as a predictor, because your model will be blinded by the stars. The variables that represent giving history will cause all other affinity-related variables to pale in comparison. Many will be rejected from the model for being not significant or for adding nothing additional to the model’s ability to explain the variance in the outcome variable.

To sum up, here are the two big problems with using past giving to predict future giving:

  1. The resulting insights are sensible but not very interesting: People who gave before tend to give again. Or, stated another way: “Donors will be donors.” Fundraisers don’t need data scientists to tell them that.
  2. Giving-related independent variables will be so highly correlated with giving-related dependent variables that they will eclipse more subtle affinity-related variables. Weaker predictors will end up getting kicked out of our regression analysis because they can’t move the needle on R-squared, or because they don’t register as significant. Yet, it’s these weaker variables that we need in order to identify new prospects.

Let’s try a thought experiment. What if I told you that I had a secret predictor that, once introduced into a regression analysis, could explain 100% of the variance in the dependent variable ‘Lifetime Giving’? That’s right — the highest value for R-squared possible, all with a single predictor. Would you pay me a lot of money for that? What is this magic variable that perfectly models the variance in ‘Lifetime Giving’? Why, it is none other than ‘Lifetime Giving’ itself! Any variable is perfectly correlated with itself, so why look any farther?

This is an extreme example. In a real predictive model, a predictor based on giving history would be restricted to giving from the past, while the outcome variable would be calculated from a more recent period — the last year or whatever. There should be no overlap. R-squared would not be 100%, but it would be very high.

The R-squared statistic is useful for guiding you as you add variables to a regression analysis, or for comparing similar models in terms of fit with the data. It is not terribly useful for deciding whether any one model is good or bad. A model with an R-squared of 15% may be highly valuable, while one with R-squared of 75% may be garbage. If a vendor is trying to sell you on a model they built based on a high R-squared alone, they are misleading you.

The goal of predictive modeling for major gifts is not to maximize R-squared. It’s to identify new prospects.

2. Using “attributes” as predictors

Another thing about that webinar bugged me. The same vendor advised us to “select variables with caution, avoiding ‘descriptors’ and focusing on potential predictors.” Specifically, we were warned that a marital status of ‘married’ will emerge as correlated with giving. Don’t be fooled! That’s not a predictor, they said.

So let me get this straight. We carry out an analysis that reveals that married people are more likely to give large gifts, that donors with more than one degree are more likely to give large gifts, that donors who have email addresses and business phone numbers in the database are more likely to give large gifts … but we are supposed to ignore all that?

The problem might not be the use of “descriptors,” the problem might be with the terminology. Maybe we need to stop using the word “predictor”. One experienced practitioner, Alexander Oftelie, briefly touched on this nuance in a recent blog post. I quote, (emphasis added by me):

“Data that on its own may seem unimportant — the channel someone donates, declining to receive the mug or calendar, preferring email to direct mail, or making ‘white mail’ or unsolicited gifts beyond their sustaining-gift donation — can be very powerful when they are brought together to paint a picture of engagement and interaction. Knowing who someone is isn’t by itself predictive (at best it may be correlated). Knowing how constituents choose to engage or not engage with your organization are the most powerful ingredients we have, and its already in our own garden.”

I don’t intend to critique Alexander’s post, which isn’t even on this particular topic. (It’s a good one – please read it.) But since he’s written this, permit me scratch my head about it a bit.

In fact, I think I agree with him that there is a distinction between a behaviour and a descriptor/attribute. A behaviour, an action taken at a specific point in time (eg., attending an event), can be classified as a predictor. An attribute (“who someone is,” eg., whether they are married or single) is better described as a correlate. I would also be willing to bet that if we carefully compared behavioural variables to attribute variables, the behaviours would outperform, as Alexander says.

In practice, however, we don’t need to make that distinction. If we are using regression to build our models, we are concerned solely and completely with correlation. To say “at best it may be correlated” suggests that predictive modellers have something better at their disposal that they should be using instead of correlation. What is it? I don’t know, and Alexander doesn’t say.

If in a given data set, we can demonstrate that being married is associated with likelihood to make a donation, then it only makes sense to use that variable in our model. Choosing to exclude it based on our assumption that it’s an attribute and not a behaviour doesn’t make business sense. We are looking for practical results, after all, not chasing some notion of purity. And let’s not fool ourselves, or clients, that we are getting down to causation. We aren’t.

Consider that at least some “attributes” can be stated in terms of a behaviour. People get married — that’s a behaviour, although not related to our institution. People get married and also tell us about it (or allow it to be public knowledge so that we can record it) — that’s also a behaviour, and potentially an interaction with us. And on the other side of the coin, behaviours or interactions can be stated as attributes — a person can be an event attendee, a donor, a taker of surveys.

If my analysis informs me that widowed female alumni over the age of 60 are extremely good candidates for a conversation about Planned Giving, then are you really going to tell me I’m wrong to act on that information, just because sex, age and being widowed are not “behaviours” that a person voluntarily carries out? Mmmm — sorry!

Call it quibbling over semantics if you like, but don’t assume it’s so easy to draw a circle around true predictors. There is only one way to surface predictors, which is to take a snapshot of all potentially relevant variables at a point in time, then gather data on the outcome you wish to predict (eg., giving) after that point in time, and then assess each variable in terms of the strength of association with that outcome. The tools we use to make that assessment are nothing other than correlation and significance. Again, if there are other tools in common usage, then I don’t know about them.

Caveats and concessions

I don’t maintain that this or that practice is “wrong” in all cases, nor do I insist on rules that apply universally. There’s a lot of art in this science, after all.

Using giving history as a predictor:

  • One may use some aspects of giving to predict outcomes that are not precisely the same as ‘Giving’, for example, likelihood to enter into a Planned Giving arrangement. The required degree of difference between predictors and outcome is a matter of judgement. I usually err on the side of scrupulously avoiding ANY leakage of the outcome side of the equation into the predictor side — but sure, rules can be bent.
  • I’ve explored the use of very early giving (the existence and size of gifts made by donors before age 30) to predict significant giving late in life. (See Mine your donor data with this baseball-inspired analysis.) But even then, I don’t use that as a variable in a model; it’s more of a flag used to help select prospects, in addition to modeling.

Using descriptors/attributes as predictors:

  • Some variables of this sort will appear to have subtly predictive effects in-model, effects that disappear when the model is deployed and new data starts coming in. That’s regrettable, but it’s something you can learn from — not a reason to toss all such variables into the trash, untested. The association between marital status and giving might be just a spurious correlation — or it might not be.
  • Business knowledge mixed with common sense will help keep you out of trouble. A bit of reflection should lead you to consider using ‘Married’ or ‘Number of Degrees’, while ignoring ‘Birth Month’ or ‘Eye Colour’. (Or astrological sign!)

There are many approaches one can take with predictive modeling, and naturally one may feel that one’s chosen method is “best”. The only sure way to proceed is to take the time to define exactly what you want to predict, try more than one approach, and then evaluate the performance of the scores when you have actual results available — which could be a year after deployment. We can listen to what experts are telling us, but it’s more important to listen to what the data is telling us.

//////////

Note: When I originally posted this, I referred to Atsuko Umeki as “he”. I apologize for this careless error and for whatever erroneous assumption that must have prompted it.

25 August 2014

Your nonprofit’s real ice bucket challenge

It was only a matter of time. Over the weekend, a longtime friend dumped a bucket of ice water over his head and posted the video to Facebook. He challenged three friends — me included — to take the Ice Bucket Challenge in support of ALS research. I passed on the cold shower, but this morning I did make a gift to ALS Canada, a cause I wouldn’t have supported had it not been for my friend Paul and the brilliant campaign he participated in.*

Universities and other charities are, of course, watching closely and asking themselves how they can replicate this phenomenon. Fine … I am skeptical that central planning and a modest budget can give birth to such a massive juggernaut of socially-responsible contagion … but I wish them luck.

While we can admire our colleagues’ amazing work and good fortune, I am not sure we should envy them. In the coming year, ALS charities will be facing a huge donor-retention issue. Imagine gaining between 1.5 and 2 million new donors in the span of a few months. Now, I have no knowledge of what ALS fundraisers really intend to do with their hordes of newly-acquired donors. Maybe retention is not a goal. But it is a sure thing that the world will move on to some other craze. Retaining a tiny fraction of these donors could make the difference between the ice bucket challenge being just a one-time, non-repeatable anomaly and turning it into a foundation for long-term support that permanently changes the game for ALS research.

Perhaps the ice bucket challenge can be turned into an annual event that becomes as established as the walks, runs and other participatory events that other medical-research charities have. Who knows.

For certain is that the majority of new donors will not give again. Also for certain is that it would be irresponsibly wasteful for charities to spread their retention budget equally over all new donors.

Which brings me to predictive modeling. Some portion of new donors WILL give again. Maybe something about the challenge touched them more deeply than the temporary fun of the ice bucket dare. Maybe they learned something about the disease. Maybe they know someone affected by ALS. There is no direct way to know. But I would be willing to bet that higher levels of engagement can be found in patterns in the data.

What factors might be predictors of longer-term engagement? It is not possible to say without some analysis, but sources of information might include:

  • How the donor arrived at the site prior to making a gift (following a link from another website, following a link via a social media platform, using a search engine).
  • How the donor became aware of the challenge (this is a question on some giving pages).
  • Whether they consented to future communications: Mail, email, or both.
  • Whether the donor continued on to a page on the website beyond the thank you page. (Did they start watching an ALS-related video and if so, how much of it did they watch?)
  • Whether the donor clicked on social media button to share the news of their gift, and where they shared it.

Shreds of ambiguous clues scattered here and there, admittedly, but that is what a good predictive model detects and amplifies. If it were up to me, I would also have asked on the giving page whether the donor had done the ice bucket thing. A year from now, my friend Paul is going to clearly remember the shock of pouring ice water over his head, plus the positive response he got on Facebook, and this will bring to mind his gift and the need to give again. My choosing not to do so might be associated with a lower level of commitment, and thus a lower likelihood of renewing. Just a theory.**

Data-informed segmentation aimed at getting a second gift from newly-acquired donors is not quite as sexy as being an internet meme. However, unlike riding the uncontrollable wave of a social media sensation, retention is something that charities might actually be able to plan for.

* I would like to see this phenomenon raise all boats for medical charities, therefore I also gave to Doctors Without Borders Canada and the Molly Appeal for Medical Research. Check them out.

** Update: I am told that actually, this question IS asked. I didn’t see it on the Canadian site, but maybe I just missed it. Great!

POSTSCRIPT

I was quoted on this topic in a story in the September 4th online edition of the Chronicle of Philanthropy. Link (subscribers only): After Windfall, ALS Group Grapples With 2.4-Million Donor Dilemma

19 August 2014

Score! … As pictured by you

Filed under: Book, Peter Wylie, Score! — Tags: , , — kevinmacdonell @ 7:25 pm
2014-07-18 06.41.41

Left to right: Elisa Shoenberger, Leigh Petersen Visaya, Rebekah O’Brien, and Alison Rane in Chicago. (Click for full size.)

During the long stretch of time that Peter Wylie and I were writing our book, Score! Data-Driven Success for Your Advancement Team, there were days when I thought that even if we managed to get the thing done, it might not be that great. There were just so many pieces that needed to fit together somehow … I guess we each didn’t want to let the other down, so we plugged on despite doubts and delays, and then, somehow, it got finished.

Whew, I thought. Washed my hands of that! I expected I would walk away from it,  move on to other projects, and be glad that I had my early mornings and weekends back.

That’s not what happened.

These few months later, my eye will still be caught now and then by the striking, colourful cover of the book sitting on my desk. It draws me to pick it up and flip through it — even re-read bits. I find myself thinking, “Hey, I like this.”

Of course, who cares, right? I am not the reader. However, whatever I might think about Score!, it has been even more gratifying for Peter and I to hear from folks who seem to like it as much as we do. How fun it has been to see that bright cover popping up in photos and on social media every once in a while.

I’ve collected a few of those photos and tweets here, along with some other images related to the book. Feel free to post your own “Score selfies” on Twitter using the hashtag #scorethebook. Or if you’re not into Twitter, send me a photo at kevin.macdonell@gmail.com.

Click here to order your copy of Score! from the CASE Bookstore.

2014-09-23 11.56.58

2014-08-14 06.34.25

jen

Jennifer Cunningham, Senior Director, Metrics+Marketing for the Office of Alumni Affairs, Cornell University. @jenlynham

Click here to order your copy of Score! from the CASE Bookstore.

While we would like for you to buy it, we would LOVE for you to read it and put it to work in your shop. Your buying it earns us each enough money to buy a cup of coffee. Your READING it furthers the reach and impact of ideas and concepts that fascinate us and which we love to share.

20 July 2014

Eliminate your duplicate data row problems with simple SQL

Filed under: Data, SQL — Tags: , , , , , , , — kevinmacdonell @ 1:38 pm

We’ve all faced this problem: Duplicate rows in our data. It messes up our reports and causes general confusion and consternation for people we supply data to. The good news is you can take care of this problem fairly easily in an SQL query using a handy function called LISTAGG. I discovered this almost by accident just this week, and I’m delighted to share.

Duplicate rows are a result of what we choose to include in our database query. If we are expecting to get only one row for each constituent ID, then we should be safe in asking for Gender, or Name Prefix, or Preferred Address. There should be only one of each of these things for each ID, and therefore there won’t be duplicate rows. These are examples of one-to-one relationships.

We get into trouble when we ask for something that can have multiple values for a single ID. That’s a one-to-many relationship. Examples: Any address (not just Preferred), Degree (some alumni will have more than one), Category Code (a person can be an alum and a parent and a staff member, all at the same time) … and so on. Below, the first constituent is duplicated on Category, and the second constituent is duplicated on Degree.

dup1

That’s not the worst of it; when one ID is duplicated on two or more elements, things get really messy. If A0001 above had three Category codes and two degrees, it would be appear six times (3 x 2) in the result set. Notice that we are really talking about duplicate IDs — not duplicate rows. Each row is, in fact, unique. This is perfectly correct behaviour on the part of the database and query. Try explaining that to your boss, however! You’ve got to do something to clean this up.

What isn’t going to work is pivoting. In SQL, the PIVOT function creates a new column for each unique category of some data element, plus an aggregation of the values for that category. I’ve written about this here: Really swell SQL: Why you must know PIVOT and WITH. Using PIVOT is going to give you something like the result below. (I’ve put in 1’s and nulls to indicate the presence or absence of data.)

dup2

 

What we really want is THIS:

dup3

The answer is this: String aggregation. You may already be familiar with aggregating quantitative values, usually by counting or summing. When we query on a donor’s giving, we usually don’t want a row for each gift — we want the sum of all giving. Easy. For strings — categorical data such as Degree or Category Code — it’s a different situation. Sure, we can aggregate by using counting. For example, I can ask for a count of Degree, and that will limit the data to one row per ID. But for most reporting scenarios, we want to display the data itself. We want to know what the degrees are, not just how many of them there are.

The following simplified SQL will get you the result you see above for ID and Category — one row per ID, and a series of category codes separated by commas. This code works in Oracle 11g — I can’t speak to other systems’ implementations of SQL. (For other Oracle-based methods, check out this page. To accomplish the same thing in a non-Oracle variant of SQL called T-SQL, scroll down to the Postscript below.)

Obviously you will need to replace the sample text with your own schema name, table names, and field identifiers.

SELECT
SCHEMA.ENTITY.ID,
LISTAGG ( SCHEMA.ENTITY.CATG_CODE, ', ' )
WITHIN GROUP ( ORDER BY SCHEMA.ENTITY.CATG_CODE ) AS CATEGORY

FROM SCHEMA.ENTITY

GROUP BY SCHEMA.ENTITY.ID

The two arguments given to LISTAGG are the field CATG_CODE and a string consisting of a comma and a space. The string, which can be anything you want, is inserted between each Category Code. On the next line, ORDER BY sorts the category codes in alphabetical order.

LISTAGG accepts only two arguments, but if you want to include two or more fields and/or strings in the aggregation, you can just concatenate them inside LISTAGG using “pipe” characters (||). The following is an example using Degree Code and Degree Code Description, with a space between them. To get even fancier, I have replaced the comma separator with the character code for “new line,” which puts each degree on its own line — handy for readability in a list or report. (This works fine when used in Custom SQL in Tableau. For display in Excel, you may have to use a carriage return — char(13) — instead of a line feed.) You will also notice that I have specified a join to a table containing the Degree data.

SELECT
SCHEMA.ENTITY.ID,
LISTAGG ( SCHEMA.DEGREE.DEGC_CODE || ' ' || SCHEMA.DEGREE.DEGC_DESC, chr(10) )
WITHIN GROUP ( ORDER BY SCHEMA.DEGREE.DEGC_CODE, SCHEMA.DEGREE.DEGC_DESC ) AS DEGREES

FROM SCHEMA.ENTITY

INNER JOIN 
SCHEMA.DEGREE 
ON ( SCHEMA.ENTITY.ID = SCHEMA.DEGREE.ID )

GROUP BY SCHEMA.ENTITY.ID

dup_final

Before I discovered this capability, the only other automated way I knew how to handle it was in Python. All the previous methods I’ve used were at least partially manual or just very cumbersome. I don’t know how long string aggregation in Oracle has been around, but I’m grateful that a data problem I have struggled with for years has finally been dispensed with. Hurrah!

~~~~~~

POSTSCRIPT

After publishing this I had an email response from David Logan, Director of Philanthropic Analytics at Children’s Mercy in Kansas City, Missouri. He uses TSQL, a proprietary procedural language used by Microsoft in SQL Server. Some readers might find this helpful, so I will quote from his email:

I was reading your post this morning about using the LISTAGG function in Oracle SQL to concatenate values from duplicate rows into a list. A couple of months ago I was grappling with this very problem so that I could create a query of Constituents that included a list of all their phone numbers in a single field. LISTAGG is not available for those of us using T-SQL … the solution for T-SQL is to use XML PATH().

Here’s a link to a source that demonstrates how to use it: http://sqlandme.com/2011/04/27/tsql-concatenate-rows-using-for-xml-path/.

One key difference in this method from LISTAGG is that the delimiter character (such as the comma—or characters as in your example where you have comma space) is added before each value returned so the raw string returned will have this character before the first value. You can use the STUFF() function to eliminate this first delimiter.

Here’s my solution using XML PATH() to return a list of phone numbers from our donor database:

SELECT
CA.ID,
STUFF
(
  (
  SELECT
   '; ' + PH.NUM + ' [' + re7_ReportingTools.dbo.GetLongDescription(PH.PHONETYPEID) + ']'
  FROM
   re7_db.dbo.CONSTIT_ADDRESS_PHONES CAP
    INNER JOIN
     re7_db.dbo.PHONES PH
     ON CAP.PHONESID=PH.PHONESID
  WHERE
     CAP.CONSTITADDRESSID=CA.ID
    AND
     PH.DO_NOT_CALL=0
  ORDER BY CAP.SEQUENCE
  FOR XML PATH('')
  ),1,2,''
) PHONENUMS
FROM
re7_db.dbo.CONSTIT_ADDRESS CA
 
GROUP BY CA.ID

I concatenated my list using a semi-colon followed by a space ‘; ‘. I used the STUFF() function to remove the leading semi-colon and space.

Here are a few sample rows that are returned (note I’ve “greeked” the prefixes for donor privacy):

table

7 July 2014

Mine your donor data with this baseball-inspired analysis

I’ve got baseball analytics on my mind. I don’t know if it’s because of the onset of July or because of a recent mention of CoolData on Nate Silver’s FiveThirtyEight blog, but I have been deeply absorbed in an analysis of donor giving behaviours inspired by Silver’s book, “The Signal and the Noise.” It might give you some ideas for things to try with your own database.

Back in 2003, Silver designed a system to predict the performance of Major League Baseball players. The system, called PECOTA, attempts to understand how a player’s performance evolves as he ages. As Silver writes in his book, its forecasts were probabilistic, offering a range of possible outcomes for each player. From the previous work of others, Silver was aware that hitters reach their peak performance at age 27, on average. Page 81 of his book shows the “aging curve” for major league hitters, a parabola starting at age 18, arcing smoothly upwards through the 20s, peaking at 27, and then just as smoothly arcing downwards to age 39.

My immediate thought on reading about this was, what about donors? Can we visualize the trajectory of various types of donors (major donors, bequest donors, leadership annual fund donors) from their first ten bucks right after graduating, then on into their peak earning years? What would we learn by doing that?

In baseball, the aging curve presents a problem for teams acquiring players with proven track records. By the time they become free agents, their peak years will have passed. However, if the early exploits of a young prospect cause him to resemble one of the greats from the past, perhaps he is worth investing in. The curve, as Silver notes, is only an average. Some players will peak far earlier, and some far later, than the average age of 27. There are different types of players, and difference types of curves, and every individual career is different. But lurking in all that noise, there is a signal.

Silver’s PECOTA system takes things further than that — I will return to that later — but now I want to turn to how we can visualize a sort of aging curve for our donors.

What’s the payoff? Well, to cut to the chase: It appears that some donors who go on to give in six figures (lifetime total) can be distinguished from the majority of lower-level donors at a very early age. Above-average giving ($200 or $250, say) in any one year during one’s late 20s or early 30s is a predictor of very high lifetime value. I have found that when big donors have started their giving young, they have started big. That is, “big” in relation to their similarly-aged peers – not at thousands of dollars, but at $100 or $500, depending on how old they were at the time.

Call it “precocious giving”.

Granted, it sounds a bit like plain common sense. But being analytical about it opens up the possibility of identifying a donor with high lifetime value when they’re still in their late 30s or early 40s. You can judge for yourself, but the idea appealed to me.

I’m getting ahead of myself. At the start of this, I was only interested in getting the data prepared and plotting the curve to see what it would look like. To the extent that it resembled a curve at all, I figured that “peak age” — the counterpart to baseball player performance — would be the precise age at which a donor gave the most in any given year while they were alive.

~~~~~~~

I wrote a query to pull all donors from the database (persons only), with a row for each year of giving, summing on total giving in that year. Along with Year of Gift, I pulled down Year of Birth for each donor — excluding anyone for whom we had no birthdate. I included only amounts given while the donor was living; bequests were excluded.

The next step was to calculate the age the donor was at the time of the gift. I added a column to the data, defined as the Year of Gift minus Year of Birth. That gave me a close-enough figure for age at time of giving.

As I worked on the analysis, I kept going back to the query to add things I needed, such as certain donor attributes that I wanted to examine. Here are most of the variables I ended up pulling from the database for each unique combination of Donor ID and Age at Time of Gift:

  • ID
  • Age at Time of Gift (Year of Gift minus Year of Birth)
  • Sum of Giving (total giving for that donor, at that age)
  • Donor Category Code (Alum, Friend, etc.)
  • Total Lifetime Giving (for each donor, without regard to age)
  • Deceased Indicator (are they living or dead as of today)
  • Current Age (if living, how old are they right now)
  • Year of Birth
  • Year of Gift

The result was a data set with more than 200,000 rows. Notice, of course, that a donor ID can appear on multiple rows — one for each value of Age at Gift. The key thing to remember is that I didn’t care what year giving occurred, I only wanted to know how old someone was when they gave. So in my results, a donor who gave in 1963 when she was 42 is much the same as a donor who gave in 2013 he was the same age.

pecota1

Now it was time to visualize this data, and for that I used Tableau. I connected directly to the database and loaded the data into Tableau using custom SQL. ‘Age at Gift’ is numerical, so Tableau automatically listed that variable in the Measures panel. For this analysis, I wanted to treat it as a category instead, so I dragged it into the Dimensions panel. (If you’re not familiar with Tableau, don’t worry about these application-specific steps — once you get the general idea, you can replicate this using your tool of choice.)

The first (and easiest) thing to visualize was simply the number of donors at each age. Click on the image below to see a full-size version. Every part of the shape of this curve says something interesting, I think, but the one thing I have annotated here is the age at which the largest number of people chose to make a gift.

pecota2

This chart lumps in everyone — alumni and non-alumni, living donors and deceased donors — so I wanted to go a little deeper. I would expect to see a difference between alumni and non-alumni, for example, so I put all degree and non-degree alumni into one category (Alumni), and all other donor constituents into another (Non-alumni). The curve does not change dramatically, but we can see that the number of non-alumni donors peaks later than the number of alumni donors.

pecota3

There are a number of reasons for analyzing alumni and non-alumni separately, so from this point on, I decided to exclude non-alumni.

The fact that 46 seems to be an important age is interesting, but this probably says as much about the age composition of our alumni and our fundraising effort over the years as it does about donor behaviour. To get a sense of how this might be true, I divided all alumni donors into quartiles (four bins containing roughly equal numbers of alumni), by Birth Year. Alumni donors broke down this way:

  1. Born 1873 to 1944: 8,101 donors
  2. Born 1945 to 1955: 8,036 donors
  3. Born 1956 to 1966: 8,614 donors
  4. Born 1967 to 1991: 8,172 donors

Clearly these are very different cohorts! The donors in the middle two quartiles were born in a span of only a decade each, while the span of the youngest quartile is 24 years, and the span of the oldest quartile is 71 years! When I charted each age group separately, they split into distinct phases. (Reminder: click on the image for a full-size version.)

pecota4

This chart highlights a significant problem with visualizing the life cycle of donors: Many of the donors in the data aren’t finished their giving careers yet. When Nate Silver talks about the aging curves of baseball players, he means players whose career is behind them. How else to see their rise, peak, and eventual decline? According to the chart above, the youngest quartile peaks (in terms of number of donors) at age 26. However, most of these donors are still alive and have many years of giving ahead of them. We will turn to them to identify up-and-coming donors, but as long as we are trying to map out what a lifetime of giving looks like, we need to focus on the oldest donors.

An additional problem is that our donor database doesn’t go back as far as baseball stats do. Sure, we’ve got people in the database who were born more than 140 years ago, but our giving records are very sparse for years before the early 1970s. If a donor was very mature at that time, his apparent lack of giving history might cause us to make erroneous observations.

I decided to limit the data set to donors born between 1920 and 1944. This excludes the following donors who are likely to have incomplete giving histories:

  • Anyone who was older than 50 in 1970, when giving records really started to get tracked, and
  • Anyone who is currently younger than 70, and may have many years of giving left.

This is a bit arbitrary, but reasonable. It trims off the donors who could never have had a chance to have a lifetime of giving recorded in the data, without unduly reducing the size of my data set. I was left with only 20% of my original data, but still, that’s more than 6,000 individuals. I could have gotten fussier with this, removing anyone who died at a relatively young age, but I figured the data was good enough to provide some insights.

The dramatic difference made by this trimming is evident in the following two charts. Both charts show a line for the number of donors by age at time of gift, for each of three lifetime giving levels: Under $1,000 in blue, $1,000 to $10,000 in orange, and over $10,000 in purple. What this means is that all the donors represented by the purple line (for example) gave at least $10,000 cumulatively over the course of their lifetime.

The first chart is based on ALL the data, before trimming according to birth year. The second chart is based on the 6,000 or so records I was left with after trimming. The first chart seems to offer an interesting insight: The higher the lifetime value of donors, the later in life they tend to show up in great numbers. But of course this just isn’t true. Although the number of donors with lower lifetime giving peaks at earlier ages, that’s only because that whole group of donors is younger: They’re not done giving yet. (I have added ‘Median Current Age’ to the high point of each curve to illustrate this.) Remember, this chart includes everyone — it’s the “untrimmed” data:

pecota5
Contrast that three-phase chart with this next one, based on “trimmed” data. The curves are more aligned, presumably because we are now looking at a better-defined cohort of donors (those born 1920 to 1944). The oldest donor is 24 years older than the youngest donor, but that’s okay: The most important concern is having good data for that range of ages. Because the tops of these curves are flatter, I have annotated more points, for the sake of interest.

pecota6

These curves are pretty, but they aren’t analogous to “performance curves” for baseball players — we haven’t yet looked at how MUCH donors give, on average, at each age. However, what general observations can we make from the last chart? Some that come to my mind:

  • Regardless of what a donor finally ends up giving lifetime, there are always a few (a very few) who start giving while they are in their 20s, and a few who are still around to give when they are in their late 80s and early 90s.
  • The number of donors starts to really take off at around age 40, and there is steady growth until about age 50, when the growth in number of donors begins to slow or plateau.
  • Donors start to drop out rapidly at around age 70. This is due to mortality of course, but probably the steepness of the drop is exaggerated by my trimming of the data at the older end.

~~~~~~~

Here is where things really get interesting. The whole point of this exercise was to see if we can spot the telltale signs of a future major donor while they are still relatively young, just as a baseball scout looks for young prospects who haven’t peaked yet. Do donors signal unusual generosity even when they are still in their 20s and 30s? Let’s have a look.

I zoomed in on a very small part of the chart, to show giving activity up until age 35. Are there differences between the various levels of donors? You bet there are.

As soon as a high-lifetime-value donor starts to give, the gifts are higher, relative to same-age peers who will end up giving less. The number of donors at these early ages is miniscule, so take this with a grain of salt, but a trend seems unmistakable: Up to the age of 30, donors who will end up giving in five figures and higher give about 2.5 to 3.5 times as much per year as other donors their age who end up giving $1,000 to $10,000 lifetime. AND, they give FIVE TIMES as much per year as other donors their age who end up giving less than $1,000 lifetime.

pecota7

Later on, at ages 35 and 40, donors who will finish their giving careers at the high end are giving two to three times as much per year as donors in the middle range, and 5.6 to 7 times per year (respectively) as donors who will finish on the lowest end.

It might be less confusing to chart each group of donors by average giving per year, rather than by number of donors. This chart shows average giving per year up until age 65. Naturally, the averages get very spiky, as donors start making large gifts.

 

pecota8

To temper the effect of extreme values, I log-transformed the giving amounts. This made it easier to visualize how these three tiers of donors differ from each other over a lifetime of giving:

pecota9

What do I see from this? These are generalizations based on averages, but potentially useful generalizations:

  • Upper-end donors start strong relative to other donors, accelerate giving after age 40, and continue to increase giving throughout their lifetimes.
  • Middle- and low-range donors start lower. They also increase their yearly giving until their late 40s, but after that, they plateau and stay at the same level for the rest of their lives.

~~~~~~~

What’s the bottom line here? I think it’s this: Hundreds of donors were well on their way to being exceptional by the tender age of 40, and a few were signaling long before that.

Information like this would be interesting to Annual Fund as they work to identify prospects for leadership-level giving. But $10,000 in a lifetime is a little too low to make the Major Gifts folks take notice. Can we carve out the really big donors from the $10K-plus crowd? And can we also identify them before they hit 40? Have a look at this chart. For this one, I removed all the donors who gave less than $10,000 lifetime, and then I divided the high-end donors into those who gave less than $100,000 lifetime (green line) and those who gave more than $100,000 (red line).

pecota10

 

The lines get a bit jagged, but it looks to me like the six-figure lifetime donors pull away from the five-figure donors while still in their 40s. And notice as well that they increase their giving after age 65, which is very unusual behaviour: By 65, the vast majority of donors have either long plateaued or are starting to wind down. (You can’t see this in the chart, but that post-65 group of very generous donors numbers about 50 individuals, with yearly average giving ranging around $25,000 to $50,000.)

When I drill down, I can see about a hundred donors sitting along the red line between the ages of 30 and 45, whom we might have identified as exceptional, had we known what to look for.

With the benefit of hindsight, we are now able to look at current donors who were born more recently (after 1969, say), and identify who’s sending out early signals. I have those charts, but I think you’ve seen enough, and as I have said many times in the past: My data is not your data. So while I can propose the following “rules” for identifying an up-and-comer, I don’t recommend you try applying them to your own situation without running your own analysis:

  • Gave more than $200 in one year, starting around age 28.
  • Gave more than $250 in one year, starting around age 29.
  • Gave more than $500 in one year, starting around age 32.

Does this mean I think we can ask a 32-year-old for $10,000 this year? No. It means that this 32-year-old is someone to watch out for and to keep engaged as an alum. It’s the donors over 50 or so who have exhibited these telltale patterns in their early giving that might belong in a major gift prospect portfolio.

Precocious giving certainly isn’t the only indicator of a good prospect, but along with a few other unusual traits, it is a good start. (See: Odd but true findings? Upgrading annual donors are “erratic” and “volatile”.)

~~~~~~~

Where do you go from here? That is completely up to you. I am still in the process of figuring out how to best use these insights.

Coming up with some rules of thumb, as above, is one way to proceed. Another is rolling up all of a donor’s early giving into a single score — a Precocity Score — that takes into account both how much a donor gave, and how young she was when she gave it. I experimented with a formula that gave progressively higher weights to the number of dollars given for younger ages. For example, $100 given at age 26 might be worth several times more than $200 given at age 44.

Using my data set of donors with a full life cycle of giving, I tested whether this score was predictive of lifetime value. It certainly was. However, I also found that a simple cumulative sum of a donor’s giving up to age 35 or 40 was equally as predictive. There didn’t seem to be any additional benefit to giving extra weight to very early giving.

I am shying away from using giving history as an input in a predictive model. I see people do this all the time, but I have always avoided the practice. My preference is to use some version of the rules above as just one more tool to use in prospect identification, distinct from other efforts such as predictive modelling.

~~~~~~~

That’s as far as I have gotten. If this discussion has given you some ideas to explore, then wonderful. I doubt I’m breaking new ground here, so if you’ve already analyzed giving-by-age, I’d be interested in hearing how you’ve applied what you’ve learned.

Incidentally, Nate Silver went on to produce “similarity scores” for pairs of hitters. Using baseball’s rich trove of data, he compared players using a nearest-neighbour analysis, which took into account a wide range of data points, from player height and weight to all the game stats that baseball is famous for. A young prospect in the minor leagues with a score that indicates a high degree of similarity with a known star might be expected to “age” in a similar way. That was the theory, anyway.

One can imagine how this might translate to the fundraising arena. If you identified groups of your best donors, with a high degree of similarity among the members of each group, you could then identify younger donors with characteristics that are similar to the members of each group. After all, major gift donors are not all alike, so why not try to fit multiple “types”?

I would guess that the relatively small size of each group would cause any signal to get drowned out in the noise. I am a little skeptical that we can parse things that finely. It would, however, be an interesting project.

A final note. The PECOTA system had some successes and for a time was an improvement on existing predictive tools. Over time, however, pure statistics were not a match for the combination of quantitative methods and the experience and knowledge of talent scouts. In the same way, identifying the best prospects for fundraising relies on the combined wisdom of data analysts, researchers and fundraisers themselves.

25 June 2014

How our sector is getting its butt kicked by just about everyone

Filed under: Analytics, Data, Off on a tangent, skeptics — kevinmacdonell @ 8:24 pm

There isn’t a lot to do at my wife’s family summer cottage when it rains, especially if I’ve forgotten to bring a book. I find myself scanning the shelves for something — anything — to read. On one such recent rainy weekend, I picked up a book my niece had left on a table. It was a heavy hardcover textbook, and it contained a mild surprise.

What I found was an introduction to such topics as liner and non-linear relationships, probability, scatterplots, best-fit lines, and correlation — concepts that I’ve come to have a deep interest in, mainly because I have profitably put them to work in the service of fundraising and alumni engagement.

Was this a college textbook? A manual for budding data scientists?

No, not at all. My niece is in Grade 9, and this was her mathematics textbook.

I don’t know if the Nova Scotia math curriculum is typical, nor am I qualified to judge the quality of a textbook. And my niece may not be thrilled about learning statistics. But some group of experts in math education apparently believe these concepts are well within the grasp of young Nova Scotian minds. Power to them.

What does this have to do with you? Yes, plastic young minds may grasp with relative ease what we oldsters struggle with (new languages, for example), but we have one distinct advantage. Where adolescents view these concepts as abstractions without a purpose, we may immediately see how we can use them to advance our causes, and our careers.

Yet, we all know otherwise intelligent people in our field whose eyes glaze over when they see a chart or hear anything that sounds like math — even Grade 9 math. Somehow, we must be failing to demonstrate the connection between analytics and success in fundraising and alumni engagement.

So in what fields is analytics really taking root? Well, every field. Including farmers’ fields.

Food production has been a focus of science and statistics for many decades. But today it’s not confined to experimental farms or the labs of agribusiness companies. Real, honest-to-goodness farmers are enthusiastic quants compared to most of us working in the nonprofit sector.

tweetJennifer Cunningham @jenlynham is Senior Director, Metrics and Marketing in the Office of Alumni Affairs at Cornell University. In a recent email to me and my “Score!” co-writer Peter Wylie she writes: “Just gave a talk today at the National Agricultural Alumni and Development Association (NAADA) conference. Went on a hayride this afternoon with the group here at Penn State. The farmers here are using data like you wouldn’t believe. Guys have been farming for 30+ years and they’re going on and on about the importance of measuring input vs output … it’s so interesting to hear these old-school guys go on about the importance of it in their worlds. And yet, some people in our industry, raising billions, still don’t get it?!?!”

It’s a fair observation.

I have a lot of time for people who are not enamoured with analytics due to an unfamiliarity with working with numbers. They require explanations and justifications for using analytical methods. That’s fine. I myself didn’t see math as having much to do with my working life until I entered my mid-thirties, and sometimes I still think the right story beats numbers.

But like our friend Jennifer, I feel less sympathy for ignorance when it’s a deliberate choice. There’s a line where lack of interest in data equates to wilful illiteracy. Someday soon, being on the wrong side of that line is going to disqualify a person from working for important causes.

Older Posts »

The Silver is the New Black Theme. Create a free website or blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 1,068 other followers