CoolData blog

22 September 2014

What predictor variables should you avoid? Depends on who you ask

People who build predictive models will tell you that there are certain variables you should avoid using as predictors. I am one of those people. However, we disagree on WHICH variables one should avoid, and increasingly this conflicting advice is confusing those trying to learn predictive modeling.

The differences involve two points in particular. Assuming charitable giving is the behaviour we’re modelling for, those two points are:

  1. Whether we should use past giving to predict future giving, and
  2. Whether attributes such as marital status are really predictors of giving.

I will offer my opinions on both points. Note that they are opinions, not definitive answers.

1. Past giving as a predictor

I have always stressed that if you are trying to predict “giving” using a multiple linear regression model, you must avoid using “giving” as a predictor among your independent variables. That includes anything that is a proxy for “giving,” such as attendance at a donor-thanking event. This is how I’ve been taught and that is what I’ve adhered to in practice.

Examples that violate this practice keep popping up, however. I have an email from Atsuko Umeki, IT Coordinator in the Development Office of the University of Victoria in Victoria, British Columbia*. She poses this question about a post I wrote in July 2013:

“In this post you said, ‘In predictive models, giving and variables related to the activity of giving are usually excluded as variables (if ‘giving’ is what we are trying to predict). Using any aspect of the target variable as an input is bad practice in predictive modelling and is carefully avoided.’  However, in many articles and classes I read and took I was advised or instructed to include past giving history such as RFA*, Average gift, Past 3 or 5 year total giving, last gift etc. Theoretically I understand what you say because past giving is related to the target variable (giving likelihood); therefore, it will be biased. But in practice most practitioners include past giving as variables and especially RFA seems to be a good variable to include.”

(* RFA is a variation of the more familiar RFM score, based on giving history — Recency, Frequency, and Monetary value.)
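
For anyone who has never computed one, here is a minimal sketch in Python of how an RFM-style score might be assembled from raw gift transactions. The toy data, the column names, and the 1-to-5 rank-based scoring are my own assumptions for illustration, not a standard recipe.

```python
import numpy as np
import pandas as pd

# Made-up gift transactions: one row per gift (hypothetical columns).
gifts = pd.DataFrame({
    "donor_id":  [1, 1, 2, 3, 3, 3, 4, 5],
    "gift_date": pd.to_datetime(["2012-05-01", "2014-03-15", "2010-11-20",
                                 "2013-01-05", "2013-12-31", "2014-06-30",
                                 "2011-02-14", "2014-08-01"]),
    "amount":    [100, 250, 50, 500, 1000, 750, 25, 300],
})
as_of = pd.Timestamp("2014-09-01")

rfm = gifts.groupby("donor_id").agg(
    days_since_last=("gift_date", lambda d: (as_of - d.max()).days),  # Recency
    gift_count=("gift_date", "count"),                                # Frequency
    total_given=("amount", "sum"),                                    # Monetary
)

def score(s, ascending=True):
    # Rank-based 1-to-5 score; 5 = most favourable.
    return np.ceil(s.rank(method="first", ascending=ascending) / len(s) * 5).astype(int)

rfm["R"] = score(rfm["days_since_last"], ascending=False)  # more recent = higher score
rfm["F"] = score(rfm["gift_count"])
rfm["M"] = score(rfm["total_given"])
rfm["RFM"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
print(rfm)
```

Notice that every input is giving history. That is exactly why the score is so useful for segmenting donors, and exactly why it is suspect as a predictor of giving.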

So modellers-in-training are being told to go ahead and use ‘giving’ to predict ‘giving’, but that’s not all: Certain analytics vendors also routinely include variables based on past giving as predictors of future giving. Not long ago I sat in on a webinar hosted by a consultant that referenced the work of one well-known analytics vendor (no need to name the vendor here); it seemed that giving behaviour was present on both sides of the regression equation. Not surprisingly, this vendor “achieved” a fantastic R-squared value of 86%. (Fantastic as in “like a fantasy,” perhaps?)

This is not as arcane or technical as it sounds. When you use giving to predict giving, you are essentially saying, “The people who will make big gifts in the future are the ones who have made big gifts in the past.” This is actually true! The thing is, you don’t need a predictive model to produce such a prospect list; all you need is a list of your top donors.

Now, this might be reassuring to whoever is paying a vendor big bucks to create the model. That person sees names they recognize, and they think, ah, good — we are not too far off the mark. And if you’re trying to convince your boss of the value of predictive modelling, he or she might like to see the upper ranks filled with familiar names.

I don’t find any of that “reassuring.” I find it a waste of time and effort — a fancy and expensive way to produce a list of the usual suspects.

If you want to know who has given you a lot of money, you make a list of everyone in your database and sort it in descending order by total amount given. If you want to predict who in your database is most likely to give you a lot of money in the future, build a predictive model using predictors that are associated with having given large amounts of money. Here is the key point … if you include “predictors” that mean the same thing as “has given a lot of money,” then the result of your model is not going to look like a list of future givers — it’s going to look more like your historical list of past givers.
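
To make the contrast concrete, here’s a minimal sketch with made-up data: the “usual suspects” list is nothing more than a sort, while a model built to find new prospects leaves giving-based inputs out. The variable names and effect sizes are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
donors = pd.DataFrame({
    "lifetime_giving":    rng.gamma(2.0, 500.0, n),   # past giving (the leaky variable)
    "event_attendance":   rng.poisson(1.5, n),        # affinity signal
    "has_business_phone": rng.integers(0, 2, n),      # affinity signal
})
# Future big gifts depend on affinity and, of course, on past giving.
logit = (0.002 * donors["lifetime_giving"] + 0.6 * donors["event_attendance"]
         + 0.8 * donors["has_business_phone"] - 4.0)
donors["big_future_gift"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# The "usual suspects" list needs no model at all, just a sort:
usual_suspects = donors.sort_values("lifetime_giving", ascending=False).head(100)

# A model meant to surface NEW prospects leaves giving-based inputs out:
X = donors[["event_attendance", "has_business_phone"]]
model = LogisticRegression().fit(X, donors["big_future_gift"])
donors["score"] = model.predict_proba(X)[:, 1]
```

Rank on `score` and you get a prospect list driven by affinity, not a rearranged donor honour roll.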

Does that mean you should ignore giving history? No! Ideally you’d like to identify the donors who have made four-figure gifts but who really have the capacity and affinity to make six-figure gifts. You won’t find them using past giving as a predictor, because your model will be blinded by the stars. The variables that represent giving history will cause all other affinity-related variables to pale in comparison. Many will be rejected from the model for not being significant, or for adding nothing to the model’s ability to explain the variance in the outcome variable.

To sum up, here are the two big problems with using past giving to predict future giving:

  1. The resulting insights are sensible but not very interesting: People who gave before tend to give again. Or, stated another way: “Donors will be donors.” Fundraisers don’t need data scientists to tell them that.
  2. Giving-related independent variables will be so highly correlated with giving-related dependent variables that they will eclipse more subtle affinity-related variables. Weaker predictors will end up getting kicked out of our regression analysis because they can’t move the needle on R-squared, or because they don’t register as significant. Yet it’s these weaker variables that we need in order to identify new prospects. (The sketch just below illustrates the effect.)
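
Here is a small simulation of problem 2. Past giving and event attendance are both generated from the same underlying affinity, with past giving the sharper measure of it; once past giving enters the regression, attendance typically adds little and loses significance, even though it is genuinely predictive on its own. All names and numbers are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
affinity = rng.normal(size=n)                            # unobserved driver
past_giving = affinity + 0.3 * rng.normal(size=n)        # sharp proxy for affinity
event_attendance = affinity + 1.0 * rng.normal(size=n)   # noisier proxy
future_giving = affinity + rng.normal(size=n)            # the outcome

X = pd.DataFrame({"past_giving": past_giving,
                  "event_attendance": event_attendance})

with_leak = sm.OLS(future_giving, sm.add_constant(X)).fit()
without_leak = sm.OLS(future_giving,
                      sm.add_constant(X[["event_attendance"]])).fit()

# With past giving in the model, attendance typically looks negligible;
# on its own, it is strongly significant.
print("p-value, with past giving:   ", with_leak.pvalues["event_attendance"])
print("p-value, without past giving:", without_leak.pvalues["event_attendance"])
```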

Let’s try a thought experiment. What if I told you that I had a secret predictor that, once introduced into a regression analysis, could explain 100% of the variance in the dependent variable ‘Lifetime Giving’? That’s right — the highest value for R-squared possible, all with a single predictor. Would you pay me a lot of money for that? What is this magic variable that perfectly models the variance in ‘Lifetime Giving’? Why, it is none other than ‘Lifetime Giving’ itself! Any variable is perfectly correlated with itself, so why look any farther?

This is an extreme example. In a real predictive model, a predictor based on giving history would be restricted to giving from the past, while the outcome variable would be calculated from a more recent period — the last year or whatever. There should be no overlap. R-squared would not be 100%, but it would be very high.
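
In code, that separation is nothing more than a date cutoff. A minimal sketch, assuming a hypothetical gifts.csv with donor_id, gift_date, and amount columns:

```python
import pandas as pd

gifts = pd.read_csv("gifts.csv", parse_dates=["gift_date"])  # hypothetical file
cutoff = pd.Timestamp("2013-09-01")

# Predictors may be built only from activity BEFORE the cutoff ...
history = gifts[gifts["gift_date"] < cutoff]
past_giving = history.groupby("donor_id")["amount"].sum()

# ... while the outcome is measured strictly AFTER it. No overlap.
recent = gifts[gifts["gift_date"] >= cutoff]
outcome = recent.groupby("donor_id")["amount"].sum()
```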

The R-squared statistic is useful for guiding you as you add variables to a regression analysis, or for comparing similar models in terms of fit with the data. It is not terribly useful for deciding whether any one model is good or bad. A model with an R-squared of 15% may be highly valuable, while one with R-squared of 75% may be garbage. If a vendor is trying to sell you on a model they built based on a high R-squared alone, they are misleading you.

The goal of predictive modeling for major gifts is not to maximize R-squared. It’s to identify new prospects.

2. Using “attributes” as predictors

Another thing about that webinar bugged me. The same vendor advised us to “select variables with caution, avoiding ‘descriptors’ and focusing on potential predictors.” Specifically, we were warned that a marital status of ‘married’ will emerge as correlated with giving. Don’t be fooled! That’s not a predictor, they said.

So let me get this straight. We carry out an analysis that reveals that married people are more likely to give large gifts, that donors with more than one degree are more likely to give large gifts, that donors who have email addresses and business phone numbers in the database are more likely to give large gifts … but we are supposed to ignore all that?

The problem might not be the use of “descriptors”; the problem might be the terminology. Maybe we need to stop using the word “predictor”. One experienced practitioner, Alexander Oftelie, briefly touched on this nuance in a recent blog post. I quote (emphasis added by me):

“Data that on its own may seem unimportant — the channel someone donates, declining to receive the mug or calendar, preferring email to direct mail, or making ‘white mail’ or unsolicited gifts beyond their sustaining-gift donation — can be very powerful when they are brought together to paint a picture of engagement and interaction. Knowing who someone is isn’t by itself predictive (at best it may be correlated). Knowing how constituents choose to engage or not engage with your organization are the most powerful ingredients we have, and its already in our own garden.”

I don’t intend to critique Alexander’s post, which isn’t even on this particular topic. (It’s a good one — please read it.) But since he’s written this, permit me to scratch my head about it a bit.

In fact, I think I agree with him that there is a distinction between a behaviour and a descriptor/attribute. A behaviour, an action taken at a specific point in time (e.g., attending an event), can be classified as a predictor. An attribute (“who someone is,” e.g., whether they are married or single) is better described as a correlate. I would also be willing to bet that if we carefully compared behavioural variables to attribute variables, the behaviours would outperform, as Alexander says.

In practice, however, we don’t need to make that distinction. If we are using regression to build our models, we are concerned solely and completely with correlation. To say “at best it may be correlated” suggests that predictive modellers have something better at their disposal that they should be using instead of correlation. What is it? I don’t know, and Alexander doesn’t say.

If in a given data set, we can demonstrate that being married is associated with likelihood to make a donation, then it only makes sense to use that variable in our model. Choosing to exclude it based on our assumption that it’s an attribute and not a behaviour doesn’t make business sense. We are looking for practical results, after all, not chasing some notion of purity. And let’s not fool ourselves, or clients, that we are getting down to causation. We aren’t.

Consider that at least some “attributes” can be stated in terms of a behaviour. People get married — that’s a behaviour, although not related to our institution. People get married and also tell us about it (or allow it to be public knowledge so that we can record it) — that’s also a behaviour, and potentially an interaction with us. And on the other side of the coin, behaviours or interactions can be stated as attributes — a person can be an event attendee, a donor, a taker of surveys.

If my analysis informs me that widowed female alumni over the age of 60 are extremely good candidates for a conversation about Planned Giving, then are you really going to tell me I’m wrong to act on that information, just because sex, age and being widowed are not “behaviours” that a person voluntarily carries out? Mmmm — sorry!

Call it quibbling over semantics if you like, but don’t assume it’s so easy to draw a circle around true predictors. There is only one way to surface predictors: take a snapshot of all potentially relevant variables at a point in time, gather data on the outcome you wish to predict (e.g., giving) after that point in time, and then assess each variable in terms of the strength of its association with that outcome. The tools we use to make that assessment are nothing other than correlation and significance. Again, if there are other tools in common usage, then I don’t know about them.
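
That assessment is easy to mechanize. A sketch, assuming a hypothetical snapshot.csv with one row per constituent, all-numeric variables captured at the snapshot date, and a 0/1 column recording giving after that date:

```python
import pandas as pd
from scipy import stats

snapshot = pd.read_csv("snapshot.csv")            # hypothetical file
outcome = snapshot.pop("gave_after_snapshot")     # 0/1, observed after the snapshot

# Strength of association (correlation) and significance, variable by variable.
for col in snapshot.columns:
    r, p = stats.pearsonr(snapshot[col], outcome)
    print(f"{col:<25} r = {r:+.3f}   p = {p:.4f}")
```

Whether a given variable is a “behaviour” or an “attribute” never enters into it; the data either shows an association or it doesn’t.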

Caveats and concessions

I don’t maintain that this or that practice is “wrong” in all cases, nor do I insist on rules that apply universally. There’s a lot of art in this science, after all.

Using giving history as a predictor:

  • One may use some aspects of giving to predict outcomes that are not precisely the same as ‘Giving’, for example, likelihood to enter into a Planned Giving arrangement. The required degree of difference between predictors and outcome is a matter of judgement. I usually err on the side of scrupulously avoiding ANY leakage of the outcome side of the equation into the predictor side — but sure, rules can be bent.
  • I’ve explored the use of very early giving (the existence and size of gifts made by donors before age 30) to predict significant giving late in life. (See Mine your donor data with this baseball-inspired analysis.) But even then, I don’t use that as a variable in a model; it’s more of a flag used to help select prospects, in addition to modeling.

Using descriptors/attributes as predictors:

  • Some variables of this sort will appear to have subtly predictive effects in-model, effects that disappear when the model is deployed and new data starts coming in. That’s regrettable, but it’s something you can learn from — not a reason to toss all such variables into the trash, untested. The association between marital status and giving might be just a spurious correlation — or it might not be.
  • Business knowledge mixed with common sense will help keep you out of trouble. A bit of reflection should lead you to consider using ‘Married’ or ‘Number of Degrees’, while ignoring ‘Birth Month’ or ‘Eye Colour’. (Or astrological sign!)

There are many approaches one can take with predictive modeling, and naturally one may feel that one’s chosen method is “best”. The only sure way to proceed is to take the time to define exactly what you want to predict, try more than one approach, and then evaluate the performance of the scores when you have actual results available — which could be a year after deployment. We can listen to what experts are telling us, but it’s more important to listen to what the data is telling us.

//////////

Note: When I originally posted this, I referred to Atsuko Umeki as “he”. I apologize for this careless error and for whatever erroneous assumption that must have prompted it.

16 July 2013

Alumni engagement scoring vs. predictive modelling


Alumni engagement scoring has an undeniable appeal. What could be simpler? Just add up how many events an alum has attended, add more points for volunteering, add more points for supporting the Annual Fund, and maybe some points for other factors that seem related to engagement, and there you have your score. If you want to get more sophisticated, you can try weighting each score input, but generally engagement scoring doesn’t involve any advanced statistics and is easily grasped.
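
To show just how simple the arithmetic is, here’s a sketch; the point values and input names are invented for illustration:

```python
# Gut-feel weights agreed on by a committee (invented values).
WEIGHTS = {
    "events_attended":   1,   # one point per event attended
    "volunteer_roles":   3,   # volunteering counts for more
    "annual_fund_years": 2,   # years of Annual Fund support
}

def engagement_score(alum: dict) -> int:
    """Weighted sum of a few easily measured inputs."""
    return sum(weight * alum.get(key, 0) for key, weight in WEIGHTS.items())

print(engagement_score(
    {"events_attended": 3, "volunteer_roles": 1, "annual_fund_years": 2}))  # 3 + 3 + 4 = 10
```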

Not so with predictive modelling, which does involve advanced stats and isn’t nearly as intuitive; often it’s not possible to really say how an input variable is related to the outcome. It’s tempting, too, to think of an engagement score as being a predictor of giving and therefore a good replacement for modelling. Actually, it should be predictive — if it isn’t, your score is not measuring the right things — but an engagement score is not the same thing as a predictive model score. They are different tools for different jobs.

Not only are engagement scoring schemes different from predictive models, their simplicity is deceptive. Engagement scoring is incomplete without some plan for acting on observed trends with targeted programming. This implies the ability to establish causal drivers of engagement, which is a tricky thing.

Measure, act on what you see, then measure again: that’s a sequence of events, not a one-time thing. In fact, engagement scoring is like checking the temperature at regular intervals over a long period of time, looking for up and down trends not just for the group as a whole but via comparisons of important subgroups defined by age, sex, class year, college, degree program, geography or other divisions. This requires discipline: taking measurements in exactly the same way every year (or quarter, or what-have-you). If the score is fed by a survey component, you must survey constantly and consistently.

Predictive models and engagement scores have some surface similarities. They share variables in common, the output of both is a numerical score applied to every individual, and both require database work and math in order to calculate them. Beyond that, however, they are built in different ways and for different purposes. To summarize:

  • Predictive models are collections of potentially dozens of database variables weighted according to strength of correlation with a well-defined behaviour one is trying to predict (e.g. making a gift), in order to rank individuals by likelihood to engage in that behaviour. Both Alumni Relations and Development can benefit from the use of predictive models.
  • Engagement scores are collections of a very few selectively chosen database variables, either not weighted or weighted according to common sense and intuition, in order to roughly quantify the quality of “engagement”, however one wishes to define that term, for each individual. The purpose is to allow comparison of groups (faculties, age bands, geographical regions, etc.) with each other. Comparisons may be made at one point in time, but it is more useful to compare relative changes over time. The main user of scores is Alumni Relations, in order to identify segments requiring targeted programming, for example, and to assess the impact of programming on targeted segments over time.

Let’s explore key differences in more depth:

The purpose of modelling is prediction, for ranking or segmentation. The purpose of engagement scoring is comparison.

Predictive modelling scores are not usually included in reports. Used immediately in decision making, they may never be seen by more than one or two people. Engagement scores are included in reports and dashboards, and influence decision-making over a long span of time.

The target variable of a predictive model is quantifiable (e.g. giving, measurable in dollars). In engagement scoring, there is no target variable, only an output: a construct called “engagement”, which itself is not directly measurable.

Potential input variables for predictive models are numerous (100+) and vary from model to model. Input variables for engagement scores are limited to a handful of easily measured attributes (giving, event attendance, volunteering) which must remain consistent over time.

Variables for predictive models are chosen primarily using statistical methods (correlation) and only secondarily using judgment and “common sense.” For example, if the presence of a business phone number is highly correlated with being a donor, it may be included in the model. For engagement scores, variables are chosen by consensus of stakeholders, primarily according to subjective standards. For example, event attendance and giving would probably be deemed by the committee to indicate engagement, and would therefore be included in the score. Advanced statistics rarely come into play. (For more thoughts on this, read How you measure alumni engagement is up to you.)

In predictive models, giving and variables related to the activity of giving are usually excluded as variables (if ‘giving’ is what we are trying to predict). Using any aspect of the target variable as an input is bad practice in predictive modelling and is carefully avoided. You wouldn’t, for example, use attendance at a donor recognition event to predict likelihood to give. In engagement scoring, though, giving history is usually a key input, as it is common sense to believe that being a donor is an indication of engagement. (It might be excluded or reported separately if the aim is to demonstrate the causal link between engagement indicators and giving.)

Modelling variables are weighted using multiple linear regression or another statistical method that calculates the relative influence of each variable while simultaneously controlling for the influence of all other variables in the model. Engagement score variables are usually weighted according to gut feel. For example, coming to campus for Homecoming seems to say more about engagement than showing up for a pub night in one’s own city, therefore we give it more weight.

The quality of a predictive model is testable, first against a validation data set, and later against actual results. But there is no right or wrong way to estimate engagement, therefore the quality of scores cannot be evaluated conclusively.

The variables in a predictive model have complex relationships with each other that are difficult or impossible to explain except very generally. Usually there is no reason to explain a model in detail. The components in an engagement score, on the other hand, have plausible (although not verifiable) connections to engagement. For example, volunteering is indicative of engagement, while Name Prefix is irrelevant.

Predictive models are built for a single, time-limited purpose and then thrown away. They evolve iteratively and are ever-changing. On the other hand, once established, the method for calculating an engagement score must not change if comparisons are to be made over time. Consistency is key.

Which is all to say: alumni engagement scoring is not predictive modelling. (And neither is RFM analysis.) Only predictive modelling is predictive modelling.

12 July 2012

Evaluate models with fresh data using Tableau heat maps

When I build predictive models, I normally don’t build just one for each purpose. Presumably the model is going to be used, so I want it to be the best one possible. Yes, I test the model scores against a holdout data sample, but if I built only one model, I wouldn’t have anything solid on which to base my evaluation of the results. I might reject a lone model if it truly failed against the validation set, but that has never happened to me — even a lackluster performance can be better than nothing; the model may be flawed, but it is still useful. That is true of models in general. So testing results with nothing to compare them against is pointless.

I usually produce one multiple linear regression model and one binary logistic regression model using the stats software package Data Desk. Many permutations are possible, though: The sample to be scored can be limited in various ways, and the dependent variable can be formulated any number of ways. The choice of technique (for me, one type of regression or another) is usually determined by the nature of the DV (though not always). Given unlimited time, I would produce multiple models, but doing two at a time is manageable and keeps the task of comparison simple. The model that does the best job of classifying the members of the holdout sample wins the prize, and the loser is discarded.

But there’s a problem. I’ve never had a model bomb when it comes to scoring the validation set, but I HAVE had models fail after deployment. Data that is held out for validation of the model is one thing — the real world outside the model can be a whole OTHER thing. Logically it should not be so: If the model doesn’t “know” anything about the holdout data, then you’d think its performance on it would indicate how it will perform in the future.

Not so. At least, not always.

I am not so quick, then, to discard the loser. I like to evaluate both models on fresh data as it comes in (new gifts, for example). The loser might be the better choice overall, or it might turn out that a combination of the two models performs better than one on its own. Maybe one model works better for a subset of the population (young alumni, say), which suggests that adding interaction terms or even using a multiple-model approach is something to consider in the future. If the models predict slightly different propensities (as a result of how the DVs were formulated), with both of them contributors to a desirable result, then it might be worthwhile keeping both score sets by multiplying them together.

I don’t have an extended period of time for such testing — the model needs to be put into operation before it gets stale. Unfortunately, evaluation has always been a cumbersome process. I need to query the database for fresh results (conversions, upgrades, new planned giving expectancies — whatever) and then match them up by ID and score for each model (scores for untested models are not going to be in the database, obviously), and then produce some charts in Excel to visualize and compare results. It’s not a ton of work, but it takes just long enough to prevent me from doing it more than once before it’s time to commit. Even if I am evaluating the models after the fact, in order to learn for the next iteration of model-building, it’s not an exercise I will want to carry out repeatedly.

There is a better way. Think reports.

What does a report do? A report pulls real-time (or nightly-refreshed) data and assembles it in an interpretable way in a tabular or visual display. It performs this service on a regular or semi-regular basis, or on-demand. (Yeah, okay, maybe I should have said an ideal report). If part of your job consists in report preparation as well as predictive modeling, then you should be building model scores into your reports.

Here’s a tutorial on how to use Tableau to easily create a report that compares the performance of two sets of model scores in a single visualization called a heat map. This visualization can be refreshed with live data as often as desired. If you want, you can add other fields (age, sex, degree, donor status, etc.) and easily filter the data to see how model performance differs depending on the composition of the population. Note that this is probably not a report you’ll be sharing with your vice president. It does look cool, but it is mainly a diagnostic and exploration tool for your own use. The small initial investment of time is worth it if you build multiple models — it can be reused again and again.

This tutorial assumes you’re already somewhat familiar with the basics of Tableau. If you don’t have the software, and you don’t want to download a free trial, stick around anyway — other software packages offer ways to create heat maps, and the basic idea is the same.

In this example, I am comparing percentile scores from two models I developed to predict which alumni are most likely to give at least $1,000 in the current fiscal year. One is a multiple linear regression model with a dependent variable defined as the sum of giving for the past five years (log-transformed). The other is a logistic regression model with a binary dependent variable defined as ‘has giving of at least $1,000 in any one of the past three years’. The exact definitions of the DVs are reasonable but somewhat arbitrary. They are closely related, but different. The techniques and the predictor variables are also different, so we should expect the models to yield different results. Tested against the validation set (which was the same for both models), the logistic model proved superior. But only a test on new gift data will be truly convincing.
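
For readers who prefer code to Data Desk, here is roughly how those two models might be set up in Python. The column names are hypothetical, and this is a sketch of the general approach, not a reproduction of my actual models:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("alumni.csv")   # hypothetical: one row per alum

X = sm.add_constant(df[["event_attendance", "degree_count", "has_business_phone"]])

# MLR: the DV is the log-transformed sum of the past five years' giving.
mlr = sm.OLS(np.log1p(df["giving_past_5yrs"]), X).fit()

# Logistic: binary DV, $1,000+ in any one of the past three fiscal years.
big = (df[["giving_fy1", "giving_fy2", "giving_fy3"]].max(axis=1) >= 1000).astype(int)
logistic = sm.Logit(big, X).fit()

# Convert fitted values to 0-100 percentile scores, as used in this post.
# For ranking purposes, the logit's linear predictor orders alumni the
# same way the predicted probability does.
df["MLR"] = (mlr.fittedvalues.rank(pct=True) * 100).round().astype(int)
df["Logistic"] = (logistic.fittedvalues.rank(pct=True) * 100).round().astype(int)
```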

I want to take the entire population of alumni whom I have scored (a sample of about 27,000 individuals), and match them up with what they have given since the model was created. In this made-up example, let’s suppose I created my models last August, and I want to see what those 27,000 alumni have given since the day I completed the work. In reality, I would have chosen a winning model months ago and this would be an after-the-fact analysis, but I am doing this in order to enrich the visualization for the purposes of this example. (Cheating, in other words.)

Tableau allows you to combine data from multiple sources. In this case, you will connect to an Excel file to get your model scores (since they’re not in the database), and then connect to your database for giving results since September 1. If you do not connect directly to your database from Tableau, then you can paste your gifts data into a second sheet in your Excel workbook and extract the data via a single connection to that file — no problem. The first worksheet will have three columns: One for unique ID, and one each for the scores from the two models. In this example, the scores were output from Data Desk as percentiles. If you want, you can add columns for key attributes such as age, sex and so on. The second worksheet (or the custom SQL that retrieves data directly from your data warehouse) will provide ID and sum of giving since September 1.

Normally in report creation, Tableau handles all the aggregation of the data — the input is raw transaction data, with each ID potentially appearing on multiple rows. In this example, however, we have aggregated the data already (summing giving by ID), and there is only one row of data for each ID. It doesn’t matter, but it might have implications for some of the specific steps that follow.
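
Outside of Tableau, this data prep amounts to a group-by and a left join. A pandas sketch, with hypothetical file and column names:

```python
import pandas as pd

scores = pd.read_excel("model_scores.xlsx")      # ID, MLR, Logistic percentiles
gifts = pd.read_csv("gifts_since_sept1.csv")     # ID, amount: one row per gift

giving = (gifts.groupby("ID", as_index=False)["amount"].sum()
               .rename(columns={"amount": "Sum of Giving"}))

# Left join on ID: not everyone who was scored has given since Sept. 1.
merged = scores.merge(giving, on="ID", how="left")
merged["Sum of Giving"] = merged["Sum of Giving"].fillna(0)
```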

You should refer to your Tableau references for connecting to data sources. All I will add is that when you add the table (or worksheet) that contains the giving data, be sure to left-join on ID, because obviously not everyone you have scored has given since Sept. 1. From here on in, I will use Tableau terminology that won’t make any sense if you don’t know the software (specifically, Tableau Desktop version 7.0). Let’s build our first view:

  1. If your data has been extracted correctly, ‘ID’ will be listed under Dimensions, and your two model score sets will be listed under Measures. In this example, I will from now on refer to them as MLR (for Multiple Linear Regression) and Logistic. Obviously I’m referring to my own data — just try to replicate what I’m talking about using your own data file.
  2. For now, pause Auto Updates (or turn off automatic updates).
  3. Right-click on Logistic and select “Create bins …” This will bin the percentile score into whatever size we desire. Change the default bin size to 5 and click OK. Note that a new variable is created in the Dimensions pane, because bins are categorical, not numerical.
  4. Right-click on MLR and do the same thing.
  5. Drag Logistic (bins) to the Columns shelf. Drag MLR (bins) to the Rows shelf.
  6. Drag ID to the Text shelf. Click on the down-arrow of the ID pill you’ve just created, and select Measure –> Count. This will create a count of all IDs that fall into each cell. It turns green to indicate it’s now a measure instead of a dimension. (Because each ID appears in our data only once, it doesn’t matter whether we use Count or Count Distinct.)
  7. Change the Marks type from Automatic to Square (right above the Text shelf). Notice that the Text shelf suddenly turns into a Label shelf — each square of the heat map will be labeled with the number of IDs.
  8. Drag ID from the Dimensions pane again, and this time drop it onto the Color shelf.
  9. Click on the down-arrow of the ID pill you’ve just created, and select Measure –> Count. This will base the color or shading of the cell on the number of IDs that fall into that cell.

The top left corner of your screen will look like this:

Now we’re ready to allow the view to automatically update. The result won’t look much like a heat map: Probably just a bunch of little squares with numbers beside them. We need to enlarge the squares. Under the Size shelf is a slider: Move this to the centre of the size range. Then drag one of the rows in the view to make it taller — hover over the axis for MLR (on the far left) until the pointer turns into an up-and-down arrow, then click and drag. When you let go, the squares will resize and the alleys of white space should start to close up. Keep messing with it until the squares touch on all sides. With a little formatting of labels for readability, the final product will look something like this. (Click on thumbnail image for full size.)

A heat map can convey a lot of information at a glance. You can immediately see where a lot of individuals are concentrated: They’re in the darkest squares. The numbers are hard to read, but up in the top left of the map, we see that the number of people who fall into the 0-4 bin in both the MLR and Logistic models is 572. In the lower right area of the map, we see that 563 people fell into the 95 to 99 bin in both models. Notice that Tableau didn’t bin evenly: Every single bin has 5 score levels in it except for the bin labeled 100, which contains only individuals with a score of 100. In the map, we see that 147 people scored exactly 100 in both models. This can be corrected (using a calculated field instead of automatic binning), but I have decided to leave it the way it is. Due to the nature of this modeling exercise, I am mainly interested in the top few percentile scores anyway, and the 100 group is of particular interest. Having them mapped separately from the rest is not a problem.
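
If you want to prototype the same matrix outside Tableau, a pandas crosstab reproduces it, uneven 100 bin and all. This continues the hypothetical merged table from the earlier sketch:

```python
import pandas as pd

edges = list(range(0, 100, 5)) + [100, 101]    # 5-point bins; 100 gets its own bin
labels = [str(e) for e in edges[:-1]]          # "0", "5", ..., "95", "100"

merged["MLR (bins)"] = pd.cut(merged["MLR"], bins=edges, right=False, labels=labels)
merged["Logistic (bins)"] = pd.cut(merged["Logistic"], bins=edges, right=False, labels=labels)

heat = pd.crosstab(merged["MLR (bins)"], merged["Logistic (bins)"])
print(heat)   # counts per cell: the numbers shown in the heat map squares
```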

The names of the bins don’t reflect what they include. For example, “90” really means “90 to 94”. You can rename them using aliases. Right-click on Logistic in the Dimensions pane, select Field Properties –> Aliases…, and change the displayed values in the Values column. Do the same for MLR.

We haven’t looked at the recent-gift data yet, but before we move on, what can we learn from this view? It appears the models agree on the individuals with extremely high or extremely low scores. In the middle range, there is still a lot of agreement but also many more cases of divergence, in which an individual scores high in one model but low in the other. This is clear, at-a-glance evidence that our models are similar but different. Depending on the application, choosing one model over the other could have a big effect on the result, for better or worse. In this particular application, where I am interested mainly in very high-scoring alumni only, it may not make that much difference at all … but let’s not jump to that conclusion just yet.

If your data set included some key grouping information such as age or sex, it might be interesting to create a filter to examine whether the models differ on those factors. Here’s an example with ‘Age’:

  1. Drag Age from the Measures pane into the Filters shelf.
  2. When Tableau asks you how you want to filter on Age, select “All Values” and click Next.
  3. On the next box, select Range of Values, and click OK.
  4. Hover over the green Age pill on the filters shelf, click the down-arrow on the right end of the pill, and select Show Quick Filter.

Now you can set the upper and lower age bounds of the individuals you want to be counted in the heat map. As you slide the scale, it will display Age with numbers after the decimal, even though your values are all whole numbers. If this bothers you, right-click on Age in the Measures pane, select Field Properties –> Number Format…, and click on Number (Custom). Adjust the number of decimal places to zero. Here’s what the quick filter looks like:

The next two images show the heat map for different age ranges. The first one is ages 20 to 50, the second is 51 to 80. Again, click on the thumbnails for full-size images — although the beauty of a heat map is that you can see the pattern from a distance.

Right off the bat, it’s evident that it’s harder for younger individuals to get a high score, but they fare better in the MLR model than they do in the Logistic model. Imagine a 45-degree line sloping from the top left corner to the bottom right corner — the presence of more dark-shaded squares under that line indicates individuals with higher MLR scores than Logistic scores. The logistic model, on the other hand, slightly favours older alumni. This alone might explain why the Logistic model outperformed the MLR model in terms of the validation set. The difference might be due to how age-related variables were introduced to each model as predictors; they may have been more influential in one than the other. It’s hard to say without going back to the models themselves for a close look.

One can spend a lot of time playing and learning with these filters. Let’s fast-forward and (finally) introduce recent-gift data — the giving that all scored individuals have engaged in since September 1, the day after the models were supposedly created. This data appears in the Measures pane as a variable I’ll call ‘Sum of Giving’. I’m specifically interested in who has given at least $1,000 (cumulatively), so I will need to create a calculated field to flag these people.

  1. Right-click on Sum of Giving and select Create Calculated Field…
  2. Give the field a name. I called it “Leadership donor”.
  3. The field Sum of Giving is already in the expression window. Now you just need to add some text around it to complete the expression (something like IF [Sum of Giving] > 999 THEN 1 END).
  4. Click OK. This creates a field (variable) with the value 1 for any donor who has given at the Leadership level, and nothing if otherwise. Note that you can enter any amount in place of 999. If you want to count donors vs. non-donors, enter “>0”.
  5. The field appears in the Measures pane, because Tableau recognizes it as numeric. We’re using it as a categorical variable, so let’s convert it into a Dimension instead. Right-click on the field name and select “Convert to Dimension”, or simply drag the field into the Dimensions pane — both actions accomplish the same thing.

Now we have a flag we can use to zero in on our higher-end donors. Let’s create a new view for that. At the lower left of your screen, right-click on the tab for the existing view and select “Duplicate Sheet”. This will allow us to continue exploring the heat map without changing our original version. We could, of course, do all our work in a single view and use filters to dynamically alter the view — that’s one of the strengths of Tableau — but for now let’s keep our views separate.

  1. If you still have filters applied for Age or other variables, click on the quick filter menu and select “Clear Filter”. You can reapply it later if you want — we’re just getting it out of the way so we can see the full picture.
  2. Drag ‘Leadership donor’ to the Filters shelf.
  3. In the box that pops up, click “Select from list” on the General tab (it should already be selected), and then check the little box for ‘1’.
  4. Click OK.

The result looks like this. (Click for full size.)

Our big donors are clustered nicely down in the lower right corner, where both the MLR and the Logistic model scores are very high. Some of the lower-score bins contain zero Leadership donors, and Tableau has automatically hidden those rows and columns from view. Take a couple of minutes to study the map. Follow the three darkest squares (labeled 48, 74, and 23) as they form a 45-degree line up the centre of the map. If you compare the values in the squares that are directly opposite each other over this line, you’ll notice that there are slightly more Leadership donors on the upper side of the line. Those are donors who have higher Logistic scores than MLR scores. As well, notice that the scattered cloud of donors above the line is more extensive than that below the line. These observations should lead us to believe that the Logistic model performs slightly better than the MLR model.

That conclusion is a bit hasty, though. There might be more Leadership donors on the high-Logistic/low-MLR side simply because more alumni ended up in those squares in the first place. We need to calculate the PERCENTAGE of the population of each square that went on to become a Leadership donor. That’s right, we’re going to create a third view, and calculate percentages to plug into each square.

  1. Right-click on the tab for Sheet 2 and select Duplicate Sheet. (By the way, you can name these sheets whatever you want, just as in Excel.)
  2. Remove the filter for Leadership donor.
  3. Under Analysis in the top menu bar, select Create Calculated Field…
  4. Name the new field ‘Leadership percentage’.
  5. Enter an expression that divides the number of Leadership donors by the total number of individuals (something like SUM([Leadership donor]) / COUNT([ID])).
  6. Click OK. The new field appears in the Measures pane, which is fine.
  7. Drag ‘Leadership percentage’ from Measures onto the Label shelf, replacing the count of ID.
  8. Drag ‘Leadership percentage’ from Measures again, this time onto the Color shelf.
  9. Right-click on any square in the map, and select Format…, which opens a formatting pane at the far left.
  10. On the Pane tab, in the Default section, click on the down-arrow to the right of “Numbers”, and select Percentage.

The result is below. (Click for full size.) You can select any precision for your percentages — I’ve rounded to whole numbers to avoid clutter.

The darkest square is a single donor with a very high MLR score but a very low Logistic score, who just happened to give at the Leadership level. That square is of course labeled 100%, which causes the rest of the display to be toned down to a degree that makes it hard to see the patterns. This single donor might be a person to look at more carefully, but for now, let’s exclude that person from the map. Hover your pointer over the square, and select Exclude from the tooltip box. (This creates a specific filter for this individual, which you can remove anytime.) All the squares are recoloured accordingly:

Now some of the darkest squares are also based on very sparse data. You can exclude any that you wish, but I’m fine with this display for now. For one thing, we can clearly see that having a Logistic score of 95 or higher is darn significant, regardless of what a donor’s MLR score is. For example, there are four Leadership donors who scored only 65-69 in the MLR model but have Logistic scores of 95-99, which is what we want to see. (Those donors are in the square labeled 14%.)
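
Incidentally, the same percentages fall out of one more crosstab in pandas, continuing the earlier hypothetical sketch:

```python
import pandas as pd

merged["Leadership donor"] = (merged["Sum of Giving"] > 999).astype(int)

pct = pd.crosstab(merged["MLR (bins)"], merged["Logistic (bins)"],
                  values=merged["Leadership donor"], aggfunc="mean")
print((pct * 100).round(0))   # percentage of each cell that gave $1,000+
```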

Being able to demonstrate that one model is superior is pretty nifty. But I am especially intrigued at how easy it is to see how the models might work together to improve accuracy.

Have a look at the square containing individuals who scored 100 in both models. There were 147 such individuals, and 48 of them gave $1,000 or greater — a whopping 32.6%. Here are a couple of facts to think about:

  • Of all the individuals who scored 100 in the Logistic model, 26.7% went on to give at the Leadership level.
  • Of all the individuals who scored 100 in the MLR model, 23.1% went on to give at the Leadership level.

Do you see what I’m getting at? When we combine both scores and zero in on people in the top percentile for both models, our yield of Leadership donors increases by nearly six percentage points over the best-performing model, to 32.6%.

The same boost is evident for other high-scoring cells in the heat map: The logistic model identifies some big donors that the MLR model misses, but the MLR model can enhance the accuracy of the logistic model. This is potentially useful for prospect identification in Major Giving, when we really want to be as focused as possible.

So far I’ve shown you only donor numbers. What about revenue? Our data set includes gift amounts, so let’s create a new view to visualize actual aggregate dollar totals.

  1. Duplicate the last sheet you created, and remove any filters that had been applied.
  2. Drag ‘Sum of Giving’ to the Label and Color shelves.
  3. Format the values as currency.
  4. For fun, change the color from green to red by clicking on Edit Colors in the context menu for the Sum of Giving card.

The result is pretty dramatic.

This is for all donors, not just Leadership donors, but if you want to narrow it down to Leadership donors only, re-apply your filter.

Just as with raw donor counts, the view above is a little misleading, simply because more prospects equals more donors, equals more dollars. So let’s create a calculated field to give us AVERAGE dollars per donor for every cell in the heat map.

The individuals with scores of 100 in both models gave nearly $5,000 on average — no other cell comes close. But guess what’s even better:

  • The individuals who scored 100 in the Logistic model gave an average of $2,927.
  • The individuals who scored 100 in the MLR model gave an average of $2,971.

The models are strongest where they intersect!

I’ve spent a lot of time and more than 4,000 words explaining how to do this in Tableau. This is very unusual for me — why a specific product such as Tableau, when one can create heat maps even in Excel? *

  • It’s just so easy to do it in Tableau, and the result looks attractive without requiring the user to fuss with formatting.
  • The data can be refreshed whenever necessary. If you’re connecting to an Excel file, simply paste new data into the file and refresh the data extract. It’s that simple. (Remember to refresh the extract rather than replace the data source entirely, if you want to retain your aliases as you’ve defined them.)
  • That goes for refreshing the giving data, AND for loading a whole different set of individuals and scores. You don’t need to rebuild these views from scratch (although it’s pretty easy to do so).
  • Tableau allows you to dynamically filter the data any which way you want. It’s a great way to explore the data. In my example, it would have been really interesting to filter on donors who UPGRADED to the $1,000+ level. Which model did a better job predicting upgrading? I don’t know, but I’m going to find out.
  • You can drill down to the underlying data. If you want to see a list of the people who scored 100 in both models, just hover the pointer over that square and click on the data icon, then the ‘Underlying’ tab. Imagine having wealth/capacity scores on one axis, and propensity scores on the other …
  • I’ve shared my heat maps here as static images, but you can share your analyses as fully-functioning views, even with people who don’t have the software on their computers. Save it as a Packaged Workbook, and they’ll be able to open it in Tableau Reader (which they can download for free). They can use the filters you’ve set up to play with the data themselves.

This may be the longest CoolData post ever, but as usual I feel I am barely scratching the surface.

* P.S.: Heat maps are easily created in a combination of Data Desk and Excel. Without going into too much detail: In Data Desk use contingency tables (a.k.a. cross tabs) to create the basic matrix of numbers, with one score set as x and the other as y, and use derived variable expressions to limit the counts as desired. Copy and paste the table text into Excel, and use conditional formatting to create the desired shading. Unfortunately this requires some fussing and the result is static.

24 March 2010

More and more and more models


In the back of my head I keep a list of predictive models that I would like to build someday. It just makes sense: I started out by creating a general model for predicting any kind of alumni giving to our institution, and now I create models targeted at specific types of giving (Planned Giving and Phonathon), and one specific type of behaviour (event attendance). From here, all kinds of possibilities suggest themselves.

In the area of annual giving, I’m very interested in the predictability of increasing pledge size year-over-year. The Phonathon model I created does a fantastic job of predicting who will participate (whether or not they’ve ever given before), and who will give the most. Of course this delights and amazes me, but these days I’m getting excited by how well the model predicts which of last year’s donors will make a larger pledge in the current year. (I’ve already written about this: “Predicting who will increase their pledge”.) I’m wondering: if I tagged the known Increasers with an indicator variable, could I train a decent model on them and do an even better job of predicting who might respond to a boost in target ask?
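
Building that indicator would be straightforward. A sketch, with hypothetical pledge data and fiscal years:

```python
import pandas as pd

pledges = pd.read_csv("pledges.csv")   # donor_id, fiscal_year, pledge_amount (hypothetical)

# One column per fiscal year (assumes fiscal_year is stored as an integer).
wide = pledges.pivot_table(index="donor_id", columns="fiscal_year",
                           values="pledge_amount", aggfunc="sum").fillna(0)

# Tag donors who gave last year AND pledged more this year: the "Increasers".
wide["increaser"] = ((wide[2009] > 0) & (wide[2010] > wide[2009])).astype(int)
```

That 0/1 flag becomes the dependent variable for the next model.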

On the negative side, I’ve also wondered if it would make any sense to create a model for “likelihood to default on a pledge”. This would be taking a page from the risk-scoring models I imagine banks, insurance companies, and credit card companies must use to flag customers most likely to default on a loan, file a claim, or engage in fraud. Of course, the question is not “can it be done?”, but “what business problem would this solve?” What would we do if we knew which donors are most likely to fail to follow through on their pledge? Would we start sending out reminders earlier? Would we remind more often? Would we do follow-up calling? Those are questions I’d have to ask people who work in Annual Giving.

And finally, I’m intrigued by the idea of creating a “mid-range gift propensity” model. It’s my impression that prospects with capacity to give at the very high end of the Annual Giving scale, yet who fall short of the Major Gift threshold, are an overlooked and untapped resource. The dollar values will differ from institution to institution, but I’m thinking of giving in the range of $10,000 as an outright gift, to perhaps $25,000 over five years. Now, no model I create is ever going to say anything about capacity. But I strongly suspect that there is a segment of the alumni population who feel motivated to give much more than they do, but who need to be asked differently. Some donors will respond generously to a call from a student, but frankly I don’t think the generic student call is going to cut it for most people in this category.

Again, creating such a model requires coordination with fundraisers over what is meant by “different asking”. Perhaps it involves staff, or maybe peer calling (a former classmate who volunteers to make calls?). I don’t know. But if I were to suggest an Annual Giving experiment with a high potential ROI, this would be it.

What’s YOUR fantasy predictive model?
