CoolData blog

7 January 2015

New finds in old models

 

When you build a predictive model, you can never be sure it’s any good until it’s too late. Deploying a mediocre model isn’t the worst mistake you can make, though. The worst mistake would be to build a second mediocre model because you haven’t learned anything from the failure of the first.

 

Performance against a holdout data set for validation is not a reliable indicator of actual performance after deployment. Validation may help you decide which of two or more competing models to use, or it may provide reassurance that your one model isn’t total junk. It’s not proof of anything, though. Those lovely predictors, highly correlated with the outcome, could be fooling you. There are no guarantees they’re predictive of results over the year to come.

 

In the end, the only real evidence of a model’s worth is how it performs on real results. The problem is, those results happen in the future. So what is one to do?

 

I’ve long been fascinated with Planned Giving likelihood. Making a bequest seems like the ultimate gesture of institutional affinity (ultimate in every sense). On the plus side, that kind of affinity ought to be clearly evidenced in behaviours such as event attendance, giving, volunteering and so on. On the negative side, Planned Giving interest is uncommon enough that comparing expectancies with non-expectancies will sometimes lead to false predictors based on sparse data. For this reason, my goal of building a reliable model for predicting Planned Giving likelihood has been elusive.

 

Given that a validation data set taken from the same time period as the training data can produce misleading correlations, I wondered whether I could do one better: draw my holdout sample not from the same time period as the data used to build the model, but from the future.

 

As it turned out, yes, I could.

 

Every year I save my regression analyses as Data Desk files. Although I assess the performance of the output scores, I don’t often go back to the model files themselves. However, they’re there as a document of how I approached modelling problems in the past. As a side benefit, each file is also a snapshot of the alumni population at that point in time. These data sets may consist of a hundred or more candidate predictor variables — a well-rounded picture.

 

My thinking went like this: Every old model file represents data from the past. If I pretend that this snapshot is really the present, then in order to have access to knowledge of the future, all I have to do is look at today’s data stored in the database.

 

For example, for this blog post, I reached back two years to a model I created in Data Desk for predicting likelihood to upgrade to the Leadership level in Annual Giving. I wasn’t interested in the model itself. Rather, I wanted to examine the underlying variables I had to work with at the time. This model had been an ambitious undertaking, with some 170 variables prepared for analysis. Many of course were transformations of variables or combinations of interacting variables. Among all those variables was one indicating whether a case was a current Planned Giving expectancy or not, at that point in time.

 

In this snapshot of the database from two years ago, some of the cases that were not expectancies would have become so since then. In other words, I now had the best of both worlds. I had a comprehensive set of potential predictors as they existed two years ago, AND access to the hitherto unknowable future: The identities of the people who had become expectancies after the predictors had been frozen in time.

 

As I said, my old model was not intended to predict Planned Giving inclination. So I built a new model, using “Is an Expectancy” (0/1) as the target variable. I trained the regression model on the two-year-old expectancy data — I didn’t even look at the new expectancies while building the model. No: I used those new expectancies as my validation data set.
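For anyone who works in pandas rather than Data Desk, the setup looks roughly like this. It is only a sketch: the file and column names are placeholders, and the actual modelling was done in Data Desk.

    import pandas as pd

    # Two-year-old snapshot of candidate predictors, exported from the old model file
    # (file and column names here are placeholders, not the ones I actually used).
    snapshot = pd.read_csv("snapshot_two_years_ago.csv")   # includes "id" and "is_expectancy" (0/1)

    # Everyone flagged as a Planned Giving expectancy in the database today.
    current_ids = pd.read_csv("current_expectancies.csv")["id"]

    # "New expectancy" = not an expectancy in the snapshot, but an expectancy now.
    snapshot["new_expectancy"] = (
        (snapshot["is_expectancy"] == 0) & snapshot["id"].isin(current_ids)
    ).astype(int)

    # The regression is trained on the old flag (is_expectancy) only; the new
    # expectancies are set aside as the after-the-fact test set.
    test_set = snapshot[snapshot["new_expectancy"] == 1]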

 

“Validation” might be too strong a word, given that there were only 80 or so new cases. That’s a lot of bequest intentions, for sure, but in terms of data it’s a drop in the bucket compared with the number of cases being scored. Let’s call it a test data set. I used this test set to help me analyze the model, in a couple of ways.

 

First I looked at how new expectancies were scored by the model I had just built. The chart below shows their distribution by score decile. Slightly more than 50% of new expectancies were in the top decile. This looks pretty good — keeping in mind that this is what actual performance would have looked like had I really built this model two years ago (which I could have):

 

 

[Chart: distribution of new expectancies by score decile]

(Even better, looking at percentiles, most of the expectancies in that top 10% are concentrated nicely in the top few percentiles.)
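For those keeping their scores in a flat file, the decile breakdown is trivial to reproduce in pandas. A sketch, continuing from the one above and assuming a placeholder column that holds the new model's decile score:

    # "snapshot" carries the new model's score expressed as a decile from 1 to 10;
    # the column name is a placeholder.
    new_expec = snapshot[snapshot["new_expectancy"] == 1]
    by_decile = new_expec["score_decile"].value_counts(normalize=True).sort_index() * 100
    print(by_decile.round(1))   # share (%) of new expectancies landing in each decile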

 

But I didn’t stop there. It is also evident that almost half of new expectancies fell outside the top 10 percent of scores, so clearly there was room for improvement. My next step was to examine the individual predictors I had used in the model. These were of course the predictors most highly correlated with being an expectancy. They were roughly the following:
  • Year person’s personal information in the database was last updated
  • Number of events attended
  • Age
  • Year of first gift
  • Number of alumni activities
  • Indicated “likely to donate” on 2009 alumni survey
  • Total giving in last five years (log transformed)
  • Combined length of name Prefix + Suffix

 

I ranked the correlation of each of these with the 0/1 indicator meaning “new expectancy,” and found that most of the predictors were still fine, although their order in the ranking had changed. Donor likelihood (from survey) and recent giving were more important, and alumni activities and how recently a person’s record was updated were less important.

 

This was interesting and useful, but what was even more useful was looking at the correlations between ALL potential predictors and the state of being a new expectancy. A number of predictors that would have been too far down the ranked list to consider using two years ago were suddenly looking much better. In particular, many variables related to participation in alumni surveys bubbled closer to the top as potentially significant.
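Scanning all of the candidate predictors at once is just as easy outside Data Desk. A pandas sketch, again with placeholder names, and with the old and new expectancy flags excluded from the candidate list:

    # Correlate every numeric candidate predictor in the old snapshot with the
    # new-expectancy flag, and rank by absolute strength. Pass method="spearman"
    # to corrwith() if you prefer a rank correlation.
    candidates = snapshot.select_dtypes("number").drop(columns=["is_expectancy", "new_expectancy"])
    ranked = candidates.corrwith(snapshot["new_expectancy"]).sort_values(key=abs, ascending=False)
    print(ranked.head(25))   # variables worth a second look in the next iteration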

 

This exercise suggests a way to proceed with iterative, yearly improvements to some of your standard models:
  • Dig up an old model from a year or more ago.
  • Query the database for new cases that represent the target variable, and merge them with the old datafile.
  • Assess how your model performed or, if you created more than one model, see which model would have performed best. (You should be doing this anyway.)
  • Go a layer deeper, by studying the variables that went into those models — the data “as it was” — to see which variables had correlations that tricked you into believing they were predictive, and which variables truly held predictive power but may have been overlooked.
  • Apply what you learn to the next iteration of the model. Leave out the variables with spurious correlations, and give special consideration to variables that may have been underestimated before.

12 July 2012

Evaluate models with fresh data using Tableau heat maps

When I build predictive models, I normally don’t build just one for each purpose. Presumably the model is going to be used, so I want it to be the best one possible. Yes, I test the model scores against a holdout data sample, but if I built only one model, I wouldn’t have anything solid on which to base my evaluation of the results. I might reject a lone model if it truly failed against the validation set, but that has never happened to me — even a lackluster performance can be better than nothing. A model can be flawed but still useful; that is true of models in general. So testing results with nothing to compare them against is pointless.

I usually produce one multiple linear regression model and one binary logistic regression model using the stats software package Data Desk. Many permutations are possible, though: The sample to be scored can be limited in various ways, and the dependent variable can be formulated any number of ways. The choice of technique (for me, one type of regression or another) is usually determined by the nature of the DV (though not always). Given unlimited time, I would produce multiple models, but doing two at a time is manageable and keeps the task of comparison simple. The model that does the best job of classifying the members of the holdout sample wins the prize, and the loser is discarded.

But there’s a problem. I’ve never had a model bomb when it comes to scoring the validation set, but I HAVE had models fail after deployment. Data that is held out for validation of the model is one thing — the real world outside the model can be a whole OTHER thing. Logically it should not be so: If the model doesn’t “know” anything about the holdout data, then you’d think its performance on it would indicate how it will perform in the future.

Not so. At least, not always.

I am not so quick, then, to discard the loser. I like to evaluate both models on fresh data as it comes in (new gifts, for example). The loser might be the better choice overall, or it might turn out that a combination of the two models performs better than either one on its own. Maybe one model works better for a subset of the population (young alumni, say), which suggests that adding interaction terms or even using a multiple-model approach is something to consider in the future. If the models predict slightly different propensities (as a result of how the DVs were formulated), with both of them contributing to a desirable result, then it might be worthwhile to keep both score sets by multiplying them together.

I don’t have an extended period of time for such testing — the model needs to be put into operation before it gets stale. Unfortunately, evaluation has always been a cumbersome process. I need to query the database for fresh results (conversions, upgrades, new planned giving expectancies — whatever), match them up by ID and score for each model (scores for untested models are not going to be in the database, obviously), and then produce some charts in Excel to visualize and compare results. It’s not a ton of work, but it takes just long enough to prevent me from doing it more than once before it’s time to commit. Even if I am evaluating the models after the fact, in order to learn for the next iteration of model-building, it’s not an exercise I will want to carry out repeatedly.

There is a better way. Think reports.

What does a report do? A report pulls real-time (or nightly-refreshed) data and assembles it in an interpretable way in a tabular or visual display. It performs this service on a regular or semi-regular basis, or on-demand. (Yeah, okay, maybe I should have said an ideal report). If part of your job consists in report preparation as well as predictive modeling, then you should be building model scores into your reports.

Here’s a tutorial on how to use Tableau to easily create a report that compares the performance of two sets of model scores in a single visualization called a heat map. This visualization can be refreshed with live data as often as desired. If you want, you can add other fields (age, sex, degree, donor status, etc.) and easily filter the data to see how model performance differs depending on the composition of the population. Note that this is probably not a report you’ll be sharing with your vice president. It does look cool, but it is mainly a diagnostic and exploration tool for your own use. The small initial investment of time is worth it if you build multiple models — it can be reused again and again.

This tutorial assumes you’re already somewhat familiar with the basics of Tableau. If you don’t have the software, and you don’t want to download a free trial, stick around anyway — other software packages offer ways to create heat maps, and the basic idea is the same.

In this example, I am comparing percentile scores from two models I developed to predict which alumni are most likely to give at least $1,000 in the current fiscal year. One is a multiple linear regression model with a dependent variable defined as the sum of giving for the past five years (log-transformed). The other is a logistic regression model with a binary dependent variable defined as ‘has giving of at least $1,000 in any one of the past three years’. The exact definitions of the DVs are reasonable but somewhat arbitrary. They are closely related, but different. The techniques and the predictor variables are also different, so we should expect the models to yield different results. Tested against the validation set (which was the same for both models), the logistic model proved superior. But only a test on new gift data will be truly convincing.
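For the record, the two dependent variables could be set up in pandas along these lines. The fiscal-year column names are placeholders, and log1p is just one reasonable way to keep zero givers from blowing up the log transform:

    import numpy as np
    import pandas as pd

    # One row per alum, with a column of giving for each fiscal year (placeholder names).
    giving = pd.read_excel("giving_by_fiscal_year.xlsx")

    # MLR model: sum of giving for the past five years, log-transformed.
    past_five = ["fy2008", "fy2009", "fy2010", "fy2011", "fy2012"]
    giving["dv_mlr"] = np.log1p(giving[past_five].sum(axis=1))

    # Logistic model: gave at least $1,000 in any one of the past three years.
    past_three = ["fy2010", "fy2011", "fy2012"]
    giving["dv_logistic"] = (giving[past_three] >= 1000).any(axis=1).astype(int)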

I want to take the entire population of alumni whom I have scored (a sample of about 27,000 individuals), and match them up with what they have given since the model was created. In this made-up example, let’s suppose I created my models last August, and I want to see what those 27,000 alumni have given since the day I completed the work. In reality, I would have chosen a winning model months ago and this would be an after-the-fact analysis, but I am doing this in order to enrich the visualization for the purposes of this example. (Cheating, in other words.)

Tableau allows you to combine data from multiple sources. In this case, you will connect to an Excel file to get your model scores (since they’re not in the database), and then connect to your database for giving results since September 1. If you do not connect directly to your database from Tableau, then you can paste your gifts data into a second sheet in your Excel workbook and extract the data via a single connection to that file — no problem. The first worksheet will have three columns: One for unique ID, and one each for the scores from the two models. In this example, the scores were output from Data Desk as percentiles. If you want, you can add columns for key attributes such as age, sex and so on. The second worksheet (or the custom SQL that retrieves data directly from your data warehouse) will provide ID and sum of giving since September 1.

Normally in report creation, Tableau handles all the aggregation of the data — the input is raw transaction data, with each ID potentially appearing on multiple rows. In this example, however, we have aggregated the data already (summing giving by ID), and there is only one row of data for each ID. It doesn’t matter, but it might have implications for some of the specific steps that follow.
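If you would rather sanity-check the numbers outside Tableau as we go, the same data prep is a few lines of pandas. A sketch, with placeholder workbook, sheet and column names:

    import pandas as pd

    # Pull the two worksheets described above.
    scores = pd.read_excel("model_scores.xlsx", sheet_name="scores")   # ID, MLR, Logistic
    gifts = pd.read_excel("model_scores.xlsx", sheet_name="gifts")     # ID, Sum of Giving

    # Left join on ID: keep every scored alum, whether or not they gave after Sept. 1.
    data = scores.merge(gifts, on="ID", how="left")
    data["Sum of Giving"] = data["Sum of Giving"].fillna(0)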

You should refer to your Tableau references for connecting to data sources. All I will add is that when you add the table (or worksheet) that contains the giving data, be sure to left-join on ID, because obviously not everyone you have scored has given since Sept. 1. From here on in, I will use Tableau terminology that won’t make any sense if you don’t know the software (specifically, Tableau Desktop version 7.0). Let’s build our first view:

  1. If your data has been extracted correctly, ‘ID’ will be listed under Dimensions, and your two model score sets will be listed under Measures. In this example, I will from now on refer to them as MLR (for Multiple Linear Regression) and Logistic. Obviously I’m referring to my own data — just try to replicate what I’m talking about using your own data file.
  2. For now, pause Auto Updates (or turn off automatic updates).
  3. Right-click on Logistic and select “Create bins …” This will bin the percentile score into whatever size we desire. Change the default bin size to 5 and click OK. Note that a new variable is created in the Dimensions pane, because bins are categorical, not numerical.
  4. Right-click on MLR and do the same thing.
  5. Drag Logistic (bins) to the Columns shelf. Drag MLR (bins) to the Rows shelf.
  6. Drag ID to the Text shelf. Click on the down-arrow of the ID pill you’ve just created, and select Measure –> Count. This will create a count of all IDs that fall into each cell. It turns green to indicate it’s now a measure instead of a dimension. (Because each ID appears in our data only once, it doesn’t matter whether we use Count or Count Distinct.)
  7. Change the Marks type from Automatic to Square (right above the Text shelf). Notice that the Text shelf suddenly turns into a Label shelf — each square of the heat map will be labeled with the number of IDs.
  8. Drag ID from the Dimensions pane again, and this time drop it onto the Color shelf.
  9. Click on the down-arrow of the ID pill you’ve just created, and select Measure –> Count. This will base the color or shading of the cell on the number of IDs that fall into that cell.

The top left corner of your screen will look like this:

Now we’re ready to allow the view to automatically update. The result won’t look much like a heat map: Probably just a bunch of little squares with numbers beside them. We need to enlarge the squares. Under the Size shelf is a slider: Move this to the centre of the size range. Then drag one of the rows in the view to make it taller — hover over the axis for MLR (on the far left) until the pointer turns into an up-and-down arrow, then click and drag. When you let go, the squares will resize and the alleys of white space should start to close up. Keep messing with it until the squares touch on all sides. With a little formatting of labels for readability, the final product will look something like this. (Click on thumbnail image for full size.)

A heat map can convey a lot of information at a glance. You can immediately see where a lot of individuals are concentrated: They’re in the darkest squares. The numbers are hard to read, but up in the top left of the map, we see that the number of people who fall into the 0-4 bin in both the MLR and Logistic models is 572. In the lower right area of the map, we see that 563 people fell into the 95 to 99 bin in both models. Notice that Tableau didn’t bin evenly: Every single bin has 5 score levels in it except for the bin labeled 100, which contains only individuals with a score of 100. In the map, we see that 147 people scored exactly 100 in both models. This can be corrected (using a calculated field instead of automatic binning), but I have decided to leave it the way it is. Due to the nature of this modeling exercise, I am mainly interested in the top few percentile scores anyway, and the 100 group is of particular interest. Having them mapped separately from the rest is not a problem.
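For anyone following along in pandas, the same count matrix comes from a binned cross-tabulation, continuing from the earlier sketch. The binning below mirrors Tableau's, with 100 in a bin of its own:

    import pandas as pd

    # Put each percentile score into a 5-point bin (0-4, 5-9, ..., 95-99, 100).
    edges = list(range(0, 105, 5)) + [101]
    labels = list(range(0, 105, 5))
    data["MLR_bin"] = pd.cut(data["MLR"], bins=edges, labels=labels, right=False)
    data["Logistic_bin"] = pd.cut(data["Logistic"], bins=edges, labels=labels, right=False)

    counts = pd.crosstab(data["MLR_bin"], data["Logistic_bin"])
    print(counts)   # rows: MLR bins, columns: Logistic bins, cells: number of IDs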

The names of the bins don’t reflect what they include. For example, “90” really means “90 to 94”. You can rename them using aliases. Right-click on Logistic in the Dimensions pane, select Field Properties –> Aliases…, and change the displayed values in the Values column. Do the same for MLR.

We haven’t looked at the recent-gift data yet, but before we move on, what can we learn from this view? It appears the models agree on the individuals with extremely high or extremely low scores. In the middle range, there is still a lot of agreement but also many more cases of divergence, in which an individual scores high in one model but low in the other. This is clear, at-a-glance evidence that our models are similar but different. Depending on the application, choosing one model over the other could have a big effect on the result, for better or worse. In this particular application, where I am interested mainly in very high-scoring alumni, it may not make that much difference at all … but let’s not jump to that conclusion just yet.

If your data set included some key grouping information such as age or sex, it might be interesting to create a filter to examine whether the models differ on those factors. Here’s an example with ‘Age’:

  1. Drag Age from the Measures pane into the Filters shelf.
  2. When Tableau asks you how you want to filter on Age, select “All Values” and click Next.
  3. On the next box, select Range of Values, and click OK.
  4. Hover over the green Age pill on the filters shelf, click the down-arrow on the right end of the pill, and select Show Quick Filter.

Now you can set the upper and lower age bounds of the individuals you want to be counted in the heat map. As you slide the scale, it will display Age with numbers after the decimal, even though your values are all whole numbers. If this bothers you, right-click on Age in the Measures pane, select Field Properties –> Number Format…, and click on Number (Custom). Adjust the number of decimal places to zero. Here’s what the quick filter looks like:

The next two images show the heat map for different age ranges. The first one is ages 20 to 50, the second is 51 to 80. Again, click on the thumbnails for full-size images — although the beauty of a heat map is that you can see the pattern from a distance.

Right off the bat, it’s evident that it’s harder for younger individuals to get a high score, but they fare better in the MLR model than they do in the Logistic model. Imagine a 45-degree line sloping from the top left corner to the bottom right corner — the presence of more dark-shaded squares under that line indicates individuals with higher MLR scores than Logistic scores. The logistic model, on the other hand, slightly favours older alumni. This alone might explain why the Logistic model outperformed the MLR model in terms of the validation set. The difference might be due to how age-related variables were introduced to each model as predictors; they may have been more influential in one than the other. It’s hard to say without going back to the models themselves for a close look.

One can spend a lot of time playing and learning with these filters. Let’s fast-forward and (finally) introduce recent-gift data — the giving that all scored individuals have engaged in since September 1, the day after the models were supposedly created. This data appears in the Measures pane as a variable I’ll call ‘Sum of Giving’. I’m specifically interested in who has given at least $1,000 (cumulatively), so I will need to create a calculated field to flag these people.

  1. Right-click on Sum of Giving and select Create Calculated Field…
  2. Give the field a name. I called it “Leadership donor”.
  3. The field Sum of Giving is already in the expression window. Now you just need to add some text around it to complete the expression:
  4. Click OK. This creates a field (variable) with the value 1 for any donor who has given at the Leadership level, and nothing otherwise. Note that you can enter any amount in place of 999. If you want to count donors vs. non-donors, enter “>0”.
  5. The field appears in the Measures pane, because Tableau recognizes it as numeric. We’re using it as a categorical variable, so let’s convert it into a Dimension instead. Right-click on the field name and select “Convert to Dimension”, or simply drag the field into the Dimensions pane — both actions accomplish the same thing.

Now we have a flag we can use to zero in on our higher-end donors. Let’s create a new view for that. At the lower left of your screen, right-click on the tab for the existing view and select “Duplicate Sheet”. This will allow us to continue exploring the heat map without changing our original version. We could, of course, do all our work in a single view and use filters to dynamically alter the view — that’s one of the strengths of Tableau — but for now let’s keep our views separate.

  1. If you still have filters applied for Age or other variables, click on the quick filter menu and select “Clear Filter”. You can reapply it later if you want — we’re just getting it out of the way so we can see the full picture.
  2. Drag ‘Leadership donor’ to the Filters shelf.
  3. In the box that pops up, click “Select from list” on the General tab (it should already be selected), and then check the little box for ‘1’.
  4. Click OK.

The result looks like this. (Click for full size.)

Our big donors are clustered nicely down in the lower right corner, where both the MLR and the Logistic model scores are very high. Some of the lower-score bins contain zero Leadership donors, and Tableau has automatically hidden those rows and columns from view. Take a couple of minutes to study the map. Follow the three darkest squares (labeled 48, 74, and 23) as they form a 45-degree line up the centre of the map. If you compare the values in the squares that are directly opposite each other over this line, you’ll notice that there are slightly more Leadership donors on the upper side of the line. Those are donors who have higher Logistic scores than MLR scores. As well, notice that the scattered cloud of donors above the line is more extensive than that below the line. These observations should lead us to believe that the Logistic model performs slightly better than the MLR model.

That conclusion is a bit hasty, though. There might be more Leadership donors on the high-Logistic/low-MLR side simply because more alumni ended up in those squares in the first place. We need to calculate the PERCENTAGE of the population of each square that went on to become a Leadership donor. That’s right, we’re going to create a third view, and calculate percentages to plug into each square.
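Outside Tableau, the same percentage matrix is a couple of lines of pandas, continuing the earlier sketch (the Leadership flag here mirrors the calculated field created above). In Tableau, the steps below build it into a third view:

    import pandas as pd

    # Flag Leadership donors (more than $999 since Sept. 1), then show, for each
    # square of the heat map, the share of its population that became one.
    data["leadership"] = (data["Sum of Giving"] > 999).astype(int)
    pct = pd.crosstab(data["MLR_bin"], data["Logistic_bin"],
                      values=data["leadership"], aggfunc="mean") * 100
    print(pct.round(0))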

  1. Right-click on the tab for Sheet 2 and select Duplicate Sheet. (By the way, you can name these sheets whatever you want, just as in Excel.)
  2. Remove the filter for Leadership donor.
  3. Under Analysis in the top menu bar, select Create Calculated Field…
  4. Name the new field ‘Leadership percentage’.
  5. Enter this expression, which divides the number of Leadership donors by the total number of individuals.
  6. Click OK. The new field appears in the Measures pane, which is fine.
  7. Drag ‘Leadership percentage’ from Measures onto the Label shelf, replacing the count of ID.
  8. Drag ‘Leadership percentage’ from Measures again, this time onto the Color shelf.
  9. Right-click on any square in the map, and select Format…, which opens a formatting pane at the far left.
  10. On the Pane tab, in the Default section, click on the down-arrow to the right of “Numbers”, and select Percentage.

The result is below. (Click for full size.) You can select any precision for your percentages — I’ve rounded to whole numbers to avoid clutter.

The darkest square is a single donor with a very high MLR score but a very low Logistic score, who just happened to give at the Leadership level. That square is of course labeled 100%, which causes the rest of the display to be toned down to a degree that makes it hard to see the patterns. This single donor might be a person to look at more carefully, but for now, let’s exclude that person from the map. Hover your pointer over the square, and select Exclude from the tooltip box. (This creates a specific filter for this individual, which you can remove anytime.) All the squares are recoloured accordingly:

Now some of the darkest squares are also based on very sparse data. You can exclude any that you wish, but I’m fine with this display for now. For one thing, we can clearly see that having a Logistic score of 95 or higher is darn significant, regardless of what a donor’s MLR score is. For example, there are four Leadership donors who scored only 65-69 in the MLR model but have Logistic scores of 95-99, which is what we want to see. (Those donors are in the square labeled 14%.)

Being able to demonstrate that one model is superior is pretty nifty. But I am especially intrigued at how easy it is to see how the models might work together to improve accuracy.

Have a look at the square containing individuals who scored 100 in both models. There were 147 such individuals, and 48 of them gave $1,000 or greater — a whopping 32.6%. Here are a couple of facts to think about:

  • Of all the individuals who scored 100 in the Logistic model, 26.7% went on to give at the Leadership level.
  • Of all the individuals who scored 100 in the MLR model, 23.1% went on to give at the Leadership level.

Do you see what I’m getting at? When we combine both scores and zero in on people in the top percentile for both models, our yield of Leadership donors increases by nearly six percentage points over the best-performing model, to 32.6%.

The same boost is evident for other high-scoring cells in the heat map: The logistic model identifies some big donors that the MLR model misses, but the MLR model can enhance the accuracy of the logistic model. This is potentially useful for prospect identification in Major Giving, when we really want to be as focused as possible.

So far I’ve shown you only donor numbers. What about revenue? Our data set includes gift amounts, so let’s create a new view to visualize actual aggregate dollar totals.

  1. Duplicate the last sheet you created, and remove any filters that had been applied.
  2. Drag ‘Sum of Giving’ to the Label and Color shelves.
  3. Format the values as currency.
  4. For fun, change the color from green to red by clicking on Edit Colors in the context menu for the Sum of Giving card.

The result is pretty dramatic.

This is for all donors, not just Leadership donors, but if you want to narrow it down to Leadership donors only, re-apply your filter.

Just as with raw donor counts, the view above is a little misleading, simply because more prospects equals more donors, equals more dollars. So let’s create a calculated field to give us AVERAGE dollars per donor for every cell in the heat map.
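In the pandas version of this exercise, average dollars per donor per cell would look something like this, restricting to donors before averaging (again continuing the earlier sketch, with placeholder names):

    import pandas as pd

    donors = data[data["Sum of Giving"] > 0]
    avg_gift = pd.crosstab(donors["MLR_bin"], donors["Logistic_bin"],
                           values=donors["Sum of Giving"], aggfunc="mean")
    print(avg_gift.round(0))   # average gift per donor in each score-bin combination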

The individuals with scores of 100 in both models gave nearly $5,000 on average — no other cell comes close. But guess what’s even better:

  • The individuals who scored 100 in the Logistic model gave an average of $2,927.
  • The individuals who scored 100 in the MLR model gave an average of $2,971.

The models are strongest where they intersect!

I’ve spent a lot of time and more than 4,000 words explaining how to do this in Tableau. This is very unusual for me — why a specific product such as Tableau, when one can create heat maps even in Excel? *

  • It’s just so easy to do it in Tableau, and the result looks attractive without requiring the user to fuss with formatting.
  • The data can be refreshed whenever necessary. If you’re connecting to an Excel file, simply paste new data into the file and refresh the data extract. It’s that simple. (Remember to refresh the extract rather than replace the data source entirely, if you want to retain your aliases as you’ve defined them.)
  • That goes for refreshing the giving data, AND for loading a whole different set of individuals and scores. You don’t need to rebuild these views from scratch (although it’s pretty easy to do so).
  • Tableau allows you to dynamically filter the data any which way you want. It’s a great way to explore the data. In my example, it would have been really interesting to filter on donors who UPGRADED to the $1,000+ level. Which model did a better job predicting upgrading? I don’t know, but I’m going to find out.
  • You can drill down to the underlying data. If you want to see a list of the people who scored 100 in both models, just hover the pointer over that square and click on the data icon, then the ‘Underlying’ tab. Imagine having wealth/capacity scores on one axis, and propensity scores on the other …
  • I’ve shared my heat maps here as static images, but you can share your analyses as fully-functioning views, even with people who don’t have the software on their computers. Save it as a Packaged Workbook, and they’ll be able to open it in Tableau Reader (which they can download for free). They can use the filters you’ve set up to play with the data themselves.

This may be the longest CoolData post ever, but as usual I feel I am barely scratching the surface.

* P.S.: Heat maps are easily created in a combination of Data Desk and Excel. Without going into too much detail: In Data Desk use contingency tables (a.k.a. cross tabs) to create the basic matrix of numbers, with one score set as x and the other as y, and use derived variable expressions to limit the counts as desired. Copy and paste the table text into Excel, and use conditional formatting to create the desired shading. Unfortunately this requires some fussing and the result is static.

26 January 2012

More mistakes I’ve made

Filed under: Best practices, Peter Wylie, Pitfalls, Validation — kevinmacdonell @ 1:38 pm

A while back I wrote a couple of posts about mistakes I’ve made in data mining and predictive modelling. (See Four mistakes I have made and When your predictive model sucks.) Today I’m pleased to point out a brand new one.

The last days of work leading up to Christmas had me evaluating my new-donor acquisition models to see how well they’ve been working. Unfortunately, they were not working well. I had hoped — I had expected — to see newly-acquired donors clustered in the upper ranges of the decile scores I had created. Instead they were scattered all along the whole range. A solicitation conducted at random would have performed nearly as well.

Our mailing was restricted by score (roughly the top two deciles only), but our phone solicitation was broader, so donors came from the whole range of deciles:

Very disappointing. To tell the truth, I had seen this before: A model that does well predicting overall participation, but which fails to identify which non-donors are most likely to convert. I am well past the point of being impressed by a model that tells me what everyone already knows, i.e. that loyal donors are most likely to give again. I want to have confidence that acquisition mail dollars are spent wisely.

So it was back to the drawing board. I considered whether my model was suffering from overfit, whether perhaps I had too many variables, too much random noise, multicollinearity. I studied and rejected one possibility after another. After so much effort, I came rather close to concluding that new-donor acquisition is not just difficult — it might be darn near impossible.

Dire possibility indeed. If you can’t predict conversion, then why bother with any of this?

It was during a phone conversation with Peter Wylie that things suddenly became clear. He asked me one question: How did I define my dependent variable? I checked, and found that my DV was named “Recent Donors.” That’s all it took to find where I had gone wrong.

As the name of the DV suggested, it turned out that the model was trained on a binary variable that flagged anyone who had made a gift in the past two years. The problem was that included everybody: long-time donors and newly-acquired donors alike. The model was highly influenced by the regular donors, and the new donors were lost in the shuffle.

It was a classic case of failing to properly define the question. If my goal was to identify the patterns and characteristics of newly-acquired donors, then I should have limited my DV strictly to non-donors who had recently converted to donors!

So I rebuilt the model, using the same data file and variables I had used to build the original model. This time, however, I pared the sample down to alumni who had never given a cent before fiscal 2009. They were the only alumni I needed to have scores for. Then I redefined my dependent variable so that non-donors who converted, i.e., who made a gift in either fiscal 2009 or 2010, were coded ‘1’, and all others were coded ‘0’. (I used two years of giving data instead of just one in order to have a little more data available for defining the DV.) Finally, I output a new set of decile scores from a binary logistic regression.
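In pandas terms the corrected setup amounts to the following. The real work was done in Data Desk, and the file and column names are placeholders:

    import pandas as pd

    # Same data file used for the original model.
    alumni = pd.read_csv("acquisition_model_file.csv")

    # Keep only alumni who had never given a cent before fiscal 2009 ...
    never = alumni[alumni["giving_before_fy2009"] == 0].copy()

    # ... and define the DV as "converted": made a first gift in FY2009 or FY2010.
    never["converted"] = (
        (never["giving_fy2009"] > 0) | (never["giving_fy2010"] > 0)
    ).astype(int)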

A test of the new scores showed that the new model was a vast improvement over the original. How did I test this? Recall that I reused the same data file from the original model. Therefore, it contained no giving data from the current fiscal year; the model was innocent of any knowledge of the future. Compare this breakdown of new donors with the one above:

Much better. Not fan-flippin-tastic, but better.

My error was a basic one — I’ve even cautioned about it in previous posts. Maybe I’m stupid, or maybe I’m just human. But like anyone who works with data, I can figure out when I’m wrong. That’s a huge advantage.

  • Be skeptical about the quality of your work.
  • Evaluate the results of your decisions.
  • Admit your mistakes.
  • Document your mistakes and learn from them.
  • Stay humble.

5 April 2011

Validation after the fact

Filed under: Model building, Phonathon, regression, Validation — kevinmacdonell @ 8:11 am

Validation against a holdout sample allows us to pick the best model for predicting a behaviour of interest. (See Thoughts on model validation.) But I also like to do what I call “validation after the fact.” At the end of a fundraising period, I want to see how people who expressed that behaviour broke down by the score they’d been given.

This isn’t really validation, but if you create some charts from the results, it’s the best way to make the case to others that predictive modeling works. More importantly, doing so may provide insights into your data that will lead to improvements in your models in their next iterations.

This may be most applicable in Annual Fund, where the prospects solicited for a gift may come from a wide range of scores, allowing room for comparison. But my general rule is to compare each score level by ratios, not counts. For example, if I wanted to compare Phonathon prospects by propensity score, I would compare the percentage (ratio) of each score group contacted who made a pledge or gift, not the number of prospects who did so. Why? Because if I actually used the scores in solicitation, higher-scoring prospects will have received more solicitation attempts on average. I want results to show differences among scores, not among levels of intensity of solicitation.
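In code, the ratio comparison is a one-line groupby. A pandas sketch, assuming a placeholder DataFrame of Phonathon prospects with 0/1 columns for contacted and pledged:

    # Among prospects actually contacted, the share of each score level that pledged.
    contacted = prospects[prospects["contacted"] == 1]
    pledge_rate = contacted.groupby("score_decile")["pledged"].mean() * 100
    print(pledge_rate.round(1))   # pledge rate (%) by score level, comparable across levels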

So when the calling season ended recently, I evaluated my Phonathon model’s performance, but I didn’t study the one model in isolation: I compared it with a model that I initially rejected last year.

It sounds like I’m second-guessing myself. Didn’t I pick the very best model at the time? Yes, but … I would expect my chosen model to do the best job overall, but perhaps not for certain subgroups — donor types, degree types, or new grads. Each of these strata might have been better described by an alternative model. A year of actual results of fundraising gives me what I didn’t have last year: the largest validation sample possible.

My after-the-fact comparison was between a binary logistic regression model which I had rejected, and the multiple linear regression model which I actually used in Phonathon segmentation. As it turned out, the multiple linear regression model did prove the winner in most scenarios, which was reassuring. I will spare you numerous comparison charts, but I will show you one comparison where the rejected model emerged as superior.

Eight percent of never-donor new grads who were contacted made a pledge. (My definition of a new grad was any alum who graduated in 2008, 2009, or 2010.) The two charts below show how these first-time donors broke down by how they were scored in each model. Due to sparse data, I have left out score level 10.

Have a look, and then read what I’ve got to say.

Neither model did a fantastic job, but I think you’d agree that predicting participation for new grads who have never given before is not the easiest thing to do. In general, I am pleased to see that the higher end of the score spectrum delivered slightly higher rates of participation. I might not have been able to ask for more.

The charts appear similar at first glance, but look at the scale of the Y-axis: In the multiple linear regression model, the highest-scoring group (9, in this case) had a participation rate of only 12%, and strangely, the 6th decile had about the same rate. In the binary logistic regression model, however, the top scoring group reached above 16% participation, and no one else could touch them. The number of contacted new grads who scored 9 is roughly equal between the models, so it’s not a result based on relatively sparse data. The BLR model just did a better job.

There is something significantly different about either new grads, or about never-donors whom we wish to acquire as donors, or both. In fact, I think it’s both. Recall that I left the 10s out of the charts due to sparse data — very few young alumni can aspire to rank up there with older alumni using common measures of affinity. As well, when the dependent variable is Lifetime Giving, as opposed to a binary donor/nondonor state, young alumni are nearly invisible to the model, as they are almost by definition non-donors or at most fledgling donors.

My next logical step is a model dedicated solely to predicting acquisition among younger alumni. But my general point here is that digging up old alternative models and slicing up the pool of solicited prospects for patterns “after the fact” can lead to new insights and improvements.

22 March 2011

Thoughts on model validation

Filed under: Model building, Pitfalls, regression, Validation — kevinmacdonell @ 11:46 am

I have written very little about model validation on CoolData, probably because I’ve always had conflicting thoughts about it.

Model validation is important for avoiding serious error when predicting behaviours that are relatively rare, such as major or planned giving. But if the event is rare, the data is sparse. When I need to train the data on a relatively small number of cases, I am loath to rob my model of half the cases I need for training.

In an annual fund model it’s no big deal. If the predicted value is participation in the fund, that’s not a rare event and there is plenty of data. There’s no angst about going the prescribed route: splitting the file into two random halves before modeling begins — a training sample used to calculate the scores, and a validation sample on which to test the validity of those scores. The irony is, most annual fund models are robust and hardly require validation.

It’s exactly when doing the tricky modeling projects (major giving, planned giving) that validation is most important. If a major-giving model proves valid, that’s great — but it might have been even stronger had I kept the data file intact. If it proves faulty, how much of that can be blamed on the fact that half my cases are unavailable?

Validation of a model has very little to do with R squared, by the way. That statistic measures only how well a multiple linear regression model fits the data set you’re working with. It doesn’t tell you how well it will perform for prediction. A very high (60% or more) or very low (10% or less) value for R squared signals trouble with the design of your model; most models will fall somewhere between those extremes, regardless of their usefulness for prediction.

My thinking is this. Validating a model has one of two results: Either you accept the model, or you throw it out. Unless you’ve got other scoring or segmentation tricks up your sleeve, the alternative to having a predictive model is using nothing at all. How often does implementing a model prove worse than having no model? I am willing to wager: almost never. Even a score set that contains a certain amount of random noise is an improvement over not using any score. If you’re going to build a single model and use it come what may, flaws and all, why bother to validate it? Not to mention that there is no clear dividing line between a valid model and an invalid one.

The key is that I do not create one model, but several. Creating more than one model gives me options. I no longer have to compare the validity of my model against some arbitrary rule of thumb — I can compare multiple models against each other and choose the best one. Instead of splitting my data file in half and losing half of my precious training cases, I set aside a small but reasonably representative random sample of cases. This sample is held out of the model, but also scored, allowing me to compare the relative success of each model.
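In pandas-style pseudocode the approach looks something like this. The model fitting itself is left out, and every name is a placeholder:

    # Set aside a small, representative holdout; fit each competing model on the rest.
    holdout = data.sample(frac=0.05, random_state=1)
    training = data.drop(holdout.index)

    # ... fit model A and model B on "training", then attach their scores to "holdout" ...

    # Compare how the holdout cases that show the behaviour of interest are
    # distributed by each model's score. They should stack up at the top.
    positives = holdout[holdout["target"] == 1]
    for col in ["model_A_decile", "model_B_decile"]:
        print(col)
        print(positives[col].value_counts().sort_index())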

The benefits are twofold: One, I don’t lose nearly as many cases to the validation set, and two, I can still get some idea about how good the model is. If all the models are lousy, it’s going to show up in how the holdout cases are distributed by score.

Here is an example. Recently I needed to develop a model to predict propensity to give at higher levels. The table below summarizes how I set out to build two very different models.

Without going into too much detail, you can see that I could have chosen to build a number of different models by mixing and matching the criteria available to me:

  1. binary logistic regression or multiple linear regression?
  2. include everyone in the model (even non-donors), or a narrowed-down selection of individuals?
  3. binary dependent variable (set to represent any level of giving), or a continuous dependent variable (LT giving)?
  4. giving-related variables included or excluded?

However, in the interest of time and other practical concerns I chose only two contrasting options. As well, some choices are made for me: If I want to score non-donors in my model, I am obliged to leave out all giving-related variables, for example. (I’ve written plenty about these scenarios recently, so I won’t get into that again.)

The one thing held constant between the models I built was the holdout sample. I chose 10 current major donors at random and held them out of both models. The models were innocent of these donors’ status, but used their characteristics to assign scores to them anyway. So whichever model did a better job of giving high scores to this sample was the one I chose to use in our program.

Only ten individuals? Yes, that’s a tiny holdout sample, but when you’re trying to model for propensity to give at an exclusive level, you need to conserve your precious training cases. As it turned out, ten holdout cases was enough to reveal differences in reliability. I was surprised at which model ended up winning the trophy — but that’s a post for another time.
