Testing and picking out the best predictive model would seem to be a straightforward business: You set aside a certain number of cases as a holdout sample — donors, Planned Giving expectancies, or whatever you’re trying to predict — then compare how Model A scores that holdout set versus Model B. If Model A does a better job of concentrating your known cases in the very top percentile scores, then bingo, you’re done.
I’m thinking, “Not so fast.”
Every model fails, to some degree. It’s right there in the regression equation: Y is equal to a constant plus a series of predictor Xs (multiplied by calculated coefficients) … plus error. The difference between predicted values and actual values is error, and it’s unavoidable. No set of shiny, gold-plated independent variables will account for all the variation in Y.
The practical result is that at least a few of the people who are likely to give a gift this year are going to end up with very low predictive scores. It’s my opinion that, for some models at least, our definition of success ought to be simply minimizing the failure rate. Let me show you an example for which I think this is true.
In a previous post, I gave a brief and partial description of the difference between two very different model-building methods that are both called “regression”: multiple linear regression and binary logistic regression. Either one will give you a fine model in a variety of circumstances, so if you want to use the very best option, you’ll need to build both and compare the results.
My most recent model was built to predict propensity to give via phone solicitation. A primary goal this year is to improve participation, so I thought perhaps a logistic regression model with “Is a phone donor / Is not a phone donor” as the dependent variable would be just the ticket. But I wasn’t sure — multiple regression using “total giving by phone” was calling out to me as well. So a real test was required. Using mostly the same predictor variables, added in roughly the same order, I developed models using both types of regression.
Not only that, I developed two variations of each type. I was very uncertain about whether it would be valid practice to use “Total giving by mail” as a predictor for giving by phone. I knew this variable was strongly correlated with phone giving, but was it sufficiently independent of the DV? Every time I thought about it, I changed my mind. “They do seem to measure different things,” I said to myself one hour. “But I’m wrong to use the state of being a donor to predict becoming a donor,” I said to myself the next. Again, a test using a holdout sample was the only way to know: If the model scored known phone donors appropriately, I would feel less conflicted about degrees of independence.
A random sample of the population, totalling approximately one percent of living, addressable alumni, or 878 individuals, was set aside prior to building the model. (Technical note: In DataDesk, the random sample was chosen via the creation of a new 0/1 variable; the values were assigned using a Bernoulli distribution that was given a probability of success of 0.01.)
The number of known Phonathon donors captured in the random sample was 98. Although they were not part of the training set, everyone in the holdout sample received predictive scores, which makes these 98 donors a fair test of how well each model’s scores identify phone-receptive donors.
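For readers who don't use DataDesk, the same kind of 0/1 holdout flag can be sketched in Python. This is only an illustration of Bernoulli sampling, not the procedure actually used: the function name `bernoulli_holdout_flags`, the seed, and the population size of 87,800 (picked only so that one percent lands near 878) are all assumptions.

```python
import random

def bernoulli_holdout_flags(n_records, p=0.01, seed=42):
    """Return one 0/1 flag per record; each is 1 independently with probability p."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is repeatable
    return [1 if rng.random() < p else 0 for _ in range(n_records)]

# Hypothetical population size, chosen only so that 1% is close to 878.
flags = bernoulli_holdout_flags(87_800, p=0.01)
holdout_size = sum(flags)
print(f"Records flagged for the holdout sample: {holdout_size}")
```

Because each record is flagged independently, the holdout size is itself random: it will land near one percent of the population but not at an exact target, which is why the sample above is described as "approximately" one percent.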
As I said, there were effectively four models to compare:
- Multiple linear regression, using ‘Giving by Mail’ as a predictor
- Multiple linear regression, excluding ‘Giving by Mail’ as a predictor
- Binary logistic regression, using ‘Giving by Mail’ as a predictor
- Binary logistic regression, excluding ‘Giving by Mail’ as a predictor
The models were compared on three criteria:
- Number of holdout donors in the top quartile of scores. The higher the number, the better the model.
- Number of holdout donors in the top decile of scores. Again, the higher the better.
- Number of holdout donors in the bottom quartile of scores. This time, the lower the number, the better the model.
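The three criteria above can be computed with a few lines of Python. This is a minimal sketch under some assumptions: the quartile and decile cutoffs are taken from the scores of the whole scored population (rather than from the holdout sample alone), ties at a cutoff count as being in the higher tier, and the names `score_cutoff` and `holdout_criteria` are made up for illustration.

```python
def score_cutoff(all_scores, top_percent):
    """Lowest score that still falls within the top `top_percent` of all scores."""
    ranked = sorted(all_scores, reverse=True)
    k = max(1, len(ranked) * top_percent // 100)  # how many records are in that tier
    return ranked[k - 1]

def holdout_criteria(all_scores, donor_scores):
    """Count known holdout donors in the top quartile, top decile and bottom quartile."""
    top_quartile_cut = score_cutoff(all_scores, 25)
    top_decile_cut = score_cutoff(all_scores, 10)
    bottom_quartile_cut = score_cutoff(all_scores, 75)  # scores below this are in the bottom 25%
    return {
        "top_quartile": sum(s >= top_quartile_cut for s in donor_scores),
        "top_decile": sum(s >= top_decile_cut for s in donor_scores),
        "bottom_quartile": sum(s < bottom_quartile_cut for s in donor_scores),
    }

# Toy data: 100 records scored 1..100, five of them known donors.
toy_scores = list(range(1, 101))
toy_donor_scores = [95, 80, 60, 40, 10]
print(holdout_criteria(toy_scores, toy_donor_scores))
```

Running the same function once per model, on that model's scores, gives the numbers to compare side by side.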
The model that best manages to concentrate donors in the top score tiers will maximize efficiency in donor acquisition — this is the attribute that the first two criteria will zero in on. On the other hand, a model that incorrectly places holdout donors in the bottom quartile of alumni, where they risk not being solicited, costs the program donors and dollars — the third criterion helps us identify which model commits the most errors.
The table below summarizes results of the comparison. Results in yellow indicate the “winner” of each test.
What does this tell us? I’d be happy to use any of these models in my phone program. But a small difference in these test numbers could translate into material gains (or losses) when applied to the entire population.
First of all, it’s clear that the gain in predictive power from adding “Mail Giving” is substantial. That means it comes down to choosing between the two models that include it: Model 2 (multiple linear regression) or Model 4 (binary logistic regression).
The fourth model appears to be most effective at identifying donors, taking the prize for concentrating the most known phone donors in both the top quartile and top decile. However, I have chosen Model number two — the multiple linear regression model.
Model 2 commits the fewest errors, while also doing a very acceptable job of concentrating donors in the top quartile and top decile.
Although the logistic models are better at concentrating donors in the upper score levels, it must be remembered that the Phonathon program will reach deep into the prospect pool, well below the top quartile of propensity scores. The real risk is in scoring donors so low that they are placed out of the reach of solicitation.
These numbers appear small, but I’m thinking that a difference of two overlooked donors in a hundred becomes significant when generalized to the entire prospect pool.
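The generalization is simple arithmetic, sketched below. It assumes the holdout sample is a fair one percent slice of the population, so each holdout case stands in for about a hundred people; the function name `scale_to_population` is made up for illustration, and the result is a rough estimate, since a count based on 98 donors carries a good deal of sampling noise.

```python
def scale_to_population(holdout_count, holdout_fraction):
    """Scale a count seen in a random holdout sample up to the full population."""
    return holdout_count / holdout_fraction

HOLDOUT_FRACTION = 0.01   # the one percent sample set aside before modeling
HOLDOUT_DONORS = 98       # known phone donors in the holdout sample
EXTRA_MISSES = 2          # "two overlooked donors in a hundred"

est_donors = scale_to_population(HOLDOUT_DONORS, HOLDOUT_FRACTION)
est_extra_missed = scale_to_population(EXTRA_MISSES, HOLDOUT_FRACTION)
print(f"Estimated phone donors in the full population: {est_donors:.0f}")
print(f"Estimated additional donors scored into the bottom quartile: {est_extra_missed:.0f}")
```

Under these assumptions, two extra misses among 98 holdout donors works out to something on the order of 200 donors out of roughly 9,800 across the whole prospect pool.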
This raises the question: When IS the success rate more important than the failure rate? My answer is, when solicitation will reach only the very top scorers. Think of Planned Giving, and Major Giving. When only the top few percentiles are “of interest,” choose the model that concentrates your holdout donors there. Yes, you will still have error, but I’m not sure there’s much you can do about that.
As a postscript, here are four charts showing the distribution of the 98 holdout phone donors by score decile, one chart for each model tested. Note how the addition of “Mail giving” as a predictor makes the distribution “unstable”. Does anyone have an explanation for this?
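For anyone who wants to build the same kind of chart for their own models, here is a rough sketch of the tally behind it. It assumes deciles are defined by rank within the whole scored population, with decile 1 holding the lowest scores and decile 10 the highest; the function name `donors_by_decile` and the toy data are made up for illustration and are not the figures from the charts.

```python
from bisect import bisect_left
from collections import Counter

def donors_by_decile(all_scores, donor_scores):
    """Count donors in each score decile (1 = lowest scores, 10 = highest)."""
    ranked = sorted(all_scores)
    n = len(ranked)
    counts = Counter()
    for s in donor_scores:
        rank = bisect_left(ranked, s)          # number of records scoring below s
        decile = min(10, rank * 10 // n + 1)   # map the rank onto deciles 1..10
        counts[decile] += 1
    return [counts[d] for d in range(1, 11)]

# Toy data: 100 records scored 1..100, five of them known donors.
toy_scores = list(range(1, 101))
toy_donor_scores = [95, 80, 60, 40, 10]
decile_counts = donors_by_decile(toy_scores, toy_donor_scores)

# A crude text chart: one row per decile, one '#' per donor.
for decile, count in enumerate(decile_counts, start=1):
    print(f"Decile {decile:2d}: {'#' * count}")
```

Running this once per model and comparing the shapes makes it easy to see whether donors pile up smoothly toward the top deciles or jump around from one decile to the next.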