CoolData blog

20 August 2012

Logistic regression vs. multiple regression

Filed under: John Sammis, Model building, Peter Wylie, predictive modeling, regression, Statistics — kevinmacdonell @ 5:13 am

by Peter Wylie, John Sammis and Kevin MacDonell

(Click to download printer-friendly PDF: Logistic vs MR-Wylie Sammis MacDonell)

The three of us talk about this issue a lot because we encounter a number of situations in our work where we need to choose between these two techniques. Many of our late night/early morning phone/internet discussions have been gobbled up by talking about which technique seems to be better under what circumstances. More than a few times, I’ve suggested we write something up about our experience with both techniques. In the end we’ve always decided to put off doing that because … well, because we’ve thought it might put a lot of people to sleep. Disagree as we might about lots of things, we’re of one mind on the dictum: “Don’t bore people.” They have enough tedious stuff in their lives; we don’t need to add to their burden.

On the other hand, as analytics has started to sink its teeth more and more into the world of advancement, it seems there is a group of folks out there who wrestle with the same issue. And the issue seems to be this:

“If I have a binary dependent variable (e.g., major giver/ non major giver, volunteer/non-volunteer, reunion attender/non-reunion attender, etc.), which technique should I use? Logistic regression or multiple regression?”

We considered a number of ways to try to answer this question:

  • We could simply assert an opinion based on our bank of experience with both techniques.
  • We could show you the results of a number of data sets using both techniques and then offer our opinion.
  • We could show you a way to compare both techniques using some of your own data.

We chose the third option because we think there is no better way to learn about a statistical technique than by using the technique on real data. Whenever we’ve done this sort of exploring ourselves, we’ve been humbled by how much we’ve learned.

Before we show you a way to compare the two techniques, we’ll offer some thoughts on why this question (“Should I use logistic regression or multiple regression?”) is so tough to find an answer to. If you’re anxious to move on to our comparison process, you can skip this section. But we hope you don’t.

Why This Is Not an Easy Question to Find an Answer To

We see at least two reasons why this is so:

  • Multiple regression has lived in the neighborhood a long time; logistic regression is a new kid on the block.
  • The articles and books we’ve read on comparisons of the two techniques are hard to understand.

Multiple regression is a longtime resident; logistic regression is a new kid on the block.

When World War II came along, there was a pressing need for rapid ways to assess the potential of young men (and some women) for the critical jobs that the military services were trying to fill. It was in this flurry of preparation that multiple regression began to see a great deal of practical application by behavioral scientists who had left their academic jobs and joined up for the duration. The theory behind multiple regression had been worked out much earlier in the century by geniuses like Ronald Fisher, Karl Pearson, and Harold Hotelling. But the method did not get much use until the war effort necessitated that use. The computational effort involved was just too forbidding.

Logistic regression is a different story. From the reading we’ve done, logistic regression got its early practical use in the world of medicine where biostatisticians were trying to predict binary outcomes like survived/did not survive, contracted disease/did not contract disease, had a coronary event/did not have a coronary event, and the like. It’s only been within the last fifteen or twenty years that logistic regression has found its way into the parlance of statisticians in the behavioral sciences.

These two paragraphs are a long way around to saying that, in our opinion, logistic regression is nowhere near as well vetted as multiple regression by people like us in advancement who are interested in predicting behavior, especially giving behavior.

The articles and books we’ve read on comparisons of the two techniques are hard to understand.

Since I (Peter) was pushing to do this piece, John and I decided it would be my responsibility to do some searching of the more recent literature on logistic regression as it relates to the substance of this project.

To start off, I reread portions of texts I have accumulated over the years that focus on multiple regression as a general data analytic technique. Each text has a section on logistic regression. As I waded back into these sections, I asked myself: “Is what I’m reading here going to enlighten more than confuse the folks we have in mind for this piece?” Without exception, my answer was, “Nope, just the reverse.” There was altogether too much focus on complicated equations and theory and nowhere near enough emphasis on the practical use of logistic regression. (This, in spite of the fact that each text had an introduction assuring us the book would go light on math and heavy on application.)

Then, using my trusty iPad, I set about seeing what I could find on the web. Not surprisingly, I found a ton of articles (and even some full-length books) that had found their way into the public domain. I downloaded a bunch of them to read whenever I could find enough time to dig into them. I’m sorry to report that each time I’d give one of these things a try, I would hear my father’s voice (dad graduated third in his class in engineering school) as he paged through my own science and math texts when I was in college: “They oughta teach the clowns who wrote these things to write in plain English.” (I always tried to use such comments as excuses for bad grades. Never worked.)

Levity aside, it is hard to find clearly written articles or books on the use of logistic versus multiple regression in the behavioral sciences. I think it’s a bad situation that needs fixing, but that fixing won’t occur anytime soon. On the other hand, I think dad was right not to let me off easy for giving up on badly written material. And you shouldn’t let my pessimism dissuade you from trying out some of these same articles and books. (If enough of you are interested, perhaps Kevin and John and I can put together a list of suggested readings.)

A Way to Compare Logistic Regression with Multiple Regression

As promised, we’ll take you through a set of steps you can use with some of your own data:

  1. Pick a binary dependent variable and a set of predictors.
  2. Compute a predicted probability value for every record in your sample using both multiple regression and logistic regression.
  3. Draw three random subsamples of 20 records each from the total sample so that each subsample includes the predicted multiple regression probability value and the predicted logistic regression probability value for every record.
  4. Display each subsample of these records in a table and a graph.
  5. Do an eyeball comparison of the probability values in both the tables and the graphs.

1. Pick a binary dependent variable and a set of predictors.

For this example, we used a private four-year institution with about 13,000 solicitable alums. Here are the variables we chose:

Dependent variable. Each alum who had given $31 or more lifetime was defined as 1; all others who had given less than that amount were defined as 0. There were 6,293 0’s and 6,204 1’s. Just about an even fifty/fifty split.

Predictor variables:

[Table of predictor variables]
Why did we use ID number as one of the predictors? Over the years we’ve found that many schools use all-numeric ID numbers. When these numbers are entered into a regression analysis, they often work as predictors. More importantly, they help to create very granular predicted scores that can easily be binned into equal-size groups.

2. Compute a predicted probability value for every record in your sample using both multiple regression and logistic regression.

This is where things start to get a bit technical and where a little background reading on both multiple regression and logistic regression wouldn’t hurt. Again, most of the material you’ll find will be tough to decipher. Here we’ll keep it as simple as we can.

For both techniques the predicted value you want to generate is a probability, a number that varies between 0 and 1.  In this example, that value will represent the probability that a record has given $31 or more lifetime to the college.

Now here’s the rub: the logistic regression model will always generate a probability value that varies between 0 and 1. However, the multiple regression model will almost always generate a value that varies between something less than 0 (a negative number) and a number greater than 1. In fact, in this example the range of probability values for the logistic regression model extends from .037 to .948. The range of probability values for the multiple regression model extends from -.122 to 1.003.

(By the way, this is why so many statisticians advise the use of logistic regression over multiple regression when the dependent variable is binary. In essence they are saying, “A probability value can’t exceed 1 nor can it be less than 0. Since multiple regression often yields values less than 0 and greater than 1, use logistic regression.” To be fair, we’re exaggerating a bit, but not very much.)
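To make step 2 concrete, here is a minimal sketch in Python (scikit-learn) of generating both sets of predicted values. The column names are hypothetical stand-ins for your own data, and this is an illustration of the idea, not necessarily the software we ourselves use:

    import pandas as pd
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Assume df has been loaded (e.g., pd.read_csv), one row per alum.
    # 'gave_31_plus' is the 0/1 dependent variable; predictors are hypothetical.
    X = df[["class_year", "married", "id_number"]]
    y = df["gave_31_plus"]

    mlr = LinearRegression().fit(X, y)
    logistic = LogisticRegression(max_iter=1000).fit(X, y)

    # One predicted probability per record for each technique
    df["p_mlr"] = mlr.predict(X)                        # can stray outside 0-1
    df["p_logistic"] = logistic.predict_proba(X)[:, 1]  # always inside 0-1

    print(df["p_mlr"].min(), df["p_mlr"].max())
    print(df["p_logistic"].min(), df["p_logistic"].max())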

3. Draw three random subsamples of 20 records each from the total sample so that each subsample includes the predicted multiple regression probability value and the predicted logistic regression probability value for all 20 records.

The size and number of these subsamples is, of course, arbitrary. We decided that three subsamples were better than two and that four or more would be overkill. Twenty records, as you’ll see a bit further on, is a number that allows you to see patterns in a table or graph without overcrowding the picture.
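Continuing the Python sketch above, step 3 is a one-liner per subsample:

    # Three random subsamples of 20 records, keeping both probability columns
    subsamples = [
        df.sample(n=20, random_state=seed)[["p_mlr", "p_logistic"]]
        for seed in (1, 2, 3)
    ]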

4. Display each subsample of these records in a table and a graph.

Tables 1-3 and Figures 1-3 below show how we took this step for our example. To make sure we’re being clear, let’s go through some of the details in Table 1 and Figure 1 (which we constructed for the first subsample of twenty randomly drawn records).

In Table 1 the probability values for multiple regression for each record are displayed in the left-hand column. The corresponding probability values for the same records for logistic regression are displayed in the right-hand column. For example, the multiple regression probability for the first record is .078827109. The record’s logistic regression probability is .098107437. In plain English, that means the multiple regression model for this example is saying that this particular alum has about eight chances in a hundred of giving $31 or more lifetime. The logistic regression model is saying that the same alum has about ten chances in a hundred of giving $31 or more lifetime.

Table 1: Predicted Probability Values Generated from Using Multiple Regression and Logistic Regression for the First of Three Randomly Drawn Subsamples of 20 Records

Figure 1 shows the pairs of values you see in Table 1 displayed graphically in a scatterplot. You’ll notice that the points in the scatterplot appear to fall along what roughly looks like a straight line. This means that the multiple regression model and the logistic regression model are assigning very similar probabilities to each of the 20 records in the subsample. If you study Table 1, you can see this trend, but the trend is much easier to discern in the scatterplot.
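If you are following along in Python, a scatterplot like Figure 1 takes only a few lines of matplotlib (the dashed 45-degree line marks perfect agreement between the two techniques):

    import matplotlib.pyplot as plt

    sub = subsamples[0]  # the first randomly drawn subsample
    plt.scatter(sub["p_mlr"], sub["p_logistic"])
    plt.plot([0, 1], [0, 1], linestyle="--")  # perfect-agreement reference line
    plt.xlabel("Multiple regression probability")
    plt.ylabel("Logistic regression probability")
    plt.show()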

Table 2: Predicted Probability Values Generated from Using Multiple Regression and Logistic Regression for the Second of Three Randomly Drawn Subsamples of 20 Records

Table 3: Predicted Probability Values Generated from Using Multiple Regression and Logistic Regression for the Third of Three Randomly Drawn Subsamples of 20 Records


5. Do an eyeball comparison of the probability values in both the tables and the graphs.

We’ve already done such a comparison in Table 1 and Figure 1. If we do the same comparison for Tables 2 and 3 and for Figures 2 and 3, it’s pretty clear that we’ll come to the same conclusion: Multiple regression and logistic regression (for this example) are giving us very similar answers.

So Where Does This All Take Us?

We’d like to cover several topics in this closing section:

  • A frequent objection to using multiple regression versus logistic regression when the dependent variable is binary
  • Trying our approach on your own
  • The conclusion we think you’ll eventually arrive at
  • How we’ve just scratched the surface here

A frequent objection to using multiple regression versus logistic regression when the dependent variable is binary

Earlier we said that many statisticians seem to advise the use of logistic regression over multiple regression by invoking this logic: “A probability value can’t exceed 1 nor can it be less than 0. Since multiple regression often yields values less than 0 and greater than 1, use logistic regression.” We also said we were exaggerating the stance of these statisticians a bit (but not very much).

While we can understand this argument, our feeling is that, in the applied fields we toil in, that argument is not a very practical one. In fact, a seasoned statistics professor we know says (in effect): “What’s the big deal? If multiple regression yields any predicted values less than 0, consider them 0. If multiple regression yields any values greater than 1, consider them 1. End of story.” We agree.
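In code, the professor’s advice amounts to a single line (continuing our hypothetical Python sketch from earlier):

    # Truncate multiple regression "probabilities" to the 0-1 range
    df["p_mlr"] = df["p_mlr"].clip(lower=0, upper=1)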

Trying our approach on your own

In this piece we’ve shown the results of one comparison between multiple and logistic regression on one set of data. It’s clear that the results we got for the two techniques were very similar. But does that mean we’d get such similar results with other examples? Not necessarily.

So here’s what we’d recommend. Try doing your own comparisons of the two techniques with:

  • Different data sets. If you’re a higher education institution, you might pick a couple of data sets, one for alums who’ve been out for more than 25 years and one for folks who’ve been out less than 10 years. If you’re a non-profit, you can use a set of members from the west coast and one from the east coast.
  • Different variables. Try different binary dependent variables like those we mentioned earlier: major giver/non major giver, volunteer/non-volunteer, reunion attender/non-reunion attender, etc. And try different predictors. Try to mix categorical variables like marital status with quantitative variables like age. If you’re comfortable with more sophisticated stats, try throwing in cross products and exponential terms.
  • Different splits in the dependent variable. In our example the dependent variable was almost an exact 50/50 split. Since the underlying variable we used was quantitative (lifetime giving), we could have adjusted those splits in a number of ways: 60/40, 75/25, 80/20, 95/5, and on and on the list could go. Had we tried these different kinds of splits, would we have gotten the same kinds of results for the two techniques? Since we actually did look at different splits like these, we can report that the results for both techniques were pretty much the same. But that’s for this example. That could change with a different data set and different variables. (One way to construct such splits is sketched just after this list.)
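When the underlying variable is quantitative, a split like 80/20 is a one-liner in Python (with a hypothetical lifetime giving column):

    # An 80/20 split: code the top 20% of lifetime givers as 1, the rest as 0
    cutoff = df["lifetime_giving"].quantile(0.80)
    df["dv_80_20"] = (df["lifetime_giving"] >= cutoff).astype(int)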

The conclusion we think you’ll eventually arrive at

We’re very serious about having you compare multiple regression and logistic regression on a variety of data sets with a variety of variables and with different splits in the dependent variable. If you do, you’ll learn a ton. Guaranteed.

On the other hand, if we put ourselves in your shoes, it’s easy to imagine your saying, “Come on guys. I’m not gonna do that. Just tell me what you think about which technique is better when the dependent variable is binary. Pick a winner.”

Given our experience, we can’t pick a winner. In fact, if pushed, we’re inclined to opt in favor of multiple regression for a couple of reasons. It not only seems to perform about as well as logistic regression, but more importantly (with the stats software we use) multiple regression is simply faster and easier to use than logistic regression. But we still use logistic regression for models with binary dependent variables. And we continue to compare its efficacy against multiple regression when we can. And we rarely see a meaningful difference between the results.

Why do we still use both modeling techniques? Because we think taking a hard and fast stance when you’re doing applied science is not a good idea. Too easy to end up with egg on your face. Our best advice is to use whichever method is most familiar and readily available to you.

As always, we welcome your comments and reactions. Maybe even more so with this one.


12 July 2012

Evaluate models with fresh data using Tableau heat maps

When I build predictive models, I normally don’t build just one for each purpose. Presumably the model is going to be used, so I want it to be the best one possible. Yes, I test the model scores against a holdout data sample, but if I built only one model, I wouldn’t have anything solid on which to base my evaluation of the results. I might reject a lone model if it truly failed against the validation set, but that has never happened to me — even a lackluster performance can be better than nothing; the model is flawed, but useful. That statement is true of models in general, so testing results with nothing to compare against is pointless.

I usually produce one multiple linear regression model and one binary logistic regression model using the stats software package Data Desk. Many permutations are possible, though: The sample to be scored can be limited in various ways, and the dependent variable can be formulated any number of ways. The choice of technique (for me, one type of regression or another) is usually determined by the nature of the DV (though not always). Given unlimited time, I would produce multiple models, but doing two at a time is manageable and keeps the task of comparison simple. The model that does the best job of classifying the members of the holdout sample wins the prize, and the loser is discarded.

But there’s a problem. I’ve never had a model bomb when it comes to scoring the validation set, but I HAVE had models fail after deployment. Data that is held out for validation of the model is one thing — the real world outside the model can be a whole OTHER thing. Logically it should not be so: If the model doesn’t “know” anything about the holdout data, then you’d think its performance on it would indicate how it will perform in the future.

Not so. At least, not always.

I am not so quick, then, to discard the loser. I like to evaluate both models on fresh data as it comes in (new gifts, for example). The loser might be the better choice overall, or it might turn out that a combination of the two models performs better than one on its own. Maybe one model works better for a subset of the population (young alumni, say), which suggests that adding interaction terms or even using a multiple-model approach is something to consider in the future. If the models predict slightly different propensities (as a result of how the DVs were formulated), with both of them contributing to a desirable result, then it might be worthwhile to keep both score sets by multiplying them together.

I don’t have an extended period of time for such testing — the model needs to be put into operation before it gets stale. Unfortunately, evaluation has always been a cumbersome process. I need to query the database for fresh results (conversions, upgrades, new planned giving expectancies — whatever), then match them up by ID and score for each model (scores for untested models are not going to be in the database, obviously), and then produce some charts in Excel to visualize and compare results. It’s not a ton of work, but it takes just long enough to prevent me from doing it more than once before it’s time to commit. Even if I am evaluating the models after the fact, in order to learn for the next iteration of model-building, it’s not an exercise I will want to carry out repeatedly.

There is a better way. Think reports.

What does a report do? A report pulls real-time (or nightly-refreshed) data and assembles it in an interpretable way in a tabular or visual display. It performs this service on a regular or semi-regular basis, or on-demand. (Yeah, okay, maybe I should have said an ideal report). If part of your job consists in report preparation as well as predictive modeling, then you should be building model scores into your reports.

Here’s a tutorial on how to use Tableau to easily create a report that compares the performance of two sets of model scores in a single visualization called a heat map. This visualization can be refreshed with live data as often as desired. If you want, you can add other fields (age, sex, degree, donor status, etc.) and easily filter the data to see how model performance differs depending on the composition of the population. Note that this is probably not a report you’ll be sharing with your vice president. It does look cool, but it is mainly a diagnostic and exploration tool for your own use. The small initial investment of time is worth it if you build multiple models — it can be reused again and again.

This tutorial assumes you’re already somewhat familiar with the basics of Tableau. If you don’t have the software, and you don’t want to download a free trial, stick around anyway — other software packages offer ways to create heat maps, and the basic idea is the same.

In this example, I am comparing percentile scores from two models I developed to predict which alumni are most likely to give at least $1,000 in the current fiscal year. One is a multiple linear regression model with a dependent variable defined as the sum of giving for the past five years (log-transformed). The other is a logistic regression model with a binary dependent variable defined as ‘has giving of at least $1,000 in any one of the past three years’. The exact definitions of the DVs are reasonable but somewhat arbitrary. They are closely related, but different. The techniques and the predictor variables are also different, so we should expect the models to yield different results. Tested against the validation set (which was the same for both models), the logistic model proved superior. But only a test on new gift data will be truly convincing.

I want to take the entire population of alumni whom I have scored (a sample of about 27,000 individuals), and match them up with what they have given since the model was created. In this made-up example, let’s suppose I created my models last August, and I want to see what those 27,000 alumni have given since the day I completed the work. In reality, I would have chosen a winning model months ago and this would be an after-the-fact analysis, but I am doing this in order to enrich the visualization for the purposes of this example. (Cheating, in other words.)

Tableau allows you to combine data from multiple sources. In this case, you will connect to an Excel file to get your model scores (since they’re not in the database), and then connect to your database for giving results since September 1. If you do not connect directly to your database from Tableau, then you can paste your gifts data into a second sheet in your Excel workbook and extract the data via a single connection to that file — no problem. The first worksheet will have three columns: One for unique ID, and one each for the scores from the two models. In this example, the scores were output from Data Desk as percentiles. If you want, you can add columns for key attributes such as age, sex and so on. The second worksheet (or the custom SQL that retrieves data directly from your data warehouse) will provide ID and sum of giving since September 1.

Normally in report creation, Tableau handles all the aggregation of the data — the input is raw transaction data, with each ID potentially appearing on multiple rows. In this example, however, we have aggregated the data already (summing giving by ID), and there is only one row of data for each ID. It doesn’t matter, but it might have implications for some of the specific steps that follow.

You should refer to your Tableau references for connecting to data sources. All I will add is that when you add the table (or worksheet) that contains the giving data, be sure to left-join on ID, because obviously not everyone you have scored has given since Sept. 1. From here on in, I will use Tableau terminology that won’t make any sense if you don’t know the software (specifically, Tableau Desktop version 7.0). Let’s build our first view:

  1. If your data has been extracted correctly, ‘ID’ will be listed under Dimensions, and your two model score sets will be listed under Measures. In this example, I will from now on refer to them as MLR (for Multiple Linear Regression) and Logistic. Obviously I’m referring to my own data — just try to replicate what I’m talking about using your own data file.
  2. For now, pause Auto Updates (or turn off automatic updates).
  3. Right-click on Logistic and select “Create bins …” This will bin the percentile score into whatever size we desire. Change the default bin size to 5 and click OK. Note that a new variable is created in the Dimensions pane, because bins are categorical, not numerical.
  4. Right-click on MLR and do the same thing.
  5. Drag Logistic (bins) to the Columns shelf. Drag MLR (bins) to the Rows shelf.
  6. Drag ID to the Text shelf. Click on the down-arrow of the ID pill you’ve just created, and select Measure –> Count. This will create a count of all IDs that fall into each cell. It turns green to indicate it’s now a measure instead of a dimension. (Because each ID appears in our data only once, it doesn’t matter whether we use Count or Count Distinct.)
  7. Change the Marks type from Automatic to Square (right above the Text shelf). Notice that the Text shelf suddenly turns into a Label shelf — each square of the heat map will be labeled with the number of IDs.
  8. Drag ID from the Dimensions pane again, and this time drop it onto the Color shelf.
  9. Click on the down-arrow of the ID pill you’ve just created, and select Measure –> Count. This will base the color or shading of the cell on the number of IDs that fall into that cell.

The top left corner of your screen will look like this:

Now we’re ready to allow the view to automatically update. The result won’t look much like a heat map: Probably just a bunch of little squares with numbers beside them. We need to enlarge the squares. Under the Size shelf is a slider: Move this to the centre of the size range. Then drag one of the rows in the view to make it taller — hover over the axis for MLR (on the far left) until the pointer turns into an up-and-down arrow, then click and drag. When you let go, the squares will resize and the alleys of white space should start to close up. Keep messing with it until the squares touch on all sides. With a little formatting of labels for readability, the final product will look something like this. (Click on thumbnail image for full size.)

A heat map can convey a lot of information at a glance. You can immediately see where a lot of individuals are concentrated: They’re in the darkest squares. The numbers are hard to read, but up in the top left of the map, we see that the number of people who fall into the 0-4 bin in both the MLR and Logistic models is 572. In the lower right area of the map, we see that 563 people fell into the 95 to 99 bin in both models. Notice that Tableau didn’t bin evenly: Every single bin has 5 score levels in it except for the bin labeled 100, which contains only individuals with a score of 100. In the map, we see that 147 people scored exactly 100 in both models. This can be corrected (using a calculated field instead of automatic binning), but I have decided to leave it the way it is. Due to the nature of this modeling exercise, I am mainly interested in the top few percentile scores anyway, and the 100 group is of particular interest. Having them mapped separately from the rest is not a problem.
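As an aside: if you don’t have Tableau (as I said earlier, other packages can do this too), here is a rough pandas/matplotlib sketch of the same heat map. The column names are hypothetical, and like the Tableau version it keeps 100 as its own bin:

    import pandas as pd
    import matplotlib.pyplot as plt

    # scores: DataFrame with 'mlr' and 'logistic' percentile columns (0-100)
    edges = list(range(0, 101, 5)) + [101]  # 0-4, 5-9, ..., 95-99, and 100 alone
    labels = [str(e) for e in edges[:-1]]
    scores["mlr_bin"] = pd.cut(scores["mlr"], bins=edges, right=False, labels=labels)
    scores["log_bin"] = pd.cut(scores["logistic"], bins=edges, right=False, labels=labels)

    # Cross-tabulate the two bin sets, keeping empty bins so the grid is complete
    counts = pd.crosstab(scores["mlr_bin"], scores["log_bin"])
    counts = counts.reindex(index=labels, columns=labels, fill_value=0)

    plt.imshow(counts, cmap="Greens", origin="lower")
    plt.xticks(range(len(labels)), labels, rotation=90)
    plt.yticks(range(len(labels)), labels)
    plt.xlabel("Logistic (bins)")
    plt.ylabel("MLR (bins)")
    plt.colorbar(label="Count of IDs")
    plt.show()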

The names of the bins don’t reflect what they include. For example, “90” really means “90 to 94”. You can rename them using aliases. Right-click on Logistic in the Dimensions pane, select Field Properties –> Aliases…, and change the displayed values in the Values column. Do the same for MLR.

We haven’t looked at the recent-gift data yet, but before we move on, what can we learn from this view? It appears the models agree on the individuals with extremely high or extremely low scores. In the middle range, there is still a lot of agreement but also many more cases of divergence, in which an individual scores high in one model but low in the other. This is clear, at-a-glance evidence that our models are similar but different. Depending on the application, choosing one model over the other could have a big effect on the result, for better or worse. In this particular application, where I am interested mainly in very high-scoring alumni, it may not make that much difference at all … but let’s not jump to that conclusion just yet.

If your data set included some key grouping information such as age or sex, it might be interesting to create a filter to examine whether the models differ on those factors. Here’s an example with ‘Age’:

  1. Drag Age from the Measures pane into the Filters shelf.
  2. When Tableau asks you how you want to filter on Age, select “All Values” and click Next.
  3. On the next box, select Range of Values, and click OK.
  4. Hover over the green Age pill on the filters shelf, click the down-arrow on the right end of the pill, and select Show Quick Filter.

Now you can set the upper and lower age bounds of the individuals you want to be counted in the heat map. As you slide the scale, it will display Age with numbers after the decimal, even though your values are all whole numbers. If this bothers you, right-click on Age in the Measures pane, select Field Properties –> Number Format…, and click on Number (Custom). Adjust the number of decimal places to zero. Here’s what the quick filter looks like:

The next two images show the heat map for different age ranges. The first one is ages 20 to 50, the second is 51 to 80. Again, click on the thumbnails for full-size images — although the beauty of a heat map is that you can see the pattern from a distance.

Right off the bat, it’s evident that it’s harder for younger individuals to get a high score, but they fare better in the MLR model than they do in the Logistic model. Imagine a 45-degree line sloping from the top left corner to the bottom right corner — the presence of more dark-shaded squares under that line indicates individuals with higher MLR scores than Logistic scores. The logistic model, on the other hand, slightly favours older alumni. This alone might explain why the Logistic model outperformed the MLR model on the validation set. The difference might be due to how age-related variables were introduced to each model as predictors; they may have been more influential in one than the other. It’s hard to say without going back to the models themselves for a close look.

One can spend a lot of time playing and learning with these filters. Let’s fast-forward and (finally) introduce recent-gift data — the giving that all scored individuals have engaged in since September 1, the day after the models were supposedly created. This data appears in the Measures pane as a variable I’ll call ‘Sum of Giving’. I’m specifically interested in who has given at least $1,000 (cumulatively), so I will need to create a calculated field to flag these people.

  1. Right-click on Sum of Giving and select Create Calculated Field…
  2. Give the field a name. I called it “Leadership donor”.
  3. The field Sum of Giving is already in the expression window. Now you just need to add some text around it to complete the expression (a plausible version appears just after this list):
  4. Click OK. This creates a field (variable) with the value 1 for any donor who has given at the Leadership level, and nothing otherwise. Note that you can enter any amount in place of 999. If you want to count donors vs. non-donors, enter “>0”.
  5. The field appears in the Measures pane, because Tableau recognizes it as numeric. We’re using it as a categorical variable, so let’s convert it into a Dimension instead. Right-click on the field name and select “Convert to Dimension”, or simply drag the field into the Dimensions pane — both actions accomplish the same thing.
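(The screenshot of the finished expression isn’t reproduced here, but based on the description in steps 3 and 4, it would be something close to this in Tableau’s calculation language:

    IF [Sum of Giving] > 999 THEN 1 END

With no ELSE clause, any record that fails the test gets a null, which is what gives us the 1-or-nothing flag described above.)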

Now we have a flag we can use to zero in on our higher-end donors. Let’s create a new view for that. At the lower left of your screen, right-click on the tab for the existing view and select “Duplicate Sheet”. This will allow us to continue exploring the heat map without changing our original version. We could, of course, do all our work in a single view and use filters to dynamically alter the view — that’s one of the strengths of Tableau — but for now let’s keep our views separate.

  1. If you still have filters applied for Age or other variables, click on the quick filter menu and select “Clear Filter”. You can reapply it later if you want — we’re just getting it out of the way so we can see the full picture.
  2. Drag ‘Leadership donor’ to the Filters shelf.
  3. In the box that pops up, click “Select from list” on the General tab (it should already be selected), and then check the little box for ‘1’.
  4. Click OK.

The result looks like this. (Click for full size.)

Our big donors are clustered nicely down in the lower right corner, where both the MLR and the Logistic model scores are very high. Some of the lower-score bins contain zero Leadership donors, and Tableau has automatically hidden those rows and columns from view. Take a couple of minutes to study the map. Follow the three darkest squares (labeled 48, 74, and 23) as they form a 45-degree line up the centre of the map. If you compare the values in the squares that are directly opposite each other over this line, you’ll notice that there are slightly more Leadership donors on the upper side of the line. Those are donors who have higher Logistic scores than MLR scores. As well, notice that the scattered cloud of donors above the line is more extensive than that below the line. These observations should lead us to believe that the Logistic model performs slightly better than the MLR model.

That conclusion is a bit hasty, though. There might be more Leadership donors on the high-Logistic/low-MLR side simply because more alumni ended up in those squares in the first place. We need to calculate the PERCENTAGE of the population of each square that went on to become a Leadership donor. That’s right, we’re going to create a third view, and calculate percentages to plug into each square.

  1. Right-click on the tab for Sheet 2 and select Duplicate Sheet. (By the way, you can name these sheets whatever you want, just as in Excel.)
  2. Remove the filter for Leadership donor.
  3. Under Analysis in the top menu bar, select Create Calculated Field…
  4. Name the new field ‘Leadership percentage’.
  5. Enter this expression, which divides the number of Leadership donors by the total number of individuals (a plausible version of the expression follows this list).
  6. Click OK. The new field appears in the Measures pane, which is fine.
  7. Drag ‘Leadership percentage’ from Measures onto the Label shelf, replacing the count of ID.
  8. Drag ‘Leadership percentage’ from Measures again, this time onto the Color shelf.
  9. Right-click on any square in the map, and select Format…, which opens a formatting pane at the far left.
  10. On the Pane tab, in the Default section, click on the down-arrow to the right of “Numbers”, and select Percentage.
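(Again, the original expression is shown as an image; based on the description in step 5, it would be something like:

    SUM([Leadership donor]) / COUNT([ID])

Because the Leadership donor field is 1 or null, its SUM within a cell is simply a count of Leadership donors.)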

The result is below. (Click for full size.) You can select any precision for your percentages — I’ve rounded to whole numbers to avoid clutter.

The darkest square is a single donor with a very high MLR score but a very low Logistic score, who just happened to give at the Leadership level. That square is of course labeled 100%, which causes the rest of the display to be toned down to a degree that makes it hard to see the patterns. This single donor might be a person to look at more carefully, but for now, let’s exclude that person from the map. Hover your pointer over the square, and select Exclude from the tooltip box. (This creates a specific filter for this individual, which you can remove anytime.) All the squares are recoloured accordingly:

Now some of the darkest squares are also based on very sparse data. You can exclude any that you wish, but I’m fine with this display for now. For one thing, we can clearly see that having a Logistic score of 95 or higher is darn significant, regardless of what a donor’s MLR score is. For example, there are four Leadership donors who scored only 65-69 in the MLR model but have Logistic scores of 95-99, which is what we want to see. (Those donors are in the square labeled 14%.)

Being able to demonstrate that one model is superior is pretty nifty. But I am especially intrigued at how easy it is to see how the models might work together to improve accuracy.

Have a look at the square containing individuals who scored 100 in both models. There were 147 such individuals, and 48 of them gave $1,000 or greater — a whopping 32.6%. Here are a couple of facts to think about:

  • Of all the individuals who scored 100 in the Logistic model, 26.7% went on to give at the Leadership level.
  • Of all the individuals who scored 100 in the MLR model, 23.1% went on to give at the Leadership level.

Do you see what I’m getting at? When we combine both scores and zero in on people in the top percentile for both models, our yield of Leadership donors increases by nearly six percentage points over the best-performing model, to 32.6%.

The same boost is evident for other high-scoring cells in the heat map: The logistic model identifies some big donors that the MLR model misses, but the MLR model can enhance the accuracy of the logistic model. This is potentially useful for prospect identification in Major Giving, when we really want to be as focused as possible.

So far I’ve shown you only donor numbers. What about revenue? Our data set includes gift amounts, so let’s create a new view to visualize actual aggregate dollar totals.

  1. Duplicate the last sheet you created, and remove any filters that had been applied.
  2. Drag ‘Sum of Giving’ to the Label and Color shelves.
  3. Format the values as currency.
  4. For fun, change the color from green to red by clicking on Edit Colors in the context menu for the Sum of Giving card.

The result is pretty dramatic.

This is for all donors, not just Leadership donors, but if you want to narrow it down to Leadership donors only, re-apply your filter.

Just as with raw donor counts, the view above is a little misleading, simply because more prospects equals more donors, equals more dollars. So let’s create a calculated field to give us AVERAGE dollars per donor for every cell in the heat map.
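The calculated field isn’t spelled out here, but one plausible Tableau expression for average dollars per donor (counting only IDs with any giving in the denominator) is:

    SUM([Sum of Giving]) / COUNT(IF [Sum of Giving] > 0 THEN [ID] END)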

The individuals with scores of 100 in both models gave nearly $5,000 on average — no other cell comes close. But guess what’s even better:

  • The individuals who scored 100 in the Logistic model gave an average of $2,927.
  • The individuals who scored 100 in the MLR model gave an average of $2,971.

The models are strongest where they intersect!

I’ve spent a lot of time and more than 4,000 words explaining how to do this in Tableau. This is very unusual for me — why a specific product such as Tableau, when one can create heat maps even in Excel? *

  • It’s just so easy to do it in Tableau, and the result looks attractive without requiring the user to fuss with formatting.
  • The data can be refreshed whenever necessary. If you’re connecting to an Excel file, simply paste new data into the file and refresh the data extract. It’s that simple. (Remember to refresh the extract rather than replace the data source entirely, if you want to retain your aliases as you’ve defined them.)
  • That goes for refreshing the giving data, AND for loading a whole different set of individuals and scores. You don’t need to rebuild these views from scratch (although it’s pretty easy to do so).
  • Tableau allows you to dynamically filter the data any which way you want. It’s a great way to explore the data. In my example, it would have been really interesting to filter on donors who UPGRADED to the $1,000+ level. Which model did a better job predicting upgrading? I don’t know, but I’m going to find out.
  • You can drill down to the underlying data. If you want to see a list of the people who scored 100 in both models, just hover the pointer over that square and click on the data icon, then the ‘Underlying’ tab. Imagine having wealth/capacity scores on one axis, and propensity scores on the other …
  • I’ve shared my heat maps here as static images, but you can share your analyses as fully-functioning views, even with people who don’t have the software on their computers. Save it as a Packaged Workbook, and they’ll be able to open it in Tableau Reader (which they can download for free). They can use the filters you’ve set up to play with the data themselves.

This may be the longest CoolData post ever, but as usual I feel I am barely scratching the surface.

* P.S.: Heat maps are easily created in a combination of Data Desk and Excel. Without going into too much detail: In Data Desk use contingency tables (a.k.a. cross tabs) to create the basic matrix of numbers, with one score set as x and the other as y, and use derived variable expressions to limit the counts as desired. Copy and paste the table text into Excel, and use conditional formatting to create the desired shading. Unfortunately this requires some fussing and the result is static.

18 April 2012

Stepwise, model-foolish?

Filed under: Model building, Pitfalls, regression, Software, Statistics — Tags: , — kevinmacdonell @ 8:00 am

My approach to building predictive models using multiple linear regression might seem plodding to some. I add predictor variables to the regression one by one, instead of using stepwise methods. Even though the number of predictor variables I use has greatly increased, and the time needed to build a model has lengthened, I am even less likely to use stepwise regression today than I was a few years ago.

Stepwise regression, available in most stats software, tosses all the predictor variables into the analysis at once and picks the best for you. It’s a semi-automated process that can work forwards or backwards, adding or deleting variables until it’s satisfied a statistical rule of thumb. The software should give you some control over the process, but mostly your computer is making all the big decisions.

I understand the allure. We’re all looking for ways to save time, and generally anything that automates a repetitive process is a good thing. Given a hundred variables to choose from, I wouldn’t be surprised if my software was able to get a better-fitting model than I could produce on my own.

But in this case, it’s not for me.

Building a decent model isn’t just about getting a good fit in terms of high R square. That statistic tells you how well the model fits the data that the model was built on — not data the model hasn’t yet seen, which is where the model does its work (or doesn’t). The true worth of the model is revealed only over time, but you’re more likely to succeed if you’ve applied your knowledge and judgement to variable selection. I tend to add variables one by one in order of their Pearson correlation with the target variable, but I am also aware of groups of variables that are highly correlated with each other and likely to cause issues. The process is not so repetitive that it can always be automated. Stepwise regression is more apt to select a lot of trivial variables with overlapping effects and ignore a significant predictor that I know will do the job better.

Or so I suspect. My avoidance of stepwise regression has always been due to a vague antipathy rather than anything based on sound technical concerns. This collection of thoughts I came across recently lent some justification to this undefined feeling: Problems with stepwise regression. Some of the authors’ concerns are indeed technical, but the ones that resonated the most for me boiled down to this: Automated variable selection divorces the modeler from the process so that he or she is less likely to learn things about their data. It’s just not as much fun when you’re not making the selections yourself, and you’re not getting a feel for the relationships in your data.

Stepwise regression may hold appeal for beginning modellers, especially those looking for push-button results. I can’t deny that software for predictive analysis is getting better and better at automating some of the most tedious aspects of model-building, particularly in preparing and cleaning the data. But for any modeller, especially one working with unfamiliar data, nothing beats adding and removing variables one at a time, by hand.

15 July 2011

Answering questions about “How many times to keep calling”

Filed under: Annual Giving, John Sammis, Model building, Peter Wylie, Phonathon, regression — kevinmacdonell @ 8:27 am

The recent discussion paper on Phonathon call attempts by Peter Wylie and John Sammis elicited a lot of response. There were positive responses. (“Well, that’s one of the best things I’ve seen in a while. I’m a datahead. I admit it. Thank you for sharing this.”) There were also many questions, maybe even a little skepticism. I will address some of those questions today.

Question: You discuss modeling to determine the optimum number of times to call prospects, but what about the cost of calling them?

A couple of readers wanted to know why we didn’t pay any attention to the cost of solicitation, and therefore return on investment. Wouldn’t it make sense to cut off calling a segment once “profitability” reached some unacceptably low point?

I agree that cost is important. Unfortunately, cost accounting can be complicated even within the bounds of a single program, let alone compared across institutions. In my own program, money for student wages comes from one source, money for technology and software support comes from another, while regular expenses such as phone and network charges are part of my own budget. If I cannot realize efficiencies in my spending and reallocate dollars to other areas, does it make sense to include them in my cost accounting? I’m not sure.

And is it really a matter of money? I would argue that the budget determines how many weeks of calling are possible. Therefore, the limiting factor is actually TIME. Many (most?) phone programs do little more than call as many people as possible in the time available. They call with no regard for prospects’ probability of giving (aside from favouring LYBUNTs), spreading their limited resources evenly over all prospects — that is, suboptimally.

The first step, then, is to spend more time calling prospects who are likely to answer the phone, and less time calling prospects who aren’t. ROI is important, but if you’re not segmenting properly then you’re always going to end up simultaneously giving up on high-value prospects prematurely AND hanging on to low-value prospects beyond the limit of profitability.

Wylie and Sammis’s paper provides insight into a way we might intelligently manage our programs, mainly by showing a way to focus limited resources, and more generally by encouraging us to make use of the trove of data generated by automated calling software. Savvy annual fund folks who really have a handle on costs and want to delve into ROI as well should step up and do so — we’d love to see that study. (Although, I have to say, I’m not holding my breath.)

Question: Which automated calling software did these schools use?

The data samples were obtained from three schools who use the software of a single vendor, and participants were invited via the vendor’s client listserv. The product is called CampusCall, by RuffaloCODY. Therefore the primary audience of this paper could assume that Wylie and Sammis were addressing auto dialers and not predictive dialers or manual programs. This is not an endorsement of the product — any automated calling software should provide the ability to export data suitable for analysis.

By the way, manual calling programs can also benefit from data mining. There may be less call-result data to feed back into the modeling process than there would be in an automated system, but there is no reason why modeling cannot be used to segment intelligently in a manual program.

If you have a manual program and you’re calling tens of thousands of alumni — consider automating. Seriously.

Question: What do some of these “call result” categories mean?

At the beginning of the study, all the various codes for call results were divided into two categories, ‘contact made’ and ‘contact not made’. Some readers were curious about what some of the codes meant. Here are some of the codes whose meanings are not obvious. None of these are contacts.

  • Re-assigned: The phone number has been reassigned to a new person. The residents at this phone number do not know the prospect you are attempting to reach.
  • FAX2: The call went to a fax, modem or data line for the second time — this code removes the number from further calling.
  • Hung up: This is technically a contact, but often the caller doesn’t know whether it was the prospect who answered or someone else in the household, and often the phone is hung up before the caller can introduce him/herself. In those cases the encounter doesn’t meet the definition of a contact, which is an actual conversation with the prospect, so we didn’t count these as contacts.
  • Call back2: The prospect or someone else in the household asks to be called back some other time, but if this was the last result code, no future attempt was made. Not a contact.
  • NAO: Not Available One Hour. The prospect can’t come to the phone, call back in an hour — but obviously the callback didn’t happen, because NAO is still the last result.

Question: Why did you include disconnects and wrong numbers in your analysis? Wouldn’t you stop calling them (presumably after the first attempt), regardless of what their model score was? A controlled experiment would seem to call for leaving them out, and your results might be less impressive if you did so.

Good point. When a phone number proves invalid (as opposed to simply going to an answering machine or ringing without an answer), there’s no judgement possible about whether to place one more call to that number. Regardless of the affinity score, you’re done with that alum.

If we conducted a new study, perhaps we would exclude bad phone numbers. It’s my opinion that rerunning the analysis would be more of a refinement on what we’ve learned here, rather than uncovering something new. I think it’s up to the people who use this data in their programs to take this new idea and mine their own data in the light of it — and yes, refine it as well.

This was not a controlled experiment, by the way. This was a data-mining exploration which revealed a useful insight which, the authors hope, will lead to others digging into their own call centre data. True controlled experiments are hard to do — but wouldn’t it be great if fundraisers would collaborate with the experts in statistics and experimental design teaching on their own campuses?

Question: What modeling methods did you use? Did you compare models?

The paper made reference to multiple linear regression, which implies that the dependent variable is continuous. The reader wanted to know if the modeling method was actually logistic regression, or if two or more models were created and compared against a holdout sample.

The outcome variable was in fact a binary variable, “contact made”. Every prospect could have only two states (contacted / not contacted), because each person can be contacted only once. The result of a contact might be a pledge, no pledge, maybe, or “do not call” — but in any case, the result is binary.

(Only one model was created and there was no validation set, because this was more of an exploration to discover whether doing so could yield a model with practical uses, rather than a model built to be employed in a program.)

Although the DV was binary, the authors used multiple regression. A comparison of the two methods would be interesting, but Wylie and Sammis have found that when the splits for the dependent variable get close to 50/50 (as was the case here), multiple linear regression and logistic regression yield pretty much the same results. In the software package they use, multiple regression happens to be far more flexible than logistic, changes in the fit of the model as predictors are swapped in and out are more evident, and the output variable is easier to interpret.

Where the authors find logistic regression is superior to multiple regression is in building acquisition or planned giving models where the 0/1 splits are very asymmetric.

Question: Why did you choose to train the model on contacts made instead of pledges made?

Modeling on “contact made” instead of on “pledge made” is a bit novel. But that’s the point. The sticking point for Phonathon programs these days is simply getting someone to pick up the phone. If that’s the business problem to be solved, then (as the truism in data mining goes), that’s how the model should be focused. We see the act of answering the phone as a behaviour distinct from actually making a pledge. Obviously, they are related. But someone who picks up the phone this year and says “no” is still a better prospect in the long run than someone who never answers the call. A truly full-bodied segmentation for Phonathon would score prospects on both propensity to answer the phone and propensity to give — perhaps in a matrix, or using a multiplied score composed of both components.
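As a quick sketch of that last idea in Python, with hypothetical percentile-score columns from the two models:

    # Combine a contact-propensity score with a giving-propensity score
    df["combined"] = df["answer_score"] * df["give_score"]

    # Re-express the product as a percentile for easy segmentation
    df["combined_pctile"] = df["combined"].rank(pct=True).mul(100).round()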

Question: I don’t understand how you decided which years to include in the class year deciles. Was it only dividing into equal portions? That doesn’t seem right.

Yes, all the alumni in the sample were divided into ten roughly equal groups (deciles) in order by class year. There was no need to make a decision about whether to include a particular year in one decile or the other: The stats software made that determination simply by making the ten groups as equal as possible.

The point of that exercise was to see whether there was any general (linear) trend related to the age of alumni. In the study, the trend was not a straight line, but it was close enough to work well in the model — in general, the likelihood of answering the phone increases with age. Dividing the class years into deciles is not strictly necessary — it was done simply to make the relationship easier to find and explain. In practice, class year (or age) would be more likely to be placed into the regression analysis as-is, not as deciles.

BUT, Peter Wylie notes that the questioner has a point. Chopping ‘class year’ into deciles might not be the best option. For example, he says, take the first decile (the oldest alums) and the tenth decile (the youngest alums): “The range for the former can easily be from 1930-1968, while the range for the latter is more likely to be 2006-2011. The old group is very heterogeneous and the young group is very homogeneous. From the standpoint of clearly seeing non-linearity in the relationship between how long people have been out of school and giving, it would be better to divide the entire group up into five-year intervals.” The numbers of alumni in the intervals will vary hugely, but it also might become more apparent that the variable will need to be transformed (by squaring or cubing perhaps) before placing it into the regression.
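For what it’s worth, the two ways of chopping up class year look like this in pandas (hypothetical column name, and not the software used in the study):

    import pandas as pd

    # Ten roughly equal-count groups, as in the study; duplicates="drop"
    # guards against repeated bin edges when class years are heavily tied
    df["cy_decile"] = pd.qcut(df["class_year"], q=10, labels=False, duplicates="drop")

    # The alternative: fixed five-year intervals (equal width, unequal counts)
    df["cy_5yr"] = pd.cut(df["class_year"], bins=range(1930, 2016, 5), right=False)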

Another question about class year came from a reader at an institution that is only 20 years old. He wanted to know if he could even use Class Year as a predictor. Yes, he can, even if it has a restricted range — it might still yield a roughly linear trend. There is no requirement to chop it into deciles.

A final word

The authors had hoped to hear from folks who write about the annual fund all the time (but never mention data driven decision making), or from the vendors of automated calling software themselves. Both seem especially qualified to speak on this topic. But so far, nothing.

5 April 2011

Validation after the fact

Filed under: Model building, Phonathon, regression, Validation — kevinmacdonell @ 8:11 am

Validation against a holdout sample allows us to pick the best model for predicting a behaviour of interest. (See Thoughts on model validation.) But I also like to do what I call “validation after the fact.” At the end of a fundraising period, I want to see how people who expressed that behaviour broke down by the score they’d been given.

This isn’t really validation, but if you create some charts from the results, it’s the best way to make the case to others that predictive modeling works. More importantly, doing so may provide insights into your data that will lead to improvements in your models in their next iterations.

This may be most applicable in the Annual Fund, where the prospects solicited for a gift come from a wide range of scores, allowing room for comparison. My general rule is to compare each score level by ratios, not counts. For example, if I wanted to compare Phonathon prospects by propensity score, I would compare the percentage (ratio) of each score group contacted who made a pledge or gift, not the number of prospects who did so. Why? Because if the scores actually drove solicitation, higher-scoring prospects will have received more solicitation attempts on average. I want results to show differences among scores, not among levels of intensity of solicitation.
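In code, that comparison is straightforward. A sketch, assuming a results file with one row per contacted prospect (the column names are invented):

    # Pledge rate (a ratio), not pledge count, for each score level.
    import pandas as pd

    results = pd.DataFrame({
        "score":   [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],   # toy data: score level
        "pledged": [0, 0, 0, 1, 0, 1, 0, 1, 1, 0],   # 1 = made a pledge or gift
    })

    by_score = results.groupby("score")["pledged"].agg(
        contacted="count", pledges="sum", pledge_rate="mean")
    print(by_score)   # compare pledge_rate across levels, not pledges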

So when the calling season ended recently, I evaluated my Phonathon model’s performance, but I didn’t study the one model in isolation: I compared it with a model that I initially rejected last year.

It sounds like I’m second-guessing myself. Didn’t I pick the very best model at the time? Yes, but … I would expect my chosen model to do the best job overall, but perhaps not for certain subgroups — donor types, degree types, or new grads. Each of these strata might have been better described by an alternative model. A year of actual results of fundraising gives me what I didn’t have last year: the largest validation sample possible.

My after-the-fact comparison was between a binary logistic regression model which I had rejected, and the multiple linear regression model which I actually used in Phonathon segmentation. As it turned out, the multiple linear regression model did prove the winner in most scenarios, which was reassuring. I will spare you numerous comparison charts, but I will show you one comparison where the rejected model emerged as superior.

Eight percent of never-donor new grads who were contacted made a pledge. (My definition of a new grad was any alum who graduated in 2008, 2009, or 2010.) The two charts below show how these first-time donors broke down by how they were scored in each model. Due to sparse data, I have left out score level 10.

Have a look, and then read what I’ve got to say.

Neither model did a fantastic job, but I think you'd agree that predicting participation for new grads who have never given before is not the easiest thing to do. In general, I am pleased to see that the higher end of the score spectrum delivered slightly higher rates of participation. That may be all I could have asked for.

The charts appear similar at first glance, but look at the scale of the Y-axis: In the multiple linear regression model, the highest-scoring group (9, in this case) had a participation rate of only 12%, and strangely, the 6th decile had about the same rate. In the binary logistic regression model, however, the top scoring group reached above 16% participation, and no one else could touch them. The number of contacted new grads who scored 9 is roughly equal between the models, so it’s not a result based on relatively sparse data. The BLR model just did a better job.

There is something significantly different about either new grads, or about never-donors whom we wish to acquire as donors, or both. In fact, I think it’s both. Recall that I left the 10s out of the charts due to sparse data — very few young alumni can aspire to rank up there with older alumni using common measures of affinity. As well, when the dependent variable is Lifetime Giving, as opposed to a binary donor/nondonor state, young alumni are nearly invisible to the model, as they are almost by definition non-donors or at most fledgling donors.

My next logical step is a model dedicated solely to predicting acquisition among younger alumni. But my general point here is that digging up old alternative models and slicing up the pool of solicited prospects for patterns “after the fact” can lead to new insights and improvements.

22 March 2011

Thoughts on model validation

Filed under: Model building, Pitfalls, regression, Validation — kevinmacdonell @ 11:46 am

I have written very little about model validation on CoolData, probably because I’ve always had conflicting thoughts about it.

Model validation is important for avoiding serious error when predicting behaviours that are relatively rare, such as major or planned giving. But if the event is rare, the data is sparse. When I need to train a model on a relatively small number of cases, I am loath to rob it of half the cases it needs for training.

In an annual fund model it’s no big deal. If the predicted value is participation in the fund, that’s not a rare event and there is plenty of data. There’s no angst about going the prescribed route: splitting the file into two random halves before modeling begins — a training sample used to calculate the scores, and a validation sample on which to test the validity of those scores. The irony is, most annual fund models are robust and hardly require validation.

It’s exactly when doing the tricky modeling projects (major giving, planned giving) that validation is most important. If a major-giving model proves valid, that’s great — but it might have been even stronger had I kept the data file intact. If it proves faulty, how much of that can be blamed on the fact that half my cases are unavailable?

Validation of a model has very little to do with R squared, by the way. That statistic measures only how well a multiple linear regression model fits the data set you're working with; it doesn't tell you how well the model will perform for prediction. A very high (60% or more) or very low (10% or less) value for R squared signals trouble with the design of your model; most models will fall into the safe zone between those extremes, regardless of their usefulness for prediction.
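A quick demonstration of the distinction, in Python with made-up data: a model can post a flattering R squared on the data it was fit to and a much humbler one on cases it has never seen.

    # R squared measures fit to the data at hand, not predictive power.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 30))                # many predictors, few cases
    y = X[:, 0] + rng.normal(scale=2, size=200)   # only one predictor matters

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("R squared, training data:", round(r2_score(y_train, model.predict(X_train)), 3))
    print("R squared, holdout data: ", round(r2_score(y_test, model.predict(X_test)), 3))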

My thinking is this. Validating a model has one of two results: Either you accept the model, or you throw it out. Unless you’ve got other scoring or segmentation tricks up your sleeve, the alternative to having a predictive model is using nothing at all. How often does implementing a model prove worse than having no model? I am willing to wager: almost never. Even a score set that contains a certain amount of random noise is an improvement over not using any score. If you’re going to build a single model and use it come what may, flaws and all, why bother to validate it? Not to mention that there is no clear dividing line between a valid model and an invalid one.

The key is that I do not create one model, but several. Creating more than one model gives me options. I no longer have to compare the validity of my model against some arbitrary rule of thumb — I can compare multiple models against each other and choose the best one. Instead of splitting my data file in half and losing half of my precious training cases, I set aside a small but reasonably representative random sample of cases. This sample is held out of the model, but also scored, allowing me to compare the relative success of each model.

The benefits are twofold: One, I don’t lose nearly as many cases to the validation set, and two, I can still get some idea about how good the model is. If all the models are lousy, it’s going to show up in how the holdout cases are distributed by score.

Here is an example. Recently I needed to develop a model to predict propensity to give at higher levels. The table below summarizes how I set out to build two very different models.

Without going into too much detail, you can see that I could have chosen to build a number of different models by mixing and matching the criteria available to me:

  1. binary logistic regression or multiple linear regression?
  2. include everyone in the model (even non-donors), or a narrowed-down selection of individuals?
  3. binary dependent variable (set to represent any level of giving), or a continuous dependent variable (lifetime giving)?
  4. giving-related variables included or excluded?

However, in the interest of time and other practical concerns I chose only two contrasting options. As well, some choices are made for me: If I want to score non-donors in my model, I am obliged to leave out all giving-related variables, for example. (I’ve written plenty about these scenarios recently, so I won’t get into that again.)

The one thing held constant between the models I built was the holdout sample. I chose 10 current major donors at random and held them out of both models. The models were innocent of these donors’ status, but used their characteristics to assign scores to them anyway. So whichever model did a better job of giving high scores to this sample was the one I chose to use in our program.

Only ten individuals? Yes, that’s a tiny holdout sample, but when you’re trying to model for propensity to give at an exclusive level, you need to conserve your precious training cases. As it turned out, ten holdout cases was enough to reveal differences in reliability. I was surprised at which model ended up winning the trophy — but that’s a post for another time.
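For the curious, the mechanics of that comparison can be sketched in a few lines of Python. Everything below is illustrative only (random numbers stand in for real model output); the idea is simply to ask where each model places the known major donors relative to the whole scored pool.

    # Score the holdout major donors with each model, then ask: on average,
    # what percentile of the full pool do they land in?
    import numpy as np
    from scipy.stats import percentileofscore

    def mean_percentile(pool_scores, holdout_scores):
        """Average percentile rank of the holdout donors within the pool."""
        return np.mean([percentileofscore(pool_scores, s) for s in holdout_scores])

    rng = np.random.default_rng(3)
    pool_a, pool_b = rng.normal(size=5000), rng.normal(size=5000)  # whole file, per model
    holdout_a = rng.normal(loc=0.8, size=10)   # the 10 held-out donors, model A
    holdout_b = rng.normal(loc=1.5, size=10)   # ... and model B

    print("Model A:", round(mean_percentile(pool_a, holdout_a), 1))
    print("Model B:", round(mean_percentile(pool_b, holdout_b), 1))
    # Whichever model pushes the known major donors toward the top wins.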
