# CoolData blog

## 7 September 2010

### Making the dependent variable more dependable

Filed under: Model building, regression — kevinmacdonell @ 10:54 am

## Guest post by Kate Chamberlin and Michelle Paladino, Office of Development, Memorial Sloan-Kettering Cancer Center, New York NY


It’s an old statistics joke that when building a predictive model, you spend almost all of the time slaving over the data, only so that at the end of the slog, you get to press a button for the actual fun part. Cleaning data, imputing missing values, and restructuring, along with ceaseless contemplation of new and improved independent variables, are how we expend much of our energy, and rightly so. However, the most important variable doesn’t always get the attention it deserves.

All predictive models center on a particular population, the dependent variable – so giving the target a little extra TLC goes a long way. It’s a balancing act between size and purity. The larger the target size, the more statistical reliability you have, but the more precise the target definition, the better you are able to isolate the behavior that you’re trying to predict. For example, corporations, foundations, and estates behave differently than individuals. Therefore, if your goal is to find individuals who have a high likelihood of making a major gift, clearing out those estates, foundations and corporations, even if they have given the target amount, will lead to a more trustworthy dependent variable.
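As a rough illustration of that kind of cleanup, here is a minimal Python sketch. The record layout and field names (`entity_type`, `largest_gift`) are hypothetical, and dropping non-individuals from the modeling population altogether (rather than merely coding them 0) is an assumption, not something prescribed above.

```python
# Hypothetical donor records; field names are assumptions for illustration.
donors = [
    {"id": 1, "entity_type": "individual", "largest_gift": 75_000},
    {"id": 2, "entity_type": "foundation", "largest_gift": 250_000},
    {"id": 3, "entity_type": "individual", "largest_gift": 1_200},
    {"id": 4, "entity_type": "estate", "largest_gift": 60_000},
    {"id": 5, "entity_type": "individual", "largest_gift": 50_000},
]

MAJOR_GIFT_CUTOFF = 50_000


def build_target(records, cutoff=MAJOR_GIFT_CUTOFF):
    """Return {donor id: 0 or 1} for individuals only.

    Estates, foundations, and corporations are skipped even when
    their giving meets the cutoff.
    """
    target = {}
    for record in records:
        if record["entity_type"] != "individual":
            continue
        target[record["id"]] = 1 if record["largest_gift"] >= cutoff else 0
    return target


target = build_target(donors)
print(target)
```

The foundation and the estate both cleared the cutoff, yet neither appears in the resulting target.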

And what is the best target cutoff amount for a major gift, anyway? In our Development Office, the Major Gifts program starts at \$50,000. A binary dependent variable with the 1’s defined as individuals who have made a \$50,000+ gift is perfectly reasonable and works just fine, but is this cutoff meaningful from a donor’s perspective? And what about timing – do major donors of long ago look the same as those who have given more recently? We have yet to find the definitive answers, but checking to see if the independent variable distributions change dramatically with different targets and running models with a few flavors of target populations is a good way to evaluate if these changes make a difference.

Another method that can help you more clearly define the dependent variable is to consider to which donors you will be applying your model scores. For instance, if you work in a strictly donor database, as we do, and you are modeling for major donors, it is a good idea to exclude from your target those who came onto the file at the major giving target amount. In other words, remove the individuals whose first gift was \$50,000+ because if the scores will be applied to donors who are giving below the target right now, then your dependent variable should only include a population that gave below the target level and then jumped up to the target amount.
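One way to express this rule is sketched below, assuming each donor's gifts are available in chronological order. The data is made up, and removing first-gift major donors from the modeling population entirely is one possible reading of "exclude from your target".

```python
CUTOFF = 50_000

# Hypothetical gift histories, oldest gift first.
gift_history = {
    "A": [100, 500, 60_000],  # grew into a major donor
    "B": [75_000],            # first gift was already at the major level
    "C": [250, 300],          # never reached the cutoff
    "D": [50_000, 1_000],     # first gift exactly at the cutoff
}


def classify(gifts, cutoff=CUTOFF):
    """Return 1 (target), 0 (non-target), or None (excluded)."""
    if gifts[0] >= cutoff:
        return None  # came onto the file at the major giving amount
    return 1 if max(gifts) >= cutoff else 0


labels = {donor: classify(gifts) for donor, gifts in gift_history.items()}
modeling_population = {d: y for d, y in labels.items() if y is not None}
print(modeling_population)
```

Only donor A, who started small and later jumped to the target amount, ends up as a 1.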

But when does the pruning of your target go too far? If it becomes too small, then the performance of a few donors can have a big effect. A minimum sample size of 30 is a magic rule-of-thumb that is mentioned regularly in the classroom.  If we were to approach that number in our dependent variable, we would be likely to redefine our target to increase the sample size.  In the example above, we might choose to lower the major gift threshold to \$25,000.  We’d definitely be interested to hear about less “magical” methods you might use to determine a lower bound for your target sample size!
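The fallback described here can be sketched as a small check. The threshold of 30 is the rule of thumb from the paragraph above, the gift amounts are invented, and `choose_cutoff` is a hypothetical helper name.

```python
MIN_TARGET_SIZE = 30  # the classroom rule of thumb


def target_size(largest_gifts, cutoff):
    """Count donors whose largest gift meets the cutoff."""
    return sum(1 for gift in largest_gifts if gift >= cutoff)


def choose_cutoff(largest_gifts, candidates, minimum=MIN_TARGET_SIZE):
    """Return the highest candidate cutoff that still leaves at least
    `minimum` donors in the target, or None if no candidate does."""
    for cutoff in sorted(candidates, reverse=True):
        if target_size(largest_gifts, cutoff) >= minimum:
            return cutoff
    return None


# Invented data: 20 donors at 50k+, 25 more between 25k and 50k, 500 below.
largest_gifts = [60_000] * 20 + [30_000] * 25 + [500] * 500
chosen = choose_cutoff(largest_gifts, [50_000, 25_000])
print(chosen)
```

With only 20 donors at \$50,000+, the sketch falls back to the \$25,000 threshold, where 45 donors qualify.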

In the end, murkiness is unavoidable, but the idea is to have a target variable look as much as possible like the future population you will be scoring. So, as you tend to your unruly independents, don’t forget about that seemingly well-behaved fellow, the dependent variable, because he is actually the leader of the bunch!

Kate Chamberlin leads a small analytics group at Memorial Sloan-Kettering Cancer Center. She came to Sloan-Kettering in fall 2006 from Columbia University, where she was a research analyst and writer for the university’s corporate and foundation relations office. Kate has also served as an events manager for Columbia’s principal gifts group, and a grant writer at Arts Horizons, a small arts education agency. She holds a bachelor’s degree in theater directing and design from Dartmouth College, and an MBA focusing on economics and strategy from Columbia Business School.

Michelle Paladino is part of the Memorial Sloan-Kettering Cancer Center’s growing analytics group. She develops predictive models and applies other advanced techniques to analyze donor behavior and measure program performance. Previously, Michelle was one of the Center’s fundraising officers. She holds a bachelor’s degree in political science and a master’s degree in public policy from New York University.

## 18 March 2010

### My Planned Giving model growing pains

Filed under: Model building, Planned Giving, regression — kevinmacdonell @ 8:22 am

People stumbling on CoolData might assume that I think I’ve gathered unto myself some great corpus of data mining knowledge and that now I presume to dispense it via this blog, nugget by nugget.

Uh, well – not quite.

The reality is that I spend a lot of my time at work and at home surrounded by my books, struggling to get my arms around the concepts, and doing a good deal of head-scratching. Progress is slow, as only about ten percent of my work hours are actually spent on data mining. Questions from CoolData readers are cause for anxiety more than anything else. (Questions are welcome, of course, but sometimes advice would be better.)

As a consequence, I proceed with caution when it comes to building models for my institution. I don’t have a great deal of time for testing and tweaking, and I steer clear of creating predictive score sets that cannot be deployed with a high level of confidence.

This caution has not prevented me from having some doubts about the model I created last year for our Planned Giving program, however.

This model sorted all of our alumni over a certain age into percentile ranks according to their propensity to engage with our institution in a planned giving agreement. Our Planned Giving Officer is currently focused on the individuals in the 97th percentile and up. Naturally, whenever a new commitment (verbal or written) comes across the transom (unsolicited, as I think PG gifts often are), the first thing I do is check the individual’s percentile score.

A majority of the new expectancies are in the 90s, which is good, and most of those are 97 and up, which is better. When I look at the Annual Giving model scores for these same individuals, however, I see that the AG scores do a better job of predicting the Planned Giving donors than the PG scores do. That strikes me as a bit odd.
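The comparison amounts to asking what share of new commitments each score set places at or above a given percentile. A minimal sketch, using invented percentile scores rather than actual results:

```python
def capture_rate(scores, threshold):
    """Share of individuals whose percentile score is at or above threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Invented percentile scores for five new commitments (not real data).
new_commitments = [
    {"pg": 98, "ag": 99},
    {"pg": 91, "ag": 97},
    {"pg": 85, "ag": 98},
    {"pg": 97, "ag": 95},
    {"pg": 60, "ag": 88},
]

pg_rate = capture_rate([c["pg"] for c in new_commitments], 97)
ag_rate = capture_rate([c["ag"] for c in new_commitments], 97)
print(f"PG score: {pg_rate:.0%} at 97+, AG score: {ag_rate:.0%} at 97+")
```

With so few cases, a single donor moves the rate by many points, so a gap like this is suggestive at best.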

Planned Giving being a slowly-evolving process, there aren’t enough examples of new commitments to properly evaluate the model, to my satisfaction at least. But when model-building time comes around again in July and August, I’ll be making some changes.

The central issue I faced was that current commitments numbered only a little over 100. That’s not a lot of historical data to model on. I asked around for advice. One key piece of advice was to cut down on the size of the prospect pool by excluding all alumni younger than our youngest current commitment. Done.
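That age filter is simple enough to sketch in a few lines. The records and field names below are hypothetical:

```python
# Hypothetical alumni records; `pg_commitment` marks a current commitment.
alumni = [
    {"id": 1, "age": 34, "pg_commitment": False},
    {"id": 2, "age": 52, "pg_commitment": True},
    {"id": 3, "age": 47, "pg_commitment": False},
    {"id": 4, "age": 71, "pg_commitment": True},
    {"id": 5, "age": 66, "pg_commitment": False},
]

# Age of the youngest person with a current commitment.
youngest_commitment_age = min(a["age"] for a in alumni if a["pg_commitment"])

# Keep only alumni at least that old.
prospect_pool = [a for a in alumni if a["age"] >= youngest_commitment_age]
pool_ids = [a["id"] for a in prospect_pool]
print(youngest_commitment_age, pool_ids)
```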

My primary interest, though, was to somehow legitimately boost the number of examples of PG donors, in order to beef up the dependent variable in a regression analysis.

Some institutions, I learned, tried to do this by digging into data on deceased planned giving donors, going back five or ten years. (I hope I do not strain decorum with the verb I’ve selected.) Normally we model only on living individuals, but having access to more examples of this type of donor has proven helpful for some. Unfortunately, on investigation I found that the technical issues involved made it prohibitively time-consuming: For various reasons, I would have had to perform many separate queries of the database in order to get at this data and merge it with that of the living population.

As luck would have it, though, around this time we received all the data from a huge, wide-ranging survey of alumni engagement we had conducted that March. One of the scale statements was specifically focused on attitudes towards leaving a bequest to our institution. The survey was non-anonymous, and a lot of positive responders to this statement were in our target age range. Bingo – I had a whole new group of “PG-oriented” individuals to add to my dependent variable. The PG model would be trained not only on current commitments, but on alumni who claimed to be receptive to the idea of planned giving.

In addition, I had the identities of a number of alumni who had attended information sessions on estate planning organized by our Planned Giving Officer.

I think all was well up to that point. What I did after that may have led to trouble.

I thought to myself, these PG-oriented people are not all of the same “value”. Surely a written gift commitment is “worth more” than a mere online survey response clicked on in haste. So I structured my dependent variable to look like this, using completely subjective ideas of what “value” ought to be assigned to each type of person:

• Answered “agree” to the PG statement in survey: 1 point
• Answered “strongly agree” to the PG statement in survey: 2 points
• Attended an estate planning session: 3 points
• Has made a verbal PG commitment: 6 points
• Has a written commitment in place: 8 points

Everyone else in the database was assigned a zero. And then I used multiple regression to create the model.

This summer, I think I will tone down the cleverness with my DV.

First of all, everyone with a pro-PG orientation (if I can put it that way) will be coded “1”. Everyone else will be coded “0”, and I will try using logistic regression instead of multiple regression, as more appropriate for a binary DV.
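To make the two codings concrete, here is a sketch that builds both the original weighted DV and the planned binary DV from the same flags. The flag names are hypothetical, and taking the highest point value when a person has several flags is an assumption (the scheme above does not say whether points were combined).

```python
# Point values from the list above; the flag names are hypothetical.
POINTS = {
    "survey_agree": 1,
    "survey_strongly_agree": 2,
    "estate_session": 3,
    "verbal_commitment": 6,
    "written_commitment": 8,
}


def weighted_dv(flags):
    """Original approach: highest point value among a person's flags.
    Using the maximum rather than a sum is an assumption."""
    return max((POINTS[f] for f in flags if f in POINTS), default=0)


def binary_dv(flags):
    """Planned approach: 1 for any pro-PG signal, 0 otherwise."""
    return 1 if any(f in POINTS for f in flags) else 0


people = {
    "p1": {"survey_agree"},
    "p2": {"estate_session", "written_commitment"},
    "p3": set(),
}

weighted = {pid: weighted_dv(flags) for pid, flags in people.items()}
binary = {pid: binary_dv(flags) for pid, flags in people.items()}
print(weighted)
print(binary)
```

Under the binary coding, the survey responder and the written commitment both become 1, which suits a logistic regression.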

Going back to the original model, it occurs to me that my method was based on a general misconception of what I was up to. In creating these “levels of desirability,” I ignored the role of the Planned Giving Officer. My job, as I see it now, is to deliver up the segment of alumni that has the highest probability of receptivity to planned giving. It’s the PGO’s task to engage with the merely interested and elevate them to verbal, then written, agreements. In that sense, the survey-responder and the final written commitment could very well be equivalent in “value”.

The point is, it’s not in my power to make that evaluation. Therefore, this year, everyone with the earmarks of planned giving about them will get the same value: 1. I hope that results in a more statistically defensible method.

(I should add here that although I recognize my model could be improved, I remain convinced that even a flawed predictive model is superior to any assumption-based segmentation strategy. I’ve flogged that dead horse elsewhere.)

A majority of the new expectancies are in the 90s, and most of those are 97 and up. However, when I compare the effectiveness of the PG score with that of the Annual Giving score, it would seem that the AG score does a better job of picking the Planned Giving donors than the PG score does! Even the old “general” model from 2008 does a (slightly) better job.

That’s a bit odd. The first thing I would say is that 11 is a very small sample and it’s hard to generalize from that.

## 4 March 2010

### Why transform the dependent variable?

In a previous post I mentioned in passing that for a particular predictive model using multiple regression, I re-expressed the dependent variable (‘Giving’) as a logarithmic function of the value. A reader commented, “I’m hoping that you will some day address the reasons for re-expressing the DV as a log. I’ve been searching for a good explanation in this context.” I said I’d have to get back to him on that.

Well, that was two months ago. I had to do some research, because I didn’t have the right words to express the reasoning behind transforming the dependent variable. I consulted a variety of texts and synthesized the bits I found to produce the summary below, using examples you’re likely to see in a fundraising database. Some of this will be tough chewing for the stats-innocent; do not feel you need to know this in order to use multiple regression. (For the keeners out there, just be aware that this discussion barely scratches the surface. Typical with all topics in statistics and modeling!)

Multiple regression works most reliably when the inputs come in a form that is well-known. The “form” we’re talking about is the distribution of the data. If the distribution of your data approximates that of a theoretical probability distribution, we can perform calculations on the data that are based on assumptions we can make about the well-known theoretical distribution. (Got that?)

The best-known example ‘theoretical distribution’ is the normal distribution, a.k.a. the famous “bell curve.” It looks just like a bell. (Duh.) The properties and characteristics of the normal probability distribution are well-known, which is important for the validity of the results we see in our regression analysis. (P-values, for example, which inform us whether our predictive variables are significant or not.)

Let’s say our dependent variable is ‘Lifetime Giving’. When we create a histogram of this variable, we can see that it isn’t distributed normally at all. There’s a whole pile of very small values at one end, and the larger values aren’t visible at all.

In order to make the variable better fit the assumptions underlying regression, we need to transform it. There are a number of ways to do this, but the most common for our purposes is to take the log of ‘Giving’. (This is easily done in Data Desk using a derived variable and the ‘log’ statement; just remember to take the log of ‘Giving’ plus a nominal value of 1, because you can’t take a log of zero.) When we call up a histogram of ‘Log of Lifetime Giving’, we can see that the distribution is significantly closer to the normal probability distribution. It’s a bit skewed to one side, but it’s a big improvement.
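Outside Data Desk, the same transformation can be sketched in Python. The giving amounts are invented, base 10 is an assumption (any base serves the same purpose), and the `skewness` helper is just a rough way to put a number on how lopsided each distribution is.

```python
import math

# Invented lifetime giving totals, including one non-donor.
lifetime_giving = [0, 25, 100, 1_000, 50_000, 2_000_000]

# Add 1 before taking the log so that zero giving maps to 0
# instead of raising an error.
log_giving = [math.log10(g + 1) for g in lifetime_giving]


def skewness(values):
    """Moment-based skewness; values near 0 indicate a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


print(f"skewness of raw giving: {skewness(lifetime_giving):.2f}")
print(f"skewness of log giving: {skewness(log_giving):.2f}")
```

The log version is still somewhat skewed, but far less so than the raw amounts.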

For the sake of this demonstration, I have left out all the individuals who have no giving. All those zero values would mess with the distribution, and the effect of the transformation would not be as evident in my chart. In the real world, of course, we include the non-donors. The resulting DV is far from ideal, but again, it’s a big improvement over the untransformed variable.

Our goal in transforming variables is not to make them more pretty and symmetrical, but to make the relationship between variables more linear. Ultimately we want to produce a regression equation which “both characterizes the data and meets the conditions required for accurate statistical inference,” (to quote Jacob Cohen et al., from the excellent text, “Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences,” page 233).

Linear relationships that are not evident using an untransformed form of ‘Lifetime Giving’ may be rendered detectable after transformation. So, in short, we transform variables in hopes of improving the overall model, which after all is a linear model.
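A small made-up example shows the effect. Suppose giving grows roughly multiplicatively with some predictor, here a hypothetical ‘years of giving’ variable. The Pearson correlation (a measure of linear association) is noticeably stronger against the log of giving than against the raw amounts:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Invented data: giving roughly triples with each additional year.
years_of_giving = [1, 2, 3, 4, 5, 6, 7, 8]
lifetime_giving = [10, 30, 100, 300, 1_000, 3_000, 10_000, 30_000]
log_giving = [math.log10(g + 1) for g in lifetime_giving]

r_raw = pearson_r(years_of_giving, lifetime_giving)
r_log = pearson_r(years_of_giving, log_giving)
print(f"r with raw giving: {r_raw:.2f}")
print(f"r with log giving: {r_log:.2f}")
```

Real data will be far noisier than this, but the direction of the improvement is the point.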