# CoolData blog

## 7 September 2010

### Making the dependent variable more dependable

Filed under: Model building, regression — kevinmacdonell @ 10:54 am

## Guest post by Kate Chamberlin and Michelle Paladino, Office of Development, Memorial Sloan-Kettering Cancer Center, New York NY


It’s an old statistics joke that when building a predictive model, you spend almost all of your time slaving over the data, only so that at the end of the slog, you get to press a button for the actual fun part. Cleaning data, imputing missing values, and restructuring, along with ceaseless contemplation of new and improved independent variables, is how we expend much of our energy, and rightly so. However, the most important variable doesn’t always get the attention it deserves.

Every predictive model centers on a particular population, defined by the dependent variable – so giving the target a little extra TLC goes a long way. It’s a balancing act between size and purity: the larger the target, the more statistical reliability you have, but the more precise the target definition, the better you can isolate the behavior you’re trying to predict. For example, corporations, foundations, and estates behave differently than individuals. Therefore, if your goal is to find individuals who have a high likelihood of making a major gift, clearing out those estates, foundations, and corporations, even if they have given at the target amount, will lead to a more trustworthy dependent variable.
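As a minimal sketch of that filtering step in pandas – the column names (`record_type`, `total_giving`) and the toy data are our own assumptions, not fields from any particular donor database:

```python
import pandas as pd

# Hypothetical donor file; record_type and total_giving are illustrative names.
donors = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "record_type": ["Individual", "Corporation", "Individual", "Estate", "Foundation"],
    "total_giving": [60000, 75000, 1000, 120000, 50000],
})

MAJOR_GIFT = 50000

# Restrict the modeling population to individuals BEFORE flagging the target,
# so corporations, foundations, and estates never enter the dependent variable,
# even when their giving exceeds the cutoff.
individuals = donors[donors["record_type"] == "Individual"].copy()
individuals["is_major"] = (individuals["total_giving"] >= MAJOR_GIFT).astype(int)
```

The order matters: filter first, then flag. Flagging first and filtering later invites the non-individual records to sneak back in through a join somewhere downstream.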

And what is the best target cutoff amount for a major gift, anyway? In our Development Office, the Major Gifts program starts at \$50,000. A binary dependent variable with the 1’s defined as individuals who have made a \$50,000+ gift is perfectly reasonable and works just fine, but is this cutoff meaningful from a donor’s perspective? And what about timing – do major donors of long ago look the same as those who have given more recently? We have yet to find the definitive answers, but checking to see if the independent variable distributions change dramatically with different targets, and running models with a few flavors of target populations, is a good way to evaluate whether these changes make a difference.
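That distribution check can be sketched in a few lines – again with hypothetical data and illustrative column names (`total_giving`, `years_on_file` standing in for any predictor you care about):

```python
import pandas as pd

# Hypothetical donor records; the columns are illustrative, not from the post.
df = pd.DataFrame({
    "total_giving": [60000, 30000, 5000, 100000, 26000, 800],
    "years_on_file": [20, 12, 3, 25, 10, 1],
})

# Flag the target at a few candidate cutoffs and compare how a predictor
# is distributed within each resulting target group. A big shift in these
# summaries suggests the cutoff choice actually changes who you are modeling.
for cutoff in (25000, 50000):
    target = df[df["total_giving"] >= cutoff]
    print(f"${cutoff:,}: n={len(target)}, "
          f"mean years on file={target['years_on_file'].mean():.1f}")
```

On real data you would look at more than one predictor and at full distributions (histograms, quantiles) rather than a single mean, but the mechanic is the same.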

Another method that can help you more clearly define the dependent variable is to consider which donors you will be applying your model scores to. For instance, if you work strictly in a donor database, as we do, and you are modeling for major donors, it is a good idea to exclude from your target those who came onto the file at the major-gift amount. In other words, remove the individuals whose first gift was \$50,000+. If the scores will be applied to donors who are giving below the target right now, then your dependent variable should include only donors who gave below the target level and then jumped up to it.
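One way to sketch that exclusion against a gift-level table – the table shape and column names (`donor_id`, `gift_amount`, `gift_date`) are assumptions for illustration:

```python
import pandas as pd

# Hypothetical gift history: one row per gift.
gifts = pd.DataFrame({
    "donor_id": [1, 1, 2, 3, 3],
    "gift_amount": [100, 60000, 75000, 500, 55000],
    "gift_date": pd.to_datetime(
        ["2001-01-01", "2008-06-01", "2005-03-15", "2003-02-01", "2009-11-30"]
    ),
})

MAJOR_GIFT = 50000

# Each donor's first gift, by date.
first_gifts = gifts.sort_values("gift_date").groupby("donor_id").first()

# Donors who reached the major-gift level at some point...
ever_major = gifts.groupby("donor_id")["gift_amount"].max() >= MAJOR_GIFT

# ...but whose FIRST gift was below it. They started small and moved up,
# which matches the population the scores will actually be applied to.
target_ids = ever_major[ever_major].index.intersection(
    first_gifts[first_gifts["gift_amount"] < MAJOR_GIFT].index
)
```

Here donor 2, who arrived on the file with a \$75,000 gift, drops out of the target, while donors 1 and 3, who started below \$50,000 and later crossed it, stay in.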

But when does the pruning of your target go too far? If the target becomes too small, then the behavior of a few donors can have a big effect. A minimum sample size of 30 is a magic rule of thumb that is mentioned regularly in the classroom. If we were to approach that number in our dependent variable, we would likely redefine our target to increase the sample size. In the example above, we might choose to lower the major-gift threshold to \$25,000. We’d definitely be interested to hear about less “magical” methods you might use to determine a lower bound for your target sample size!
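The "lower the threshold until the target is big enough" logic can be written down directly. The counts below are made up for illustration; in practice they would come from a query against your own donor file:

```python
# Hypothetical counts of target donors at each candidate cutoff,
# e.g. pulled from the donor file with a simple query.
target_counts = {50000: 24, 25000: 61, 10000: 212}

MIN_TARGET = 30  # the classroom rule of thumb mentioned above

# Choose the highest (most precise) cutoff that still clears the minimum:
# purity first, but never below the size floor.
cutoff = max(c for c, n in target_counts.items() if n >= MIN_TARGET)
```

With these toy counts, \$50,000 yields only 24 targets, so the logic settles on \$25,000 – the same move described in the paragraph above, just made explicit.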