Sometimes a question from someone who is new to data mining will have me scratching my head. I am driven back to the data to find the answer, and sometimes a new insight, too.
This time I have to credit someone near the back of the room during a presentation I gave at APRA’s annual conference in Anaheim, CA, last month. (If this was you, please leave a comment.) The session was called Regression for Beginners. In that session, I talked about two employment-related binary variables: Position Present (i.e., Job Title) and Employer Name Present. Both are strongly correlated with the dependent variable, Lifetime Giving.
However, I warned, we cannot count on equal influence from each variable in our model, because there is a large degree of overlap (or interaction) between the two predictors. Only regression will account for this interaction and prevent us from “double-counting” the predictive power of these variables. In practice, I explained, I always end up keeping one employment-related variable in the model and excluding the other. Although the two variables are not precisely equivalent, the second variable fails to add significant explanatory power to the model, so I leave it out.
This is when the question was posed: If Position Present and Employer Present are not identical to each other, have I ever tested the condition in which BOTH fields were populated? Do alumni with complete data have more giving?
That stopped me short. No, in fact, it had never occurred to me. It made perfect sense, though, so shortly after my return, I went back to the data for another look. I extracted a file containing every living alum, their lifetime giving, and their Position and Employer fields. I eliminated everyone with giving over $25,000, to lessen the influence of major-gift prospect research on the analysis.
The majority of alums have no employment data at all. About 7.5% have a job title but no company name, and about 3.5% have the opposite — a company name but no job title. The remainder, not quite one-third of alums, have both. When I see how the groups compare by average lifetime giving, the differences are striking:
So while it’s true that Position Present and Employer Present are each associated with giving, the association is even stronger when both are present. (These averages include non-donors.)
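For anyone who wants to reproduce this kind of comparison, here is a minimal sketch in Python. The records are made up purely for illustration (they are not my data), and the names `alumni`, `employment_group` and `mean_giving_by_group` are hypothetical. The idea is simply to sort each alum into one of the four employment-data groups and average lifetime giving within each group, non-donors included.

```python
from collections import defaultdict

# Hypothetical records: (has_position, has_employer, lifetime_giving).
# The values are invented for illustration only.
alumni = [
    (False, False, 0.0),
    (False, False, 50.0),
    (True, False, 100.0),
    (False, True, 80.0),
    (True, True, 300.0),
    (True, True, 500.0),
]


def employment_group(has_position, has_employer):
    """Label an alum by which employment fields are populated."""
    if has_position and has_employer:
        return "both"
    if has_position:
        return "position only"
    if has_employer:
        return "employer only"
    return "none"


def mean_giving_by_group(records):
    """Average lifetime giving per employment group (non-donors count as 0)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for has_position, has_employer, giving in records:
        group = employment_group(has_position, has_employer)
        totals[group] += giving
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


group_means = mean_giving_by_group(alumni)
for group, mean in sorted(group_means.items()):
    print(f"{group}: {mean:.2f}")
```

With real data you would apply the same giving cap first (dropping anyone over $25,000, as I did) so that a handful of major donors does not dominate the averages.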
The next step was to create a new variable called Combination and give it a value of 0, 1, 2 or 3, depending on which employment data were present: 0 for no data, 1 for Employer only, 2 for Position only, and 3 for both. When I compare Combination's linear correlation with Lifetime Giving against that of the two original variables, here is what I get:
Combination provides a stronger correlation than either of the others alone. It’s not a massive difference, but it’s an improvement, and that’s all that matters, really. It should do a better job in a regression model, and I won’t have to throw away a good predictor due to redundancy or interaction. There are other ways to whip up new variables from the original two — get creative, and then test.
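Here is a rough sketch of that test in Python, using only the standard library. The sample records are invented, so the numbers it prints say nothing about real alumni, and the helper names (`combination`, `pearson`) are my own for illustration. It builds the 0-to-3 Combination code described above and computes a plain Pearson correlation with lifetime giving for each of the three candidate variables.

```python
import math

# Hypothetical records: (has_position, has_employer, lifetime_giving).
# The values are invented for illustration only.
alumni = [
    (False, False, 0.0),
    (False, False, 50.0),
    (True, False, 100.0),
    (False, True, 80.0),
    (True, True, 300.0),
    (True, True, 500.0),
]


def combination(has_position, has_employer):
    """Code employment data: 0 = none, 1 = employer only, 2 = position only, 3 = both."""
    if has_position and has_employer:
        return 3
    if has_position:
        return 2
    if has_employer:
        return 1
    return 0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


giving = [g for _, _, g in alumni]
position_flag = [int(p) for p, _, _ in alumni]
employer_flag = [int(e) for _, e, _ in alumni]
combo = [combination(p, e) for p, e, _ in alumni]

correlations = {
    "Position Present": pearson(position_flag, giving),
    "Employer Present": pearson(employer_flag, giving),
    "Combination": pearson(combo, giving),
}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

Note that the 0-to-3 coding treats the four groups as if they sat on an ordered scale. That is a choice worth testing against your own group averages rather than taking for granted.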
Every new modeling project brings an opportunity to manipulate variables creatively in order to find new linear relationships that might prove useful. For most variables I am still testing only binary conditions (yes, we have a business phone number for an alum, or no, we don’t) for correlation with the outcome variable. Sometimes I test counts of records (e.g., number of business phone updates), and even more rarely, I test transformations of continuous variables (e.g., natural log of number of business phone updates).
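To make those three flavours concrete, here is a small Python sketch that derives all of them from one record. The field names (`business_phone`, `phone_updates`) and the function `derive_phone_features` are hypothetical. One assumption to flag loudly: a plain natural log is undefined for a count of zero, so this sketch uses ln(1 + count) instead, which is only one of several ways to handle zeros.

```python
import math


def derive_phone_features(record):
    """Build binary, count and log-transformed predictors from one alum record.

    `record` is a dict with hypothetical keys:
      - "business_phone": a phone string, or None if we have no number
      - "phone_updates": how many times the business phone was updated
    """
    updates = record["phone_updates"]
    return {
        # Binary condition: do we have a business phone number at all?
        "phone_present": 1 if record["business_phone"] else 0,
        # Count of records: number of business phone updates.
        "phone_update_count": updates,
        # Transformation: ln(1 + count), an assumption so that zero counts are defined.
        "phone_update_log": math.log1p(updates),
    }


# Invented example records, for illustration only.
sample = [
    {"business_phone": "555-0100", "phone_updates": 3},
    {"business_phone": None, "phone_updates": 0},
]
features = [derive_phone_features(r) for r in sample]
for row in features:
    print(row)
```

Each derived column can then be tested for correlation with the outcome variable in the same way as any other candidate predictor.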
Sometimes I miss even more basic approaches, such as this way to handle employment variables, which is something to try anytime two good predictor variables interact to a high degree.
Thanks to my fellow beginners, my education continues.