CoolData blog

11 August 2010

Making hay when predictor variables interact

Filed under: Derived variables, Predictor variables. Posted by kevinmacdonell @ 8:19 am

Sometimes a question from someone who is new to data mining will have me scratching my head. I am driven back to the data to find the answer, and sometimes a new insight, too.

This time I have to credit someone near the back of the room during a presentation I gave at APRA’s annual conference in Anaheim, CA, last month. (If this was you, please leave a comment.) The session was called Regression for Beginners. In that session, I talked about two employment-related binary variables: Position Present (i.e., Job Title) and Employer Name Present. Both are strongly correlated with the dependent variable, Lifetime Giving.

However, I warned, we cannot count on equal influence from each variable in our model, because there is a large degree of overlap (or interaction) between the two predictors. Only regression will account for this interaction and prevent us from “double-counting” the predictive power of these variables. In practice, I explained, I always end up keeping one employment-related variable in the model and excluding the other. Although the two variables are not precisely equivalent, the second variable fails to add significant explanatory power to the model, so I leave it out.

This is when the question was posed: If Position Present and Employer Present are not identical to each other, have I ever tested the condition in which BOTH fields were populated? Do alumni with complete data have more giving?

That stopped me short. No, in fact, it had never occurred to me. It made perfect sense, though, so shortly after my return, I went back to the data for another look. I extracted a file containing every living alum, their lifetime giving, and their Position and Employer fields. I eliminated everyone with giving over $25,000, to lessen the influence of major-gift prospect research on the analysis.

The majority of alums have no employment data at all. About 7.5% have a job title but no company name, and about 3.5% have the opposite — a company name but no job title. The remainder, not quite one-third of alums, have both. When I see how the groups compare by average lifetime giving, the differences are striking:

So while it’s true that Position Present and Employer Present are each associated with giving, the association is even stronger when both are present. (These averages include non-donors.)
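For readers who want to try the same comparison, here is a minimal sketch in Python of the grouping described above. It is not the tool or the data used for this post: the records, the field layout, and the function names (`employment_group`, `mean_giving_by_group`) are all invented for illustration. The only things taken from the text are the four employment conditions, the $25,000 cutoff, and the fact that non-donors stay in the averages.

```python
from collections import defaultdict

# Hypothetical toy records: (position, employer, lifetime_giving).
# These values are invented for illustration; they are NOT the post's data.
ALUMS = [
    ("Engineer", "Acme Ltd", 500.0),
    ("Teacher", "", 100.0),
    ("", "Widget Co", 50.0),
    ("", "", 0.0),        # non-donor, still counted in the average
    ("", "", 20.0),
    ("Director", "Big Corp", 30000.0),  # over the cap, will be excluded
]

GIVING_CAP = 25_000  # drop anyone with lifetime giving over this amount


def employment_group(position, employer):
    """Classify a record by which employment fields are populated."""
    has_position = bool(position.strip())
    has_employer = bool(employer.strip())
    if has_position and has_employer:
        return "Both"
    if has_position:
        return "Position only"
    if has_employer:
        return "Employer only"
    return "Neither"


def mean_giving_by_group(records, cap=GIVING_CAP):
    """Average lifetime giving per employment group, non-donors included."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for position, employer, giving in records:
        if giving > cap:
            continue  # lessen the influence of major-gift prospect research
        group = employment_group(position, employer)
        totals[group] += giving
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in counts}


averages = mean_giving_by_group(ALUMS)
for group, avg in sorted(averages.items()):
    print(f"{group}: {avg:.2f}")
```

With real data, the interesting question is whether the "Both" group stands apart from the two single-field groups, as it did here.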

The next step was to create a new variable called Combination and give it a value of 0, 1, 2 or 3, depending on what employment data was present: 0 for no data, 1 for Company only, 2 for Position only, and 3 for both. When I compare the strength of linear correlation with Lifetime Giving for Combination against the two old variables, here is what I get:

Combination provides a stronger correlation than either of the others alone. It’s not a massive difference, but it’s an improvement, and that’s all that matters, really. It should do a better job in a regression model, and I won’t have to throw away a good predictor due to redundancy or interaction. There are other ways to whip up new variables from the original two — get creative, and then test.
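The Combination coding and the correlation check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the software used for this post: the toy rows are invented, and `combination` and `pearson` are hypothetical helper names. The coding itself (0 for no data, 1 for Company only, 2 for Position only, 3 for both) is the one described above.

```python
import math


def combination(has_position, has_employer):
    """0 = no data, 1 = company only, 2 = position only, 3 = both."""
    # Position counts for 2 and employer for 1, so the sum gives 0 to 3.
    return 2 * int(bool(has_position)) + int(bool(has_employer))


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy rows: (position_present, employer_present, lifetime_giving).
# Invented values for illustration only.
ROWS = [
    (0, 0, 0.0), (0, 0, 10.0), (0, 1, 40.0), (1, 0, 80.0),
    (1, 1, 300.0), (1, 1, 200.0), (0, 0, 0.0), (1, 0, 60.0),
]

position = [p for p, _, _ in ROWS]
employer = [e for _, e, _ in ROWS]
giving = [g for _, _, g in ROWS]
combo = [combination(p, e) for p, e, _ in ROWS]

r_position = pearson(position, giving)
r_employer = pearson(employer, giving)
r_combination = pearson(combo, giving)

print(f"Position Present: {r_position:.3f}")
print(f"Employer Present: {r_employer:.3f}")
print(f"Combination:      {r_combination:.3f}")
```

Whether Combination beats the two originals depends entirely on the data, so treat the comparison as a test to run rather than a result to expect.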

Every new modeling project brings an opportunity to manipulate variables creatively in order to find new linear relationships that might prove useful. For most variables I am still testing only binary conditions (yes, we have a business phone number for an alum, or no, we don’t) for correlation with the outcome variable. Sometimes I test counts of records (e.g., number of business phone updates), and even more rarely, I test transformations of continuous variables (e.g., natural log of number of business phone updates).
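The three kinds of derived variable mentioned above (a binary flag, a count, and a log transformation of the count) might be built like this. It is a sketch only: the function name, the dictionary keys, and the sample phone number are hypothetical. One assumption to flag loudly: the natural log of zero is undefined, so this version takes the log of the count plus one, which is a common workaround and not something specified in this post.

```python
import math


def derive_phone_variables(business_phone, update_count):
    """Build three candidate predictors from business phone data.

    business_phone: the phone number as text, or None/blank if missing.
    update_count:   number of business phone updates on record (>= 0).
    """
    present = 1 if business_phone and business_phone.strip() else 0
    return {
        # Binary condition: do we have a business phone at all?
        "bus_phone_present": present,
        # Count of records.
        "bus_phone_updates": update_count,
        # ASSUMPTION: ln(count + 1), so that zero updates maps to 0.0
        # instead of an undefined ln(0).
        "ln_bus_phone_updates": math.log1p(update_count),
    }


# Hypothetical examples.
with_phone = derive_phone_variables("902-555-0100", 3)
without_phone = derive_phone_variables(None, 0)
print(with_phone)
print(without_phone)
```

Each of the three columns can then be tested for correlation with the outcome variable in the usual way.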

Sometimes I miss even more basic approaches, such as this way to handle employment variables, which is something to try anytime two good predictor variables interact to a high degree.

Thanks to my fellow beginners, my education continues.

6 Comments »

  1. I would be curious how this would look if you were able to categorize and code the source of the employment update.

    Also, is there an order effect? Is a person who provides an employment update before their first gift more likely to give more over their lifetime than someone who gives first and then provides an update?

    Comment by Darren — 11 August 2010 @ 10:44 am

    • Darren,

      I would also be curious to research employment data by source. I have not done so, for various reasons. First, I don’t see a field that contains that information in views provided by the database query tool that I use, so I would need to request that it be added (if it’s available at all). Second, I think you’re probably getting at trying to separate purchased and researched data from alumni-provided data; in Canada there is very little personal data of this sort available for purchase, so it’s not as big a concern here — if we have it, it’s not likely from a big batch upload of data from external sources. Third, categorizing variables by source adds time to data preparation and I would be unlikely to pursue it unless I thought it would yield a useful result. I would be more likely to track down “source” in connection with contact-related data: Home and business phones and addresses. I know we’ve got source data for those, and it’s high on my list of need-to-do’s.

      To date I have had little interest in researching the order of events. That’s not to say it wouldn’t be useful research. However, that is straying into determination of causation, when I am mainly interested in correlation — for now at least. I think order effects have application in modeling for rare events (Planned Giving, Major Giving), but for Annual Giving I am not concerned with how events are ordered in time. Why? Because I view participation in the Annual Fund as a “state of being”, not an event. (For more on this, see my post from 6 July 2010.) That’s just my bias — I would never discourage anyone from delving into this.

      Your two suggestions hint at the myriad creative things we can do to extract more predictive power from our data — there may be more lurking there than we realize on the face of it. There really is no end to the coolness.

      Comment by kevinmacdonell — 11 August 2010 @ 11:18 am

  2. Good point about maximizing the opportunity to include scale variables in the mix. I know that I underutilize them and frequently don’t stop to think of ways to create them to deliver a rich set of data into the mathematics. Good point indeed! Thanks.

    Comment by Diane — 12 August 2010 @ 9:58 am

  3. Hi Kevin,

    I was at your APRA presentation when the question was posed and am glad to have your follow-up. Thank you very much! Cristina

    Comment by Cristina MacMahon — 16 August 2010 @ 8:51 am

  4. […] are highly correlated with each other, i.e. that have strong interactions in regression analysis. (Making hay when predictor variables interact.) The example I used was Position Present and Employer Name Present. Instead of using one and […]

    Pingback by More on making hay from variables that interact « CoolData blog — 3 September 2010 @ 6:03 am

