CoolData blog

3 September 2010

More on making hay from variables that interact

A short while ago I wrote about pairs of predictor variables that are highly correlated with each other, i.e. that have strong interactions in regression analysis. (Making hay when predictor variables interact.) The example I used was Position Present and Employer Name Present. Instead of using one and throwing the other away as redundant, you can combine them to form a new variable with more predictive power than either of the original two on their own.

In this post, I’ll show you how to identify other likely pairs of variables from which you can try to make similar “combination variables.”

When independent variables interact in regression, it’s called multicollinearity. (You can read a good discussion of multicollinearity on the Stats Make Me Cry blog: Top Ten Confusing Stats Terms Explained in Plain English, #9: Multicollinearity.) Position Present and Employer Name Present is an obvious example, but all kinds of subtle combinations are possible and difficult to foresee. We need to call on some help in detecting the interaction. That help is provided by Pearson’s Product-Moment Correlation, also known as Pearson’s r. I’ve written about Pearson’s r before.

In a nutshell, Pearson’s r calculates a number that describes the strength of linear correlation between any two variables. Your stats software makes this easy. In DataDesk, I select all the icons of the variables I want to calculate correlations for, then find Pearson’s r in the menu. The result is a new window containing a table full of values. If you include many variables at once, this table will be massive. Click on the image below for a full-size version of a Pearson table based on some real data from a university. (You might have to enlarge it in your browser.) The table works exactly like those distance tables you find on old tourist highway maps (they don’t seem to make those anymore — wonder why); to find the distance from, say Albuquerque to Santa Fe, you’d find the number at the intersection of the Albuquerque column with the Santa Fe row, and that would be the number of miles to travel. In the table below, the cities are variables and the mileage is Pearson’s r.

Don’t be intimidated by all the numbers! Just let your eye wander over them. Notice that some are positive, some negative. The negative sign simply means that the correlation between the two variables is negative. Notice also that most of the numbers are small, less than 0.1 (or minus 0.1). As far as multicollinearity is concerned, we’re most interested in these bigger values, i.e. values that are furthest from zero (no correlation) and closest to 1 (perfect linear correlation).

I realize some of the variable names will be a bit mysterious, but you might be able to guess that “Number deg” is Number of Degrees and that “Grad Y” means Graduated. Their Pearson’s r value (0.26) is one of the higher correlations, which makes sense, right?

Noticing certain correlations can teach you things about the data. ‘Female’ is correlated with ‘Class (Year)’ — because at this university, males outnumbered females years ago, but since the 1980s, females have outnumbered males by an ever-increasing factor. On the other hand, ‘Number HC’ (campus reunions attended) is negatively correlated with ‘Class Year’ — older alumni have attended more events (no big surprise), but also young alumni are not big on reunions at this institution.

Look at ‘Business Phone Present’ and ‘Employer Present’. Their r value is relatively high (0.376). I would test some variations of those two. You could add them together, so that the variable could range from 0 to 2. Or you could multiply them, to give you a binary variable that would have a value of 1 only if both of the original variables was 1. You might end up with a predictor that is more highly correlated with ‘Giving’ than either of the original two variables.

With non-binary variables such as ‘Class Year’ and ‘Number Events Attended’, the results of combining will be even more varied and interesting. What you do is up to you; there’s no harm in trying. When you’re done playing, just rank all your old and new variables in order by the absolute value of their strength of linear correlation with your predicted value (say, Giving), and see how the new variables fare.


11 August 2010

Making hay when predictor variables interact

Filed under: Derived variables, Predictor variables — Tags: , — kevinmacdonell @ 8:19 am

Sometimes a question from someone who is new to data mining will have me scratching my head, I am driven back to the data to find the answer — and sometimes a new insight, too.

This time I have to credit someone near the back of the room during a presentation I gave at APRA’s annual conference in Anaheim CA last month. (If this was you, please leave a comment.) The session was called Regression for Beginners. In that session, I talked about two employment-related binary variables: Position Present (i.e., Job Title) and Employer Name Present. Both are strongly correlated with the dependent variable, Lifetime Giving.

However, I warned, we cannot count equal influence for each variable in our model because there is a large degree of overlap — or interaction — between the two predictors. Only regression will account for this interaction and prevent us from “double-counting” the predictive power of these variables. In practice, I explained, I always end up keeping one employment-related variable in the model and excluding the other. Although the two variables are not precisely equivalent, the second variable fails to add significant explanatory power to the model, so I leave it out.

This is when the question was posed: If Position Present and Employer Present are not identical to each other, have I ever tested the condition in which BOTH fields were populated? Do alumni with complete data have more giving?

That stopped me short. No, in fact, it had never occurred to me. It made perfect sense, though, so shortly after my return, I went back to the data for another look. I extracted a file containing every living alum, their lifetime giving, and their Position and Employer fields. I eliminated everyone with giving over $25,000, to lessen the influence of major-gift prospect research on the analysis.

The majority of alums have no employment data at all. About 7.5% have a job title but no company name, and about 3.5% have the opposite — a company name but no job title. The remainder, not quite one-third of alums, have both. When I see how the groups compare by average lifetime giving, the differences are striking:

So while it’s true that position present and employer present are each associated with giving, the association is even stronger when both are present. (These averages include non-donors.)

The next step was to create a new variable called Combination, and give it a value of zero, 1, 2 or 3, depending on what employment data was present: 0 for no data, 1 for Company only, 2 for Position only, and 3 for both. When I compare the strength of linear correlation with LT Giving for Combination as compared with the two old variables, here is what I get:

Combination provides a stronger correlation than either of the others alone. It’s not a massive difference, but it’s an improvement, and that’s all that matters, really. It should do a better job in a regression model, and I won’t have to throw away a good predictor due to redundancy or interaction. There are other ways to whip up new variables from the original two — get creative, and then test.

Every new modeling project brings an opportunity to manipulate variables creatively in order to find new linear relationships that might prove useful. For most variables I am still testing only binary conditions (yes, we have a business phone number for an alum, or no we don’t) for correlation with the outcome variable. Sometimes I test counts of records (eg., number of business phone updates), and even more rarely, I test transformations of continuous variables (eg., natural log of number of business phone updates).

Sometimes I miss even more basic approaches, such as this way to handle employment variables, which is something to try anytime two good predictor variables interact to a high degree.

Thanks to my fellow beginners, my education continues.

4 March 2010

Why transform the dependent variable?

In a previous post I mentioned in passing that for a particular predictive model using multiple regression, I re-expressed the dependent variable (‘Giving’) as a logarithmic function of the value. A reader commented, “I’m hoping that you will some day address the reasons for re-expressing the DV as a log. I’ve been searching for a good explanation in this context.” I said I’d have to get back to him on that.

Well, that was two months ago. I had to do some research, because I didn’t have the right words to express the reasoning behind transforming the dependent variable. I consulted a variety of texts and synthesized the bits I found to produce the summary below, using examples you’re likely to see in a fundraising database. Some of this will be tough chewing for the stats-innocent; do not feel you need to know this in order to use multiple regression. (For the keeners out there, just be aware that this discussion barely scratches the surface. Typical with all topics in statistics and modeling!)

Multiple regression works most reliably when the inputs come in a form that is well-known. The “form” we’re talking about is the distribution of the data. If the distribution of your data approximates that of a theoretical probability distribution, we can perform calculations on the data that are based on assumptions we can make about the well-known theoretical distribution. (Got that?)

The best-known example ‘theoretical distribution’ is the normal distribution, a.k.a. the famous “bell curve.” It looks just like a bell. (Duh.) The properties and characteristics of the normal probability distribution are well-known, which is important for the validity of the results we see in our regression analysis. (P-values, for example, which inform us whether our predictive variables are significant or not.)

Let’s say our dependent variable is ‘Lifetime Giving‘. When we create a histogram of this variable, we can see that it isn’t distributed normally at all. There’s a whole pile of very small values at one end, and the larger values aren’t visible at all.

In order to make the variable better fit the assumptions underlying regression, we need to transform it. There are a number of ways to do this, but the most common for our purposes is to take the log of ‘Giving’. (This is easily done in Data Desk using a derived variable and the ‘log’ statement; just remember to take the log of ‘Giving’ plus a nominal value of 1, because you can’t take a log of zero.) When we call up a histogram of ‘Log of Lifetime Giving’, we can see that the distribution is significantly closer to the normal probability distribution. It’s a bit skewed to one side, but it’s a big improvement.

For the sake of this demonstration, I have left out all the individuals who have no giving. All those zero values would mess with the distribution, and the effect of the transformation would not be as evident in my chart. In the real world, of course, we include the non-donors. The resulting DV is far from ideal, but again, it’s a big improvement over the untransformed variable.

Our goal in transforming variables is not to make them more pretty and symmetrical, but to make the relationship between variables more linear. Ultimately we want to produce a regression equation which “both characterizes the data and meets the conditions required for accurate statistical inference,” (to quote Jacob Cohen et al., from the excellent text, “Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences,” page 233).

Linear relationships that are not evident using an untransformed form of ‘Lifetime Giving’ may be rendered detectable after transformation. So, in short, we transform variables in hopes of improving the overall model, which after all is a linear model.

8 February 2010

How to do basic text-mining

Filed under: Annual Giving, Derived variables, Text, Text mining — Tags: , — kevinmacdonell @ 8:49 am

Turn prose into data for insights into your constituents' behaviour. (Photo used via Creative Commons licence. Click photo for source.)

Database users at universities make frequent use of free-text comment fields to store information. Too frequent use, perhaps. Normally, free text is resorted to only when there’s a need to store information of a type that cannot be conveniently coded (preferably from a pre-established “validation table” of allowed values). Unstructured information such as comments requires some work to turn it into data that can reveal patterns and correlations. This work is called text mining.

Here are steps I took to do some rather crude text-mining on a general-comments field in our database. My method was first to determine which words were used most frequently, then select a few common ‘suggestive’ words that might show interesting correlations, and finally to test the variables I made from them for correlations with giving to our institution.

The comments I was trying to get at were generated from our Annual Giving phonathon. Often these comments flag alumni behaviours such as hanging up on the caller, being verbally abusive, or other negative things. As certain behaviours often prompt the same comments over and over (eg. “hung up on the caller”), I thought that certain frequently-occurring keywords might be negatively correlated with giving.

The method outlined below is rather manual. As well, it focuses on single words, rather than word combinations or phrases. There are some fantastic software packages out there for going much deeper, more quickly. But giving this a try is not difficult and will at least give you a taste for the idea behind text mining.

My method was first to discover the most common words that sounded like they might convey some sense of “attitude”:

  • Using a query in Access, I extracted the text of all comments, plus comment type, from the database – including the ID of the individual. (We use Banner so this data came from the APACOMT screen.)
  • I dumped the data into Excel, and eliminated certain unwanted comments by type code (such as event attendance, bios, media stories, etc.), leaving about 6,600 comments. (I saved this Excel file, to return to later on.)
  • I copied only the column of remaining comments, and pasted this text into a basic text editor. (I like to use EditPad Lite, but anything you have that works with big .txt files is fine.)
  • I used Find-and-replace to change all spaces into carriage returns, so that each word was on one line.
  • I used Find-and-replace again to removed common punctuation (quote marks, periods, commas etc.)
  • I changed all uppercase characters to lowercase characters, so “The” wouldn’t be counted separately from “the”.
  • The result was a very long column of single words. I copied the whole thing, and pasted it into Data Desk, as a single variable.
  • This allowed me to create a frequency table, sorted by count so the most common words would appear at the top. More than 100,000 cases fell into a little less than 5,000 categories (i.e. words).

The most common words were, in order: to, the, a, made, and, be, mail, by, only, from, not, list, removed, nn, of, in, solicitation, he, no, phonathon, she, pledge, is, wishes, said, unhonoured, on, does, was, giving, phone, will, caller, her, donate.

I recognized some of our most common comments, including “made by-mail-only”, “made phonathon no”, “unhonoured pledge”, etc. These states are already covered by specific coding elsewhere in the database, so I skipped over these and looked farther down to some of the more subjective “mood” words, such as “hang” and “hung” (which almost always meant “hung up the phone”), “rude”, “upset”, “never”, “told”, etc.

I went back to my original Excel file of comments and created a few new columns to hold a 0/1 variable for some of these categories. This took some work in Excel, using the “Contains” text filter. So, for example, every comment that contained some variation on the theme of ‘hanging up the phone’ received a 1 in the column called “Hung up”, and all the others got a zero.

From there, it was easy to copy the IDs, with the new variable(s), into Data Desk, where I matched the data up with Lifetime Giving. The idea of course was to discover a new predictor variable or two. For example, it seemed likely that alumni with a 1 for the variable ‘Hung Up’ might have given less than other alumni. As it turned out, though, the individual variables I created on this occasion were not particularly predictive of giving (or of failing to give).

I certainly haven’t given up on the idea, though, because there is much room for improvement in the analysis. For one thing, I was looking for correlations with Lifetime Giving, when I should have specified Phonathon Giving. People who hang up on student callers aren’t non-donors, necessarily; they just don’t care for being contacted by phone. (Why they don’t just ask to be taken off the calling list, I’m not sure.)

In the meantime, this very basic text-mining technique DID prove very useful when I needed to compare two models I had created for our Annual Giving program. I had designed an improved model which specifically targeted phone-receptive alumni, in the hopes of reducing the number of hang-ups and other unpleasant phone encounters. I showed the effectiveness of this approach through the use of text mining, conducted exactly as outlined above. (I’ll detail the results in a future post.)

Do you have a lot of text-based comments in your database? Do you have a lot of text-based response data from (non-anonymous) surveys? Play around with mining that text and see what insights you come up with.

13 December 2009

Adding new variables to existing datafile in Data Desk, Part 2

Filed under: Data Desk, Derived variables — Tags: , , , , , — kevinmacdonell @ 8:04 pm

As explained in Part 1, you can simply paste a new variable into an existing relation IF there are the same number of cases in both the existing relation and the new variable, AND they are in exactly the same order. Otherwise you will have to perform a join, as follows.

As an example, let’s say my existing data set has variables for a unique identifier (ID), First Name, Last Name, and Class Year. And let’s say that for some reason, I need to add Degree as a variable. (For something simple like this, I would of course scrap this data file and re-query the database to get all the data I need; this is just an example.)

I’ll need to query the database for two things, ID and Degree. (Whenever you query your database, you must never fail to include a unique record identifier. In Banner this may be ID or PIDM.)

Copy the columns you want from Excel, Access or whatever your source is.

In Data Desk, go to the Data menu:  New > Relation. You’ll be pasting your new variables into a relation that is separate from the existing relation that contains the names and class years.

Name the folder (let’s call it “Deg“), and paste your variables. For neatness, drag the variables out of the Clipboard folder and into Deg, and drag the empty Clipboard to the trash.

To avoid confusing the new ID variable and the ID variable as already exists, rename the new variable. (eg. ‘ID-degree’)

Create a new derived variable in the Deg relation folder. Name it ‘ID-degree-lookup’, and give it this expression:


Remember to drag each of the two variables into the expression window rather than type them. The Lookup command will literally look up the value of the first argument in the list of values in the second argument.

Now create another derived variable, again in the Deg relation folder. Call it ‘Degree-joined’. Give it this expression:


The GetCase command will retrieve the value of ‘Degree’ that matches each looked-up ID value. The nested TextOf command is needed because the value of ‘Degree’ is non-numerical. If the variable we were adding was a numerical value, such as Lifetime Giving, the TextOf command would not be needed.

You can check if the variable works by choosing Show numbers from the variable’s Hyperview menu, or by dragging the variable into a frequency table.

‘Degree-joined’ is now a full member of both relations – you can drag it into the folder that contains your original data without getting an error message. It’s now available to be used like any other variable in the original data set.

If this seems a little confusing at first, don’t worry. Print these directions out and refer to them as needed. If you are building a predictive model and discovering new variables as you go along, you’ll be calling on this technique so often that eventually you’ll know it by heart.

7 December 2009

Rural vs. Urban postal codes

Filed under: Data Desk, Derived variables — Tags: , , , — kevinmacdonell @ 11:18 am

If your data set contains primarily individuals residing in Canada, you can create a derived indicator variable in Data Desk to identify who does not live in an urban area or small town. Just one more potential predictor variable to test in your models.

First, your data set needs to contain Postal Code. You’ll also need to include Country, if your data set includes non-Canadian addresses.

The expression below will create an indicator variable with a value of ‘1’ (true) for rural Canadian addresses, and ‘0’ (false) for urban addresses.

Drag your own variables for Postal Code and Country into the expression window.

mid('Postal Code',2,1) = 0 AND textof('Country') = "Canada"

Older Posts »

Blog at