CoolData blog

3 September 2015

Testing associations between two categorical variables


Over the coming months I will be adapting various bits from a work in progress, a new book on building predictive models using regression, aimed at people working in nonprofit and higher education advancement.


(If you’re interested, I suggest subscribing via email — see the box to the right — to have the inside track on this project.)


In my previous post, I wrote about “Exploring associations between variables.” Today I go deeper to describe exactly how to test for an association or relationship between two variables that are categorical.


These directions are specific to the statistics software Data Desk, but apply equally to any stats package. I make frequent reference to contingency tables, which you may know better as cross-tabs or R x C tables (row by column tables).


As well, I mention a data file from “Data University.” This is a file of anonymized sample data provided to me by Peter Wylie and John Sammis, which readers of my book will be able to download and use to learn the concepts.


In your data file for Data University, you’ve got a variable for “Business phone present” and “Job title present”. You’d like to know if these two factors are related to each other. If they are completely unrelated, they are said to be independent of each other. We can analyze this in Data Desk using contingency tables. (If you don’t know why we care about dependence or independence among variables, read my previous post.)


Click once on the icon for ‘BusPhone Present’ to select it as the first variable, then Shift-click on ‘JobTitle Present’ to select it as the second variable. Under the Calc menu at the top of the screen, select Contingency Tables. The displayed result will depend on what default settings are in effect, but it might look like the table below. (Just as with frequency tables, you can play with display options from a menu available by clicking on the little triangle in the top left corner of the window.)




A contingency table gives us counts of how many cases fall into the categories that result from the intersection of two categorical variables. In this example, we have two binary variables, therefore there are four possibilities, arranged in a two-by-two matrix. The two “levels” of ‘BusPhone Present’ are labeled vertically (down the rows), and the two ‘levels’ of ‘Job Title Present’ are labeled horizontally (across the columns).


The labeling of rows and columns in contingency tables can be a bit confusing when the category names are just ones and zeroes, so here’s the same table with more descriptive labels.




If these variables are properly coded, with no missing values, each and every case in our data falls into one of the four cells of the matrix. We see for example that 2,240 cases are ‘0’ for ‘BusPhone Present’ and ‘0’ for ‘JobTitle Present’ – a large number of cases have neither piece of information coded in the database.


Looking at counts of cases in a contingency table can sometimes show cells with unusually many or unusually few cases. In this example, it seems to be fairly common that whenever Data University has a business phone number, it tends to have a job title as well, and it’s also common for both to be missing. It’s much less common for one to be present and the other not. This suggests there’s a relationship.


We’ve made a judgment call, just looking at the numbers. Fortunately the field of statistics provides some tests for making decisions more consistently and easily. If you have some training in stats you will readily grasp how we can apply these tests. For our purposes, we will rely on rules of thumb and simplified explanations rather than delve into a deep understanding of the tests themselves.


Click on the options menu for the contingency table, and add “Expected value” to the table, then click OK. This adds a new set of numbers to the table. The section at the bottom of the table, “table contents,” lists these elements in the order they appear in each cell. The first element is the count of cases that we’ve already seen. The other element is Expected Values.




“Expected values” are the counts of cases that would be expected to result if ‘BusPhone Present’ and ‘JobTltle Present’ were statistically independent of each other. In other words, if the two variables were completely unrelated, the probability that any one case would end up in a particular cell depends only on the probability that the case falls in a specified row and the probability that it falls in a specified column. In other words, we would expect the number of people with a business phone number AND who have a job title in the database to depend solely on the number of each in the sample – one doesn’t depend on the other. By this logic, the probability of a person having a business phone in the database is about the same whether or not they have a job title in the database, if it is true that one is not related to the other.


Have a quick look to compare each pair of numbers. You’ll notice that two actual counts are higher than the expected value, and two counts are lower than the expected value. If the variables were not associated with each other (independent), the actual counts would be a lot closer to the expected counts. It appears that if one variable is a zero or one, the other variable is more likely than not to have the same value – that is, we are likely to have either both or neither. Therefore, the variables do seem to be related.


Being able to compare the expected values to the observed values is helpful in identifying a pattern or relationship. But again, we are still just eyeballing the numbers and making a guess. Let’s take another step forward.


An assertion that two variables are not related is based on a concept in statistics called the null hypothesis. The null hypothesis in this case would state, “There is no relationship between Business Phone Present and Job Title Present.” In a test for independence between Business Phone and Job Title, if there is no relationship, then the null hypothesis is true. The alternative hypothesis is that the variables are in fact related – they are dependent, rather than independent.


The test for determining whether the null hypothesis is true in this case is called the chi-square test for independence. For our purposes, it is not necessary to formally state any kind of hypothesis, nor is it necessary to understand how chi-square is calculated. (Chi is pronounced “kai”.) Data Desk takes care of the calculations, and a rule of thumb for interpreting the result will suffice. (1)


Go back to the options for the contingency table. Deselect “Expected value”, and choose “Chi-square value” instead. Data Desk will then calculate the chi-square statistic. If it’s large, that is enough to reject the null hypothesis. This number is not interpreted on its own. Rather, the important statistic to watch for appears directly underneath it: p. The p-value in the table below is less than or equal to 0.0001 – a very small value.




The lowercase p stands for “probability.” A p-value of 0.0001 is the same as a 0.01% chance of something happening. We ask this question: “If there were no relationship between ‘BusPhone Present’ and “JobTitle Present’, what is the probability that you would get a result for chi-square at least as high as 1,141?” That probability is your p-value.


By convention, you need a p-value of 0.05 (i.e. 5 percent) or less to consider the probability low enough to conclude that there must, in fact, be a relationship between ‘BusPhone Present’ and ‘JobTitle Present’. (In other words, you can reject the null hypothesis.) The probability of obtaining a value of chi-square this large if the variables were independent is extremely low: less than 0.01%. The p-value in this case is very small, therefore the variables are strongly related.


We will encounter p-values again when we do regression. (2)


Given the foregoing discussion, you might be thinking that exploring relationships among variables is a complicated and subtle business. Not really. You don’t have to study expected values or formulate a null hypothesis – I introduced those things to help you understand where chi-square comes from. You only need these steps to test for presence of a relationship:


  1. Select the icons of the two variables.
  2. Create a contingency table from the Calc menu.
  3. In the table options menu, select “Chi-square value.”
  4. Regardless of the chi-square value, if the p-value is less than 0.05, the variables are not independent – they are related somehow.


You can create a new contingency table whenever you want to test two new variables, or you can simply drag a new category variable into an existing table. Just drag a new categorical variable icon on top of the name of the variable you want to replace. These names are in bold face text near the top of the table window, and will highlight when touched by the cursor during the drag. You can replace both factors at once by selecting two new variables and dragging them both into the centre of the table.


Before we move on, try dragging other binary variables into the contingency table to quickly see which combinations yield some interesting associations.


A more relevant example


Dragging variables one by one into a contingency table is a useful exploration step while evaluating potential predictors for use in a model. This implies that one of the variables must be the target (outcome) variable.


Let’s say you plan to build a predictive model to identify which alumni are more likely to make a donation at a higher level than usual, and you want to make a preliminary assessment of potential predictors so you can toss out any that don’t look promising, while keeping the others for use in the regression analysis. You can create a target variable using a lifetime giving variable that will allow you to do this analysis using contingency tables.


Working with the Data University file, create a new derived variable called ‘Big donor’ with the expression:


‘Lifetime HC’ > 999


This will create a binary variable that evaluates to ‘1’ for alumni with lifetime hard-credit giving greater than $999, and ‘0’ for everyone else.


You can use whatever dollar value you like, but $1,000 and up will work here. According to a frequency table of ‘Big donor’, 602 people have lifetime giving of $1,000 or more.


Click on the icon for ‘Big donor’ to select it and create a contingency table from the Calc menu, and change the table options to include chi-square. Now the window is ready for testing variables one by one, by dragging variables into the space that is initially labeled “Drag Variable Here.”


Chi-square is great for indicating that two variables are related, but a little more information will help you understand the nature of the relationship. When you drag in “BusPhone Present,” for example, it seems apparent that the relationship is significant, to judge from the chi-square value. One tweak will tell you something more concrete: Go back to the table options, deselect “Count” and select “Percent of column total.”




This replaces cell counts with percent values – the percent breakdown for each column. The circled values in the table above are the ones to pay attention to – they are the cases that have a ‘1’ for ‘Big donor’. Here is what we can read from this:


  • The first column of figures includes all cases for which there is no business phone in the database (0). Only 8.39% of people with no business phone are also top donors.
  • The second column includes all cases with a business phone (1) – 18.4% of people with a business phone are top donors.


People with a business phone are more than twice as likely to be top donors, and the chi-square value indicates that the relationship is significant. It might be hasty to conclude that ‘BusPhone Present’ is truly a “predictor” of high-level giving, but the association is clearly there in the data.


(These percentages total on the columns. If ‘Big donor’ were your second variable instead of your first, the matrix would be ordered in the other direction, and you would choose “Percent of row total” from the table options instead.)


Try dragging other binary variables from the Data University set into the table, replacing ‘BusPhone Present’, and observe how the percent breakdowns differ from group to group, and whether that difference is significant.


In my book-in-progress, I go on to talk about exploring categorical variables with multiple categories – not just binary variables – but that’s all for today. Again, if you’re interested in knowing more as this project progress, please subscribe to this blog via email from the box in the right sidebar.


End notes


1) If you must know, the value of chi-square is the sum of the squared standardized residuals across all cells in the table. The standardized residual describes the extent to which the observed count differs from the expected count. You can display this statistic in Data Desk along with the others, but there is no need. Another abbreviation in the table is “df”, which stands for degrees of freedom. Probability values of chi-square depend upon the degrees of freedom, which in turn is related to the number of rows and columns in the table.


2) Use and misuse of p-values is a hot topic in statistics, especially in connection with academic publishing, where putting too much stock in p-values and the rather arbitrary 0.05 threshold causes a great deal of angst. As predictive modelers we need to keep these controversies in perspective: We are not testing the effects of new medical treatments nor are we seeking to publish in a scientific journal. We are making use of “good-enough” tools that will produce a useful result. That said, applying the 0.05 threshold without judgment and common sense means that we will detect “relationships” that are not real – potentially about 5% of the time!



7 January 2015

New finds in old models


When you build a predictive model, you can never be sure it’s any good until it’s too late. Deploying a mediocre model isn’t the worst mistake you can make, though. The worst mistake would be to build a second mediocre model because you haven’t learned anything from the failure of the first.


Performance against a holdout data set for validation is not a reliable indicator of actual performance after deployment. Validation may help you decide which of two or more competing models to use, or it may provide reassurance that your one model isn’t total junk. It’s not proof of anything, though. Those lovely predictors, highly correlated with the outcome, could be fooling you. There are no guarantees they’re predictive of results over the year to come.


In the end, the only real evidence of a model’s worth is how it performs on real results. The problem is, those results happen in the future. So what is one to do?


I’ve long been fascinated with Planned Giving likelihood. Making a bequest seems like the ultimate gesture of institutional affinity (ultimate in every sense). On the plus side, that kind of affinity ought to be clearly evidenced in behaviours such as event attendance, giving, volunteering and so on. On the negative side, Planned Giving interest is uncommon enough that comparing expectancies with non-expectancies will sometimes lead to false predictors based on sparse data. For this reason, my goal of building a reliable model for predicting Planned Giving likelihood has been elusive.


Given that a validation data set taken from the same time period as the training data can produce misleading correlations, I wondered whether I could do one better: That is, be able to draw my holdout sample not from data of the same time period as that used to build the model, but from the future.


As it turned out, yes, I could.


Every year I save my regression analyses as Data Desk files. Although I assess the performance of the output scores, I don’t often go back to the model files themselves. However, they’re there as a document of how I approached modelling problems in the past. As a side benefit, each file is also a snapshot of the alumni population at that point in time. These data sets may consist of a hundred or more candidate predictor variables — a well-rounded picture.


My thinking went like this: Every old model file represents data from the past. If I pretend that this snapshot is really the present, then in order to have access to knowledge of the future, all I have to do is look at today’s data stored in the database.


For example, for this blog post, I reached back two years to a model I created in Data Desk for predicting likelihood to upgrade to the Leadership level in Annual Giving. I wasn’t interested in the model itself. Rather, I wanted to examine the underlying variables I had to work with at the time. This model had been an ambitious undertaking, with some 170 variables prepared for analysis. Many of course were transformations of variables or combinations of interacting variables. Among all those variables was one indicating whether a case was a current Planned Giving expectancy or not, at that point in time.


In this snapshot of the database from two years ago, some of the cases that were not expectancies would have become so since then. In other words, I now had the best of both worlds. I had a comprehensive set of potential predictors as they existed two years ago, AND access to the hitherto unknowable future: The identities of the people who had become expectancies after the predictors had been frozen in time.


As I said, my old model was not intended to predict Planned Giving inclination. So I built a new model, using “Is an Expectancy” (0/1) as the target variable. I trained the regression model on the two-year-old expectancy data — I didn’t even look at the new expectancies while building the model. No: I used those new expectancies as my validation data set.


“Validation” might be too strong a word, given that there were only 80 or so new cases. That’s a lot of bequest intentions, for sure, but in terms of data it’s a drop in the bucket compared with the number of cases being scored. Let’s call it a test data set. I used this test set to help me analyze the model, in a couple of ways.


First I looked at how new expectancies were scored by the model I had just built. The chart below shows their distribution by score decile. Slightly more than 50% of new expectancies were in the top decile. This looks pretty good — keeping in mind that this is what actual performance would have looked like had I really built this model two years ago (which I could have):




(Even better, looking at percentiles, most of the expectancies in that top 10% are concentrated nicely in the top few percentiles.)


But I didn’t stop there. It is also evident that almost half of new expectancies fell outside the top 10 percent of scores, so clearly there was room for improvement. My next step was to examine the individual predictors I had used in the model. These were of course the predictors most highly correlated with being an expectancy. They were roughly the following:
  • Year person’s personal information in the database was last updated
  • Number of events attended
  • Age
  • Year of first gift
  • Number of alumni activities
  • Indicated “likely to donate” on 2009 alumni survey
  • Total giving in last five years (log transformed)
  • Combined length of name Prefix + Suffix


I ranked the correlation of each of these with the 0/1 indicator meaning “new expectancy,” and found that most of the predictors were still fine, although they changed their order in the rank correlation. Donor likelihood (from survey) and recent giving were more important, and alumni activities and how recently a person’s record was updated were less important.


This was interesting and useful, but what was even more useful was looking at the correlations between ALL potential predictors and the state of being a new expectancy. A number of predictors that would have been too far down the ranked list to consider using two years ago were suddenly looking much better. In particular, many variables related to participation in alumni surveys bubbled closer to the top as potentially significant.


This exercise suggests a way to proceed with iterative, yearly improvements to some of your standard models:
  • Dig up an old model from a year or more ago.
  • Query the database for new cases that represent the target variable, and merge them with the old datafile.
  • Assess how your model performed or, if you created more than one model, see which model would have performed best. (You should be doing this anyway.)
  • Go a layer deeper, by studying the variables that went into those models — the data “as it was” — to see which variables had correlations that tricked you into believing they were predictive, and which variables truly held predictive power but may have been overlooked.
  • Apply what you learn to the next iteration of the model. Leave out the variables with spurious correlations, and give special consideration to variables that may have been underestimated before.

2 May 2013

New twists on inferring age from first name

Filed under: Analytics, Coolness, Data Desk, Fun — Tags: , , , — kevinmacdonell @ 6:14 am

Not quite three years ago I blogged about a technique for estimating the age of your database constituents when you don’t have any relevant data such as birth date or class year. It was based on the idea that many first names are typically “young” or “old.” I expanded on the topic in a followup post: Putting an age-guessing trick to the test. Until now, I’ve never had a reason to guess someone’s age — alumni data is pretty well supplied in that department. This very month, though, I have not one but two major modeling projects to work on that involve constituents with very little age data present. I’ve worked out a few improvements to the technique which I will share today.

First, here’s the gist of the basic idea. Picture two women, named Freda and Katelyn. Do you imagine one of them as older than the other? I’m guessing you do. From your own experience, you know that a lot of young women and girls are named Katelyn, and that few if any older women are. Even if you aren’t sure about Freda, you would probably guess she’s older. If you plug these names into, you’ll see that Freda was a very popular baby name in the early 1900s, but fell out of the Top 1000 list sometime in the 1980s. On the other hand, Katelyn didn’t enter the Top 1000 until the 1970s and is still popular.

To make use of this information you need to turn it into data. You need to acquire a lot of data on the frequency of first names and how young or old they tend to be. If you work for a university or other school, you’re probably in luck: You might have a lot of birth dates for your alumni or, failing that, you have class years which in most cases will be a good proxy for age. This will be the source you’ll use for guessing the age of everyone else in your database — friends, parents and other person constituents — who don’t have ages. If you have a donor database that contains no age data, you might be able to source age-by-first name data somewhere else.

Back to Freda and Katelyn … when I query our database I find that the average age of constituents named Freda is 69, while the average age for Katelyn is 25. For the purpose of building a model, for anyone named Freda without an age, I will just assume she is 69, and for anyone named Katelyn, 25. It’s as simple as creating a table with two columns (First name and Average age), and matching this to your data file via First Name. My table has more than 13,500 unique first names. Some of these are single initials, and not every person goes by their first name, but that doesn’t necessarily invalidate the average age associated with them.

I’ve tested this method, and it’s an improvement over plugging missing values with an all-database average or median age. For a data set that has no age data at all, it should provide new information that wasn’t there before — information that is probably correlated with behaviours such as giving.

Now here’s a new wrinkle.

In my first post on this subject, I noted that some of the youngest names in our database are “gender flips.” Some of the more recent popular names used to be associated with the opposite gender decades ago. This seems to be most prevalent with young female names: Ainslie, Isadore, Sydney, Shelly, Brooke. It’s harder to find examples going in the other direction, but there are a few, some of them perhaps having to do with differences in ethnic origin: Kori, Dian, Karen, Shaune, Mina, Marian. In my data I have close to 600 first names that belong to members of both sexes. When I calculate average age by First Name separately for each sex, some names end up with the exact same age for male and female. These names have an androgynous quality to them: Lyndsay, Riley, Jayme, Jesse, Jody. At the other extreme are the names that have definitely flipped gender, which I’ve already given examples of … one of the largest differences being for Ainslie. The average male named Ainslie is 54 years older than the average female of the same name. (In my data, that is.)

These differences suggest an improvement to our age-inferring method: Matching on not just First Name, but Sex as well. Although only 600 of my names are double-gendered, they include many popular names, so that they actually represent almost one-quarter of all constituents.

Now here’s another wrinkle.

When we’re dealing with constituents who aren’t alumni, we may be missing certain personal information such as Sex. If we plan to match on Sex as well as First Name, we’ve got a problem. If Name Prefix is present, we can infer from whether it’s Mr., Ms., etc., but unless the person doing the data entry was having an off day, this shouldn’t be an avenue available to us — it should already be filled in. (If you know it’s “Mrs.,” then why not put in F for Sex?) For those records without a Sex recorded (or have a Sex of ‘N’), we need to make a guess. To do so, we return to our First Names query and the Sex data we do have.

In my list of 600 first names that are double-gendered, not many are actually androgynous. We have females named John and Peter, and we have males named Mary and Laura, but we all know that given any one person named John, chances are we’re talking about a male person. Mary is probably female. These may be coding errors or they may be genuine, but in any case we can use majority usage to help us decide. We’ll sometimes get it wrong — there are indeed boys named Sue — but if you have 7,000 Johns in your database and only five of them are female, then let’s assume (just for the convenience of data mining*) that all Johns are male.

So: Query your database to retrieve every first name that has a Sex code, and count up the instance of each. The default sex for each first name is decided by the highest count, male or female. To get a single variable for this, I subtract the number of females from the number of males for each first name. Since the result is positive for males and negative for females, I call it a “Maleness Score” — but you can do the reverse and call it a Femaleness Score if you wish! Results of zero are considered ties, or ‘N’.

At this point we’ve introduced a bit of circularity. For any person missing Age and Sex, first we have to guess their sex based on the majority code assigned to that person’s first name, and then go back to the same data to grab the Age that matches up with Name and Sex. Clearly we are going to get it very wrong for a lot of records. You can’t expect these guesses to hold up as well as true age data. Overall, though, there should be some signal in all that noise … if your model believes that “Edgar” is male and 72 years of age, and that “Brittany” is female and 26, well, that’s not unreasonable and it’s probably not far from the truth.

How do we put this all together? I build my models in Data Desk, so I need to get all these elements into my data file as individual variables. You can do this any way that works for you, but I use our database querying software (Hyperion Brio). I import the data into Brio as locally-saved tab-delimited files and join them up as you see below. The left table is my modeling data (or at least the part of it that holds First Name), and the two tables on the right hold the name-specific ages and sexes from all the database records that have this information available. I left-join each of these tables on the First Name field.

age_tablesWhen I process the query, I get one row per ID with the fields from the left-hand table, plus the fields I need from the two tables on the right: the so-called Maleness Score, Female Avg Age by FName, Male Avg Age by Fname, and N Avg Age by Fname. I can now paste these as new variables into Data Desk. I still have work to do, though: I do have a small amount of “real” age data that I don’t want to overwrite, and not every First Name has a match in the alumni database. I have to figure out what I have, what I don’t have, and what I’m going to do to get a real or estimated age plugged in for every single record. I write an expression called Age Estimated to choose an age based on a hierarchical set of IF statements. The text of my expression is below — I will explain it in plain English following the expression.

if len('AGE')>0 then 'AGE'

else if textof('SEX')="M" and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="M" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="M" and len('F avg age by Fname')>0 then 'F avg age by Fname'

else if textof('SEX')="F" and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="F" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="F" and len('M avg age by Fname')>0 then 'M avg age by Fname'

else if textof('SEX')="N" and 'Maleness score'>0 and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'<0 and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'=0 and len('N avg age by Fname')>0 then 'N avg age by Fname'

else if len('N avg age by Fname')>0 then 'N avg age by Fname'
else if len('F avg age by Fname')>0 then 'F avg age by Fname'
else if len('M avg age by Fname')>0 then 'M avg age by Fname'

else 49

Okay … here’s what the expression actually does, going block by block through the statements:

  1. If Age is already present, then use that — done.
  2. Otherwise, if Sex is male, and the average male age is available, then use that. If there’s no average male age, then use the ‘N’ age, and if that’s not available, use the female average age … we can hope it’s better than no age at all.
  3. Otherwise if Sex is female, and the average female age is available, then use that. Again, go with any other age that’s available.
  4. Otherwise if Sex is ‘N’, and the Fname is likely male (according to the so-called Maleness Score), then use the male average age, if it’s available. Or if the first name is probably female, use the female average age. Or if the name is tied male-female, use the ‘N’ average age.
  5. Otherwise, as it appears we don’t have anything much to go on, just use any available average age associated with that first name: ‘N’, female, or male.
  6. And finally, if all else fails (which it does for about 6% of my file, or 7,000 records), just plug in the average age of every constituent in the database who has an age, which in our case is 49. This number will vary depending on the composition of your actual data file — if it’s all Parents, for example, then calculate the average of Parents’ known ages, excluding other constituent types.

When I bin the cases into 20 roughly equal groups by Estimated Age, I see that the percentage of cases that have some giving history starts very low (about 3 percent for the youngest group), rises rapidly to more than 10 percent, and then gradually rises to almost 18 percent for the oldest group. That’s heading in the right direction at least. As well, being in the oldest 5% is also very highly correlated with Lifetime Giving, which is what we would expect from a donor data set containing true ages.


This is a bit of work, and probably the gain will be marginal a lot of the time. Data on real interactions that showed evidence of engagement would be superior to age-guessing, but when data is scarce a bit of added lift can’t hurt. If you’re concerned about introducing too much noise, then build models with and without Estimated Age, and evaluate them against each other. If your software offers multiple imputation for missing data as a feature, try checking that out … what I’m doing here is just a very manual form of multiple imputation — calculating plausible values for missing data based on the values of other variables. Be careful, though: A good predictor of Age happens to be Lifetime Giving, and if your aim is to predict Giving, I should think there’s a risk your model will suffer from feedback.

* One final note …

Earlier on I mentioned assuming someone is male or female “just for the convenience of data mining.”  In our databases (and in a conventional, everyday sense too), we group people in various ways — sex, race, creed. But these categories are truly imperfect summaries of reality. (Some more imperfect than others!) A lot of human diversity is not captured in data, including things we formerly thought of as clear-cut. Sex seems conveniently binary, but in reality it is multi-category, or maybe it’s a continuous variable. (Or maybe it’s too complex for a single variable.) In real life I don’t assume that when someone in the Registrar’s Office enters ‘N’ for Sex that the student’s data is merely missing. Because the N category is still such a small slice of the population I might treat it as missing, or reapportion it to either Male or Female as I do here. But that’s strictly for predictive modeling. It’s not a statement about transgendered or differently gendered people nor an opinion about where they “belong.”

28 April 2010

Pearson product-moment correlation coefficient

Filed under: Data Desk, Model building, regression, Statistics — Tags: , , , — kevinmacdonell @ 8:54 am

You can’t have a serious blog post related to statistics without tossing in the name of a dead white guy. How dead? Well, yesterday was the 74th anniversary of Karl Pearson‘s death in Surrey, England. How white? Pearson was a fan of eugenics, social Darwinism, and the “struggle of race with race” – with the supposedly best race winning. Charming! Like him or not, his contribution to statistics (and therefore science) was huge.

I’ve written a bit about multiple linear regression, but not a great deal about how to do it. After I’ve got my data file ready and before I open up a regression window in Data Desk, ranking my predictor variables using ol’ Pearson’s measure of correlation is Step One. The object is to rank your predictor variables (a.k.a. independent variables) according to the strength (either positive or negative) of their linear correlation with your predicted value (a.k.a. dependent variable). I do this in order to determine the order in which I will introduce my predictor variables into my regression analysis. (All of the following discussion assumes that your variables are numerical; either continuous variables such as ‘class year’, or 0/1 indicator variables you’ve created from your categorical variables.)

The tool I’ve been taught to use is Pearson’s Product-Moment Correlation Coefficient, also called Pearson’s r. This is a quantitative tool which yields a coefficient that describes the slope of the line (as can best be determined) between your Y variable (say, ‘giving’) and one of your X’s (say, ‘class year’). A value of 1 denotes a perfectly linear correlation in a positive direction, and a value of -1 is a perfect negative correlation. All possible values fall between 1 and minus 1. Values near zero denote the absence of a linear correlation; they might be correlated in some other way, but not linearly.

(Click for larger view.)

A scatterplot of the two variables ‘giving‘ and ‘class year’, shown here, will reveal a relationship visually: ‘giving’ tends to decrease as ‘class year’ increases. This negative linear relationship is indicated by the downward-sloping line. The Pearson Correlation reveals the same thing, but not visually: it puts an actual number to it, and that number describes the slope of the line. Why is this important? Because for many of your variables, a scatterplot is just going to look like a mess – the linear relationship is in there somewhere, but it’s not evident from the cloud of points. If you have a calculated value instead, you can easily decide which linear relationships demand priority attention.

It’s easy to do in Data Desk. Just select the icon for ‘giving’ as your Y and also select all of your predictor variables (x), then go to Calc in the menu. Select Correlations, then Pearson Product-Moment. If you have a lot of variables, the resulting table will be impressively large. It will look fearsome or beautiful, depending on how you feel about being faced with a wall of numbers. (I think it’s gorgeous.) To find the value that relates one variable to another, find the intersection of the row and column of the two variables. For example, in the table below, the Pearson correlation value for ‘giving’ and ‘class year’ is -0.460. (The correlation of any variable with itself is, of course, a perfect 1.)

Have a look around this table. Don’t be concerned about the actual values. Just see which values are higher than others. For instance, look at the intersection of “position present” and “employer present“. It’s a very high value: 0.812, which is very close to 1! This tells us that these two predictor variables are going to “interact” with each other when we bring them into the regression analysis. It makes sense: Job title and employer name are likely to be present or absent in tandem, although not perfectly. The practical result is that one of these variables will prove to be a significant predictor, while the other adds little or nothing new, and will be left out.

So how do I decide which variable gets added before the other? It’s simple.

The only part of the whole Pearson table that we’re interested in is the column of values under the heading ‘giving’. Data Desk allows us to copy the table as text and paste it into Excel. When I do this and strip out all the stuff I’m not interested in, the result looks like this. (I’ve resorted the variables alphabetically).

Next, I sort the variables according to their Pearson correlation with Giving. The variables with the highest values will head the list. But notice a small problem: The strongest NEGATIVE variables end up at the very bottom. Really, with its high correlation with ‘giving’, the class year variable should rank first. So I do one extra step, creating a column with an Excel formula for the absolute value of the Pearson coefficient (i.e. without the minus sign), and re-sort on that value.

This gives me a clear idea of the order in which I should add variables to the regression. For example, ’employer present’ seems to edge out ‘position present’. Due to variable interaction, though, the final roster of which variables will stay and which will go is NOT evident at this point. The proof is in the regression, where all sorts of interesting and unforeseeable interactions may crop up.

You don’t have to take this manual approach to adding your variables – your software probably offers an automated, or partially automated, method called stepwise regression. But after all the work of preparing my predictors, I enjoy watching the way they interact with each other as I work through training the model. The way I see it, the more hands-on you are with your analysis, the more you learn about your data.

Final note: The examples above actually use a transformed value of ‘giving’ – the log of giving. Transforming our dependent variable using a logarithmic function is a perfectly valid way to make the linear relationships among variables much more evident. (Why we transform variables is explained more fully here.) If we used ‘giving’ just as it is, the Pearson values would be very low, which would indicate only a very weak linear relationship. Even ‘class year’ would have a low value, which we know isn’t a good description of the reality, which is better represented by the scatterplot above.

13 December 2009

Adding new variables to existing datafile in Data Desk, Part 2

Filed under: Data Desk, Derived variables — Tags: , , , , , — kevinmacdonell @ 8:04 pm

As explained in Part 1, you can simply paste a new variable into an existing relation IF there are the same number of cases in both the existing relation and the new variable, AND they are in exactly the same order. Otherwise you will have to perform a join, as follows.

As an example, let’s say my existing data set has variables for a unique identifier (ID), First Name, Last Name, and Class Year. And let’s say that for some reason, I need to add Degree as a variable. (For something simple like this, I would of course scrap this data file and re-query the database to get all the data I need; this is just an example.)

I’ll need to query the database for two things, ID and Degree. (Whenever you query your database, you must never fail to include a unique record identifier. In Banner this may be ID or PIDM.)

Copy the columns you want from Excel, Access or whatever your source is.

In Data Desk, go to the Data menu:  New > Relation. You’ll be pasting your new variables into a relation that is separate from the existing relation that contains the names and class years.

Name the folder (let’s call it “Deg“), and paste your variables. For neatness, drag the variables out of the Clipboard folder and into Deg, and drag the empty Clipboard to the trash.

To avoid confusing the new ID variable and the ID variable as already exists, rename the new variable. (eg. ‘ID-degree’)

Create a new derived variable in the Deg relation folder. Name it ‘ID-degree-lookup’, and give it this expression:


Remember to drag each of the two variables into the expression window rather than type them. The Lookup command will literally look up the value of the first argument in the list of values in the second argument.

Now create another derived variable, again in the Deg relation folder. Call it ‘Degree-joined’. Give it this expression:


The GetCase command will retrieve the value of ‘Degree’ that matches each looked-up ID value. The nested TextOf command is needed because the value of ‘Degree’ is non-numerical. If the variable we were adding was a numerical value, such as Lifetime Giving, the TextOf command would not be needed.

You can check if the variable works by choosing Show numbers from the variable’s Hyperview menu, or by dragging the variable into a frequency table.

‘Degree-joined’ is now a full member of both relations – you can drag it into the folder that contains your original data without getting an error message. It’s now available to be used like any other variable in the original data set.

If this seems a little confusing at first, don’t worry. Print these directions out and refer to them as needed. If you are building a predictive model and discovering new variables as you go along, you’ll be calling on this technique so often that eventually you’ll know it by heart.

11 December 2009

Adding new variables to existing datafile in Data Desk, Part 1

Filed under: Data Desk — Tags: , , , — kevinmacdonell @ 11:47 am

When I form a query on our database using MS Access, I try to make it as complete as possible before starting work in Data Desk. I ask myself, do I have all the columns (variables) I need for this task? If not, I modify and run the query again.

Invariably, though, I realize I’ve forgotten something only after I am well underway in Data Desk and starting from scratch is not an option. Or sometimes the data set has to contain so many variables (such as in any typical predictive modeling project) that there’s no way I can pull them out of the database with just one query (MS Access can handle only so much at once).

There is a lazy way to do this, which I do not recommend. You can simply paste a new variable into an existing relation, under these conditions:

  • The number of cases in your new variable is exactly the same as in your existing data file.
  • All the cases (‘rows’ in Excel) in the new variable are in exactly the same order as your existing data file.

These conditions can be easily met if your existing data and your new data is coming from the same static spreadsheet stored on your own hard drive. But a university database is a living thing. The number of ‘living, addressable alumni’ changes by the hour, so I can never be sure that any new variable I pull out of it is going to match up with the data I pulled yesterday.

Even if you ARE grabbing data from your own spreadsheet, what happens if you’ve reordered the rows in the meantime? Data Desk will allow you to paste the variable (because the number of cases is the same), but the data will be mismatched.

Fortunately it takes just a couple of steps in Data Desk to perform what I call a ‘merge’ or ‘join’ using the Lookup and GetCase functions. Sometimes I call it stitching data together – that’s what it feels like I’m doing.

I’ll show you how in: “Adding new variables to existing datafile in Data Desk, Part 2.

Older Posts »

The Silver is the New Black Theme. Create a free website or blog at


Get every new post delivered to your Inbox.

Join 1,212 other followers