CoolData blog

3 September 2015

Testing associations between two categorical variables


Over the coming months I will be adapting various bits from a work in progress, a new book on building predictive models using regression, aimed at people working in nonprofit and higher education advancement.


(If you’re interested, I suggest subscribing via email — see the box to the right — to have the inside track on this project.)


In my previous post, I wrote about “Exploring associations between variables.” Today I go deeper to describe exactly how to test for an association or relationship between two variables that are categorical.


These directions are specific to the statistics software Data Desk, but apply equally to any stats package. I make frequent reference to contingency tables, which you may know better as cross-tabs or R x C tables (row by column tables).


As well, I mention a data file from “Data University.” This is a file of anonymized sample data provided to me by Peter Wylie and John Sammis, which readers of my book will be able to download and use to learn the concepts.


In your data file for Data University, you’ve got a variable for “Business phone present” and “Job title present”. You’d like to know if these two factors are related to each other. If they are completely unrelated, they are said to be independent of each other. We can analyze this in Data Desk using contingency tables. (If you don’t know why we care about dependence or independence among variables, read my previous post.)


Click once on the icon for ‘BusPhone Present’ to select it as the first variable, then Shift-click on ‘JobTitle Present’ to select it as the second variable. Under the Calc menu at the top of the screen, select Contingency Tables. The displayed result will depend on what default settings are in effect, but it might look like the table below. (Just as with frequency tables, you can play with display options from a menu available by clicking on the little triangle in the top left corner of the window.)




A contingency table gives us counts of how many cases fall into the categories that result from the intersection of two categorical variables. In this example, we have two binary variables, therefore there are four possibilities, arranged in a two-by-two matrix. The two “levels” of ‘BusPhone Present’ are labeled vertically (down the rows), and the two ‘levels’ of ‘Job Title Present’ are labeled horizontally (across the columns).


The labeling of rows and columns in contingency tables can be a bit confusing when the category names are just ones and zeroes, so here’s the same table with more descriptive labels.




If these variables are properly coded, with no missing values, each and every case in our data falls into one of the four cells of the matrix. We see for example that 2,240 cases are ‘0’ for ‘BusPhone Present’ and ‘0’ for ‘JobTitle Present’ – a large number of cases have neither piece of information coded in the database.


Looking at counts of cases in a contingency table can sometimes show cells with unusually many or unusually few cases. In this example, it seems to be fairly common that whenever Data University has a business phone number, it tends to have a job title as well, and it’s also common for both to be missing. It’s much less common for one to be present and the other not. This suggests there’s a relationship.


We’ve made a judgment call, just looking at the numbers. Fortunately the field of statistics provides some tests for making decisions more consistently and easily. If you have some training in stats you will readily grasp how we can apply these tests. For our purposes, we will rely on rules of thumb and simplified explanations rather than delve into a deep understanding of the tests themselves.


Click on the options menu for the contingency table, and add “Expected value” to the table, then click OK. This adds a new set of numbers to the table. The section at the bottom of the table, “table contents,” lists these elements in the order they appear in each cell. The first element is the count of cases that we’ve already seen. The other element is Expected Values.




“Expected values” are the counts of cases that would be expected to result if ‘BusPhone Present’ and ‘JobTltle Present’ were statistically independent of each other. In other words, if the two variables were completely unrelated, the probability that any one case would end up in a particular cell depends only on the probability that the case falls in a specified row and the probability that it falls in a specified column. In other words, we would expect the number of people with a business phone number AND who have a job title in the database to depend solely on the number of each in the sample – one doesn’t depend on the other. By this logic, the probability of a person having a business phone in the database is about the same whether or not they have a job title in the database, if it is true that one is not related to the other.


Have a quick look to compare each pair of numbers. You’ll notice that two actual counts are higher than the expected value, and two counts are lower than the expected value. If the variables were not associated with each other (independent), the actual counts would be a lot closer to the expected counts. It appears that if one variable is a zero or one, the other variable is more likely than not to have the same value – that is, we are likely to have either both or neither. Therefore, the variables do seem to be related.


Being able to compare the expected values to the observed values is helpful in identifying a pattern or relationship. But again, we are still just eyeballing the numbers and making a guess. Let’s take another step forward.


An assertion that two variables are not related is based on a concept in statistics called the null hypothesis. The null hypothesis in this case would state, “There is no relationship between Business Phone Present and Job Title Present.” In a test for independence between Business Phone and Job Title, if there is no relationship, then the null hypothesis is true. The alternative hypothesis is that the variables are in fact related – they are dependent, rather than independent.


The test for determining whether the null hypothesis is true in this case is called the chi-square test for independence. For our purposes, it is not necessary to formally state any kind of hypothesis, nor is it necessary to understand how chi-square is calculated. (Chi is pronounced “kai”.) Data Desk takes care of the calculations, and a rule of thumb for interpreting the result will suffice. (1)


Go back to the options for the contingency table. Deselect “Expected value”, and choose “Chi-square value” instead. Data Desk will then calculate the chi-square statistic. If it’s large, that is enough to reject the null hypothesis. This number is not interpreted on its own. Rather, the important statistic to watch for appears directly underneath it: p. The p-value in the table below is less than or equal to 0.0001 – a very small value.




The lowercase p stands for “probability.” A p-value of 0.0001 is the same as a 0.01% chance of something happening. We ask this question: “If there were no relationship between ‘BusPhone Present’ and “JobTitle Present’, what is the probability that you would get a result for chi-square at least as high as 1,141?” That probability is your p-value.


By convention, you need a p-value of 0.05 (i.e. 5 percent) or less to consider the probability low enough to conclude that there must, in fact, be a relationship between ‘BusPhone Present’ and ‘JobTitle Present’. (In other words, you can reject the null hypothesis.) The probability of obtaining a value of chi-square this large if the variables were independent is extremely low: less than 0.01%. The p-value in this case is very small, therefore the variables are strongly related.


We will encounter p-values again when we do regression. (2)


Given the foregoing discussion, you might be thinking that exploring relationships among variables is a complicated and subtle business. Not really. You don’t have to study expected values or formulate a null hypothesis – I introduced those things to help you understand where chi-square comes from. You only need these steps to test for presence of a relationship:


  1. Select the icons of the two variables.
  2. Create a contingency table from the Calc menu.
  3. In the table options menu, select “Chi-square value.”
  4. Regardless of the chi-square value, if the p-value is less than 0.05, the variables are not independent – they are related somehow.


You can create a new contingency table whenever you want to test two new variables, or you can simply drag a new category variable into an existing table. Just drag a new categorical variable icon on top of the name of the variable you want to replace. These names are in bold face text near the top of the table window, and will highlight when touched by the cursor during the drag. You can replace both factors at once by selecting two new variables and dragging them both into the centre of the table.


Before we move on, try dragging other binary variables into the contingency table to quickly see which combinations yield some interesting associations.


A more relevant example


Dragging variables one by one into a contingency table is a useful exploration step while evaluating potential predictors for use in a model. This implies that one of the variables must be the target (outcome) variable.


Let’s say you plan to build a predictive model to identify which alumni are more likely to make a donation at a higher level than usual, and you want to make a preliminary assessment of potential predictors so you can toss out any that don’t look promising, while keeping the others for use in the regression analysis. You can create a target variable using a lifetime giving variable that will allow you to do this analysis using contingency tables.


Working with the Data University file, create a new derived variable called ‘Big donor’ with the expression:


‘Lifetime HC’ > 999


This will create a binary variable that evaluates to ‘1’ for alumni with lifetime hard-credit giving greater than $999, and ‘0’ for everyone else.


You can use whatever dollar value you like, but $1,000 and up will work here. According to a frequency table of ‘Big donor’, 602 people have lifetime giving of $1,000 or more.


Click on the icon for ‘Big donor’ to select it and create a contingency table from the Calc menu, and change the table options to include chi-square. Now the window is ready for testing variables one by one, by dragging variables into the space that is initially labeled “Drag Variable Here.”


Chi-square is great for indicating that two variables are related, but a little more information will help you understand the nature of the relationship. When you drag in “BusPhone Present,” for example, it seems apparent that the relationship is significant, to judge from the chi-square value. One tweak will tell you something more concrete: Go back to the table options, deselect “Count” and select “Percent of column total.”




This replaces cell counts with percent values – the percent breakdown for each column. The circled values in the table above are the ones to pay attention to – they are the cases that have a ‘1’ for ‘Big donor’. Here is what we can read from this:


  • The first column of figures includes all cases for which there is no business phone in the database (0). Only 8.39% of people with no business phone are also top donors.
  • The second column includes all cases with a business phone (1) – 18.4% of people with a business phone are top donors.


People with a business phone are more than twice as likely to be top donors, and the chi-square value indicates that the relationship is significant. It might be hasty to conclude that ‘BusPhone Present’ is truly a “predictor” of high-level giving, but the association is clearly there in the data.


(These percentages total on the columns. If ‘Big donor’ were your second variable instead of your first, the matrix would be ordered in the other direction, and you would choose “Percent of row total” from the table options instead.)


Try dragging other binary variables from the Data University set into the table, replacing ‘BusPhone Present’, and observe how the percent breakdowns differ from group to group, and whether that difference is significant.


In my book-in-progress, I go on to talk about exploring categorical variables with multiple categories – not just binary variables – but that’s all for today. Again, if you’re interested in knowing more as this project progress, please subscribe to this blog via email from the box in the right sidebar.


End notes


1) If you must know, the value of chi-square is the sum of the squared standardized residuals across all cells in the table. The standardized residual describes the extent to which the observed count differs from the expected count. You can display this statistic in Data Desk along with the others, but there is no need. Another abbreviation in the table is “df”, which stands for degrees of freedom. Probability values of chi-square depend upon the degrees of freedom, which in turn is related to the number of rows and columns in the table.


2) Use and misuse of p-values is a hot topic in statistics, especially in connection with academic publishing, where putting too much stock in p-values and the rather arbitrary 0.05 threshold causes a great deal of angst. As predictive modelers we need to keep these controversies in perspective: We are not testing the effects of new medical treatments nor are we seeking to publish in a scientific journal. We are making use of “good-enough” tools that will produce a useful result. That said, applying the 0.05 threshold without judgment and common sense means that we will detect “relationships” that are not real – potentially about 5% of the time!



26 August 2015

Exploring associations between variables

Filed under: Book, CoolData, Predictor variables — Tags: , , , — kevinmacdonell @ 6:57 pm


CoolData has been quiet over the summer, mainly because I’ve been busy writing another book. (Fine weather has a bit to do with it, too.) The book will be for nonprofit and higher education advancement professionals interested in learning how to use multiple regression to build predictive models. Over the next few months, I will adapt various bits from the work-in-progress as individual posts here on CoolData.


I’ll have more to say about the book later, so if you’re interested, I suggest subscribing via email (see the box to the right) to have the inside track on this project. (And if you aren’t familiar with the previous book, co-written with Peter Wylie, then have a look here.)


A generous chunk of the book is about the specifics of getting your hands dirty with cleaning up your messy data, transforming it to make it suitable for regression analysis, and exploring it for interesting patterns that can strengthen a predictive model.


When you import a data set into Data Desk or other statistics package, you are looking at more than just a jumble of variables. All these variables are in a relation; they are linked by patterns. Some variables are strongly associated with each other, others have weaker associations, and some are hardly related to each other at all.


What is meant by “association”? A classic example is a data set of children’s weights, heights, and ages. Older children tend to weigh more and be taller than younger children. Heavier children tend to be older and taller than younger children. We say that there is an association between age and weight, between height and weight, and between age and height.


Another example: Alumni who are bigger donors tend to attend more reunion events than alumni who give more modestly or don’t give at all. Or put the other way, alumni who attend more events tend to give more than alumni who attend fewer or no events. There is an association between giving and attending events.


This sounds simple enough — even obvious. The powerful consequence of these truths is that if we know the value of one variable, we can make a guess at the value of another, as long as the association is valid. So if we know a child’s weight and height, we can make a good guess of his or her age. If we know a child’s height, we can guess weight. If we know how many reunions an alumna has attended, we can make a guess about her level of giving. If we know how much she has given, we can guess whether she’s attended more or fewer reunions than other alumni.


We are guessing an unknown value (say, giving) based on a known value (number of events attended). But note that “giving” is not really an unknown. We’ve got everyone’s giving recorded in the database. What is really unknown is an alum’s or a donor’s potential for future giving. With predictive modeling, we are making a guess at what the value of a variable will be in the (near) future, based on the current value of other variables, and the type and degree of association they have had historically.


These guesses will be far from perfect. We aren’t going to be bang-on in our guesses of children’s ages based on weight and height, and we certainly aren’t going to be very accurate with our estimates of giving based on event attendance. Even trickier, projecting into the future — estimating potential — is going to be very approximate.


Still, our guesses will be informed guesses, as long as the associations we detect are real and not due to random variation in our data. Can we predict exactly how much each donor is going to give over this coming year? No, that would be putting too much confidence in our powers. But we can expect to have plenty of success in ranking our constituents in order by how likely they are to engage in whatever behaviour we are interested in, and that knowledge will be of great value to the business.


Looking for potentially useful associations is part of data exploration, which is best done in full hands-on mode! In a future post I will talk about specific techniques for exploring different types of variables.


19 August 2014

Score! … As pictured by you

Filed under: Book, Peter Wylie, Score! — Tags: , , — kevinmacdonell @ 7:25 pm
2014-07-18 06.41.41

Left to right: Elisa Shoenberger, Leigh Petersen Visaya, Rebekah O’Brien, and Alison Rane in Chicago. (Click for full size.)

During the long stretch of time that Peter Wylie and I were writing our book, Score! Data-Driven Success for Your Advancement Team, there were days when I thought that even if we managed to get the thing done, it might not be that great. There were just so many pieces that needed to fit together somehow … I guess we each didn’t want to let the other down, so we plugged on despite doubts and delays, and then, somehow, it got finished.

Whew, I thought. Washed my hands of that! I expected I would walk away from it,  move on to other projects, and be glad that I had my early mornings and weekends back.

That’s not what happened.

These few months later, my eye will still be caught now and then by the striking, colourful cover of the book sitting on my desk. It draws me to pick it up and flip through it — even re-read bits. I find myself thinking, “Hey, I like this.”

Of course, who cares, right? I am not the reader. However, whatever I might think about Score!, it has been even more gratifying for Peter and I to hear from folks who seem to like it as much as we do. How fun it has been to see that bright cover popping up in photos and on social media every once in a while.

I’ve collected a few of those photos and tweets here, along with some other images related to the book. Feel free to post your own “Score selfies” on Twitter using the hashtag #scorethebook. Or if you’re not into Twitter, send me a photo at

Click here to order your copy of Score! from the CASE Bookstore.

2014-10-03 19.37.25

2014-09-23 11.56.58

2014-08-14 06.34.25


Jennifer Cunningham, Senior Director, Metrics+Marketing for the Office of Alumni Affairs, Cornell University. @jenlynham

Click here to order your copy of Score! from the CASE Bookstore.

While we would like for you to buy it, we would LOVE for you to read it and put it to work in your shop. Your buying it earns us each enough money to buy a cup of coffee. Your READING it furthers the reach and impact of ideas and concepts that fascinate us and which we love to share.

22 April 2014

Score! ships tomorrow

Filed under: Book, Score! — Tags: , , — kevinmacdonell @ 7:29 pm

scoreThe printer delivered early, and a copy of Score! showed up at CASE headquarters in Washington DC this afternoon.

(Doug Goldenberg-Hart, CASE’s Director, Editorial Projects sent this photo to prove it.)

To everyone who put in an advance order, your copy will be available to ship tomorrow (Wednesday).

Peter Wylie and I sincerely hope you enjoy it.

Click here to order.


















23 December 2013

New from CASE Books: Score!

Filed under: Book, CoolData, Peter Wylie — Tags: , , , — kevinmacdonell @ 9:39 am

CASE_coverAs the year draws to a close, I’m pleased to announce that the book I’ve co-written with Peter Wylie will be available in January. ‘Score!’ joins a host of fine publications in CASE’s new catalog. I’m looking forward to having a look through this catalog for new books for the office. (‘Score’ is featured on page 12.)

So what is this new book about? The full title is Score!: Data-Driven Success for Your Advancement Team, and as a recent of issue of BriefCASE notes: “Kevin MacDonell and Peter Wylie walk readers through compelling arguments for why an organization should adopt data-driven decision-making as well as explanations of basic issues such as identifying and mining the pertinent data and what operations to perform once that data is in hand.”

You can read the rest of that article here: Ready to Score!?

20 August 2013

A book cover for “Score!”

Filed under: Book, Peter Wylie — Tags: , — kevinmacdonell @ 4:45 am

It has been a long time since I’ve offered an update on “Score!”, the forthcoming book I have co-authored with Peter Wylie. I apologize for that.  I do hope that readers who have known about this project for some time will feel that it is worth the wait. The revised date of availability is sometime this fall. (If you like instant gratification from your work, I would suggest you avoid the world of book publishing.)

We do have a cover image to show you. I like the funky colours.


Older Posts »

The Silver is the New Black Theme. Blog at


Get every new post delivered to your Inbox.

Join 1,212 other followers