# CoolData blog

## 3 September 2015

### Testing associations between two categorical variables

Over the coming months I will be adapting various bits from a work in progress, a new book on building predictive models using regression, aimed at people working in nonprofit and higher education advancement.

(If you’re interested, I suggest subscribing via email — see the box to the right — to have the inside track on this project.)

In my previous post, I wrote about “Exploring associations between variables.” Today I go deeper to describe exactly how to test for an association or relationship between two variables that are categorical.

These directions are specific to the statistics software Data Desk, but apply equally to any stats package. I make frequent reference to contingency tables, which you may know better as cross-tabs or R x C tables (row by column tables).

As well, I mention a data file from “Data University.” This is a file of anonymized sample data provided to me by Peter Wylie and John Sammis, which readers of my book will be able to download and use to learn the concepts.

In your data file for Data University, you’ve got a variable for “Business phone present” and “Job title present”. You’d like to know if these two factors are related to each other. If they are completely unrelated, they are said to be independent of each other. We can analyze this in Data Desk using contingency tables. (If you don’t know why we care about dependence or independence among variables, read my previous post.)

Click once on the icon for ‘BusPhone Present’ to select it as the first variable, then Shift-click on ‘JobTitle Present’ to select it as the second variable. Under the Calc menu at the top of the screen, select Contingency Tables. The displayed result will depend on what default settings are in effect, but it might look like the table below. (Just as with frequency tables, you can play with display options from a menu available by clicking on the little triangle in the top left corner of the window.)

A contingency table gives us counts of how many cases fall into the categories that result from the intersection of two categorical variables. In this example, we have two binary variables, therefore there are four possibilities, arranged in a two-by-two matrix. The two “levels” of ‘BusPhone Present’ are labeled vertically (down the rows), and the two ‘levels’ of ‘Job Title Present’ are labeled horizontally (across the columns).

The labeling of rows and columns in contingency tables can be a bit confusing when the category names are just ones and zeroes, so here’s the same table with more descriptive labels.

If these variables are properly coded, with no missing values, each and every case in our data falls into one of the four cells of the matrix. We see for example that 2,240 cases are ‘0’ for ‘BusPhone Present’ and ‘0’ for ‘JobTitle Present’ – a large number of cases have neither piece of information coded in the database.

Looking at counts of cases in a contingency table can sometimes show cells with unusually many or unusually few cases. In this example, it seems to be fairly common that whenever Data University has a business phone number, it tends to have a job title as well, and it’s also common for both to be missing. It’s much less common for one to be present and the other not. This suggests there’s a relationship.

We’ve made a judgment call, just looking at the numbers. Fortunately the field of statistics provides some tests for making decisions more consistently and easily. If you have some training in stats you will readily grasp how we can apply these tests. For our purposes, we will rely on rules of thumb and simplified explanations rather than delve into a deep understanding of the tests themselves.

Click on the options menu for the contingency table, and add “Expected value” to the table, then click OK. This adds a new set of numbers to the table. The section at the bottom of the table, “table contents,” lists these elements in the order they appear in each cell. The first element is the count of cases that we’ve already seen. The other element is Expected Values.

“Expected values” are the counts of cases that would be expected to result if ‘BusPhone Present’ and ‘JobTltle Present’ were statistically independent of each other. In other words, if the two variables were completely unrelated, the probability that any one case would end up in a particular cell depends only on the probability that the case falls in a specified row and the probability that it falls in a specified column. In other words, we would expect the number of people with a business phone number AND who have a job title in the database to depend solely on the number of each in the sample – one doesn’t depend on the other. By this logic, the probability of a person having a business phone in the database is about the same whether or not they have a job title in the database, if it is true that one is not related to the other.

Have a quick look to compare each pair of numbers. You’ll notice that two actual counts are higher than the expected value, and two counts are lower than the expected value. If the variables were not associated with each other (independent), the actual counts would be a lot closer to the expected counts. It appears that if one variable is a zero or one, the other variable is more likely than not to have the same value – that is, we are likely to have either both or neither. Therefore, the variables do seem to be related.

Being able to compare the expected values to the observed values is helpful in identifying a pattern or relationship. But again, we are still just eyeballing the numbers and making a guess. Let’s take another step forward.

An assertion that two variables are not related is based on a concept in statistics called the null hypothesis. The null hypothesis in this case would state, “There is no relationship between Business Phone Present and Job Title Present.” In a test for independence between Business Phone and Job Title, if there is no relationship, then the null hypothesis is true. The alternative hypothesis is that the variables are in fact related – they are dependent, rather than independent.

The test for determining whether the null hypothesis is true in this case is called the chi-square test for independence. For our purposes, it is not necessary to formally state any kind of hypothesis, nor is it necessary to understand how chi-square is calculated. (Chi is pronounced “kai”.) Data Desk takes care of the calculations, and a rule of thumb for interpreting the result will suffice. (1)

Go back to the options for the contingency table. Deselect “Expected value”, and choose “Chi-square value” instead. Data Desk will then calculate the chi-square statistic. If it’s large, that is enough to reject the null hypothesis. This number is not interpreted on its own. Rather, the important statistic to watch for appears directly underneath it: p. The p-value in the table below is less than or equal to 0.0001 – a very small value.

The lowercase p stands for “probability.” A p-value of 0.0001 is the same as a 0.01% chance of something happening. We ask this question: “If there were no relationship between ‘BusPhone Present’ and “JobTitle Present’, what is the probability that you would get a result for chi-square at least as high as 1,141?” That probability is your p-value.

By convention, you need a p-value of 0.05 (i.e. 5 percent) or less to consider the probability low enough to conclude that there must, in fact, be a relationship between ‘BusPhone Present’ and ‘JobTitle Present’. (In other words, you can reject the null hypothesis.) The probability of obtaining a value of chi-square this large if the variables were independent is extremely low: less than 0.01%. The p-value in this case is very small, therefore the variables are strongly related.

We will encounter p-values again when we do regression. (2)

Given the foregoing discussion, you might be thinking that exploring relationships among variables is a complicated and subtle business. Not really. You don’t have to study expected values or formulate a null hypothesis – I introduced those things to help you understand where chi-square comes from. You only need these steps to test for presence of a relationship:

1. Select the icons of the two variables.
2. Create a contingency table from the Calc menu.
3. In the table options menu, select “Chi-square value.”
4. Regardless of the chi-square value, if the p-value is less than 0.05, the variables are not independent – they are related somehow.

You can create a new contingency table whenever you want to test two new variables, or you can simply drag a new category variable into an existing table. Just drag a new categorical variable icon on top of the name of the variable you want to replace. These names are in bold face text near the top of the table window, and will highlight when touched by the cursor during the drag. You can replace both factors at once by selecting two new variables and dragging them both into the centre of the table.

Before we move on, try dragging other binary variables into the contingency table to quickly see which combinations yield some interesting associations.

A more relevant example

Dragging variables one by one into a contingency table is a useful exploration step while evaluating potential predictors for use in a model. This implies that one of the variables must be the target (outcome) variable.

Let’s say you plan to build a predictive model to identify which alumni are more likely to make a donation at a higher level than usual, and you want to make a preliminary assessment of potential predictors so you can toss out any that don’t look promising, while keeping the others for use in the regression analysis. You can create a target variable using a lifetime giving variable that will allow you to do this analysis using contingency tables.

Working with the Data University file, create a new derived variable called ‘Big donor’ with the expression:

This will create a binary variable that evaluates to ‘1’ for alumni with lifetime hard-credit giving greater than \$999, and ‘0’ for everyone else.

You can use whatever dollar value you like, but \$1,000 and up will work here. According to a frequency table of ‘Big donor’, 602 people have lifetime giving of \$1,000 or more.

Click on the icon for ‘Big donor’ to select it and create a contingency table from the Calc menu, and change the table options to include chi-square. Now the window is ready for testing variables one by one, by dragging variables into the space that is initially labeled “Drag Variable Here.”

Chi-square is great for indicating that two variables are related, but a little more information will help you understand the nature of the relationship. When you drag in “BusPhone Present,” for example, it seems apparent that the relationship is significant, to judge from the chi-square value. One tweak will tell you something more concrete: Go back to the table options, deselect “Count” and select “Percent of column total.”

This replaces cell counts with percent values – the percent breakdown for each column. The circled values in the table above are the ones to pay attention to – they are the cases that have a ‘1’ for ‘Big donor’. Here is what we can read from this:

• The first column of figures includes all cases for which there is no business phone in the database (0). Only 8.39% of people with no business phone are also top donors.
• The second column includes all cases with a business phone (1) – 18.4% of people with a business phone are top donors.

People with a business phone are more than twice as likely to be top donors, and the chi-square value indicates that the relationship is significant. It might be hasty to conclude that ‘BusPhone Present’ is truly a “predictor” of high-level giving, but the association is clearly there in the data.

(These percentages total on the columns. If ‘Big donor’ were your second variable instead of your first, the matrix would be ordered in the other direction, and you would choose “Percent of row total” from the table options instead.)

Try dragging other binary variables from the Data University set into the table, replacing ‘BusPhone Present’, and observe how the percent breakdowns differ from group to group, and whether that difference is significant.

In my book-in-progress, I go on to talk about exploring categorical variables with multiple categories – not just binary variables – but that’s all for today. Again, if you’re interested in knowing more as this project progress, please subscribe to this blog via email from the box in the right sidebar.

End notes

1) If you must know, the value of chi-square is the sum of the squared standardized residuals across all cells in the table. The standardized residual describes the extent to which the observed count differs from the expected count. You can display this statistic in Data Desk along with the others, but there is no need. Another abbreviation in the table is “df”, which stands for degrees of freedom. Probability values of chi-square depend upon the degrees of freedom, which in turn is related to the number of rows and columns in the table.

2) Use and misuse of p-values is a hot topic in statistics, especially in connection with academic publishing, where putting too much stock in p-values and the rather arbitrary 0.05 threshold causes a great deal of angst. As predictive modelers we need to keep these controversies in perspective: We are not testing the effects of new medical treatments nor are we seeking to publish in a scientific journal. We are making use of “good-enough” tools that will produce a useful result. That said, applying the 0.05 threshold without judgment and common sense means that we will detect “relationships” that are not real – potentially about 5% of the time!

## Guest post by John Sammis and Peter B. Wylie

Thanks to all of you who read and commented on our recent paper comparing logistic regression with multiple regression. We were not sure how popular this topic would be, but Kevin told us that interest was high, and there were a number of comments and questions. There were several general themes in the comments; Kevin has done an excellent job responding, but we thought we should throw in our two cents.

Why not just use logistic?

The point of our paper was not to suggest that logistic regression should not be used — our point was that multiple regression can achieve prediction results quite similar to logistic regression. Based on our experience working with and training fundraising professionals getting introduced to analytics, logistic regression can be intimidating. Our goal is always to get these folks to use analytics to help with their fundraising initiatives. We find many of them catch on with multiple regression, and much less so with logistic regression.

Predicted values vs. probabilities

We understand that the predicted values generated by multiple regression are different from the probabilities generated by logistic regression. Regardless of the statistic modeling technique we use, we always bin the raw prediction or probability values into equal-sized score levels. We have found that score level bins are easier to use than raw values. And using equal-sized score levels allows for easier evaluation of the scoring model.

“I cannot agree”

Some commenters, knowledgeable about statistics, said they would not use multiple regression when the inputs called for logistic. According to the rules, if the target variable is binary, then linear modelling doesn’t make sense — and the rules must be obeyed. In our view, this rigid approach to method selection is inappropriate for predictive modelling. The use of multiple linear regression in place of logistic regression may not always make theoretical sense, but predictive modellers are concerned with whether or not a model produces an output that is useful in practical terms. The worth of a model is testable against new, real-world data, therefore a model has only one criterion for determining “appropriate” use: Whether it really predicts what the modeler claims it will predict. The truth is revealed during evaluation.

A modest proposal

No one reading this should simply take our word that these two dissimilar methods yield similar results. Neither should anyone dismiss it out of hand without providing a critique based on real data. We would encourage anyone to try doing something on your own with data using both techniques and show us what you find. In particular, graduate students looking for a thesis or dissertation topic might consider producing something under this title: “Comparing Logistic Regression and Multiple Regression as Techniques for Predicting Major Giving.”

Heck! Peter says that if anyone were interested in doing a study like this for a thesis or dissertation, he would be willing to offer advice on how to:

1. Do a thorough literature review
2. Formulate specific research questions
3. Come up with a study design
4. Prepare a proposal that would satisfy a thesis or dissertation committee.

That’s quite an offer. How about it?

## 29 June 2010

### Choosing the right flavour of regression

Filed under: Model building, regression, Statistics, Terminology, Uncategorized — Tags: , , — kevinmacdonell @ 5:50 am

(Creative Commons license. Click image for source.)

I use two types of regression analysis to build predictive models: multiple linear regression and binary logistic regression. Both are called “regression”, but they are very different animals. You can use either one to build a model, but which one is best for fundraising models?

The answer is that there is no best option that applies across the board. It depends on what you’re trying to predict, certainly, but even more so it depends on the data itself. The best option will not be obvious and will be revealed only in testing. I don’t mean to sound careless about proper statistical practice, but we work in the real world: It’s not so much a question of “which tool is most appropriate?” as “Which tool WORKS?”

One of the primary differences between the two types of regression is the definition of the dependent variable. In logistic regression, this outcome variable is either 1 or 0. (There are other forms of logistic regression with multiple nominal outcomes, but I’ll stick to binary outcomes for now.) An example might be “Is a donor / Is not a donor,” or “Is a Planned Giving expectancy/ Is not a planned giving expectancy.”

In multiple regression, the dependent variable is typically a continuous value, like giving expressed in real dollars (or log-transformed dollars). But the DV can also be a 0/1 value, just as in logistic regression. Technically using a binary variable violates one of the assumptions underlying multiple regression (a normal probability distribution of the DV), but that doesn’t necessarily invalidate the model as a powerful predictive tool. Again, what works?

Another difference, less important to my mind, is that the output of a multiple regression analysis is a predicted value that reflects the units (say, dollars) of the DV and may be interpretable as such (predicted lifetime giving, predicted gift amount, etc.), while the output of a logistic regression is a probability value. My practice is to transform both sorts of outputs into scores (deciles and percentiles) for all individuals under study; this allows me to refer to both model outputs simply as “likelihood” and compare them directly.

So which to use? I say, use both! If you want some extra confidence in the worth of your model, it isn’t that much trouble to prepare both score sets and see how they compare. The key is having a set of holdout cases that represent the behaviour of interest. If your model is predicting likelihood to become a Planned Giving expectancy, you first set aside some portion of existing PG expectancies, build the model without them, then see how well the model performed at assigning scores to that holdout set.

You can use this method of validation when you create only one model, too. But where do you set the bar for confidence in the model if you test only one? Having a rival model to compare with is very useful.

In my next post I will show you a real-world example, and explain how I decided which model worked best.

## 9 June 2010

### Why multiple regression?

Filed under: Model building, regression, Statistics — Tags: , , , — kevinmacdonell @ 6:11 am

Not long ago I wrote about Pearson’s r, also known as Pearson’s Product-Moment Correlation Coefficient. This is a convenient statistical tool available in any stats software program (Excel can calculate it too) that yields a numerical measure of the strength of the correlation (linear dependence) between any two variables, X and Y.

I will show you how you can easily create a predictive score using only Pearson’s r — and why you probably shouldn’t!

Pearson’s r points the way toward weighting your predictor variables, according to how strongly correlated with your predicted value they actually are. If you assume all your predictor variables are worth the same (“1” for positive predictors and “-1” for negative predictors) you are imposing a subjective weighting on your variables. (I wrote about this limitation of the simple score in my most recent post, Beyond the simple score.)

Is Homecoming attendance more or less predictive than presence of an email address? Pearson’s r will tell you. Look at the example in the table above. Ten common predictor variables are listed in order of their strength of correlation with Lifetime Giving (log-transformed). It just so happens that the highest two correlations are negative; even though they are negative, according to their absolute value they are stronger correlations than any of the others in the list, so I put them at the top.  According to these values, “Email present” is relatively weak compared with “Number of Homecomings attended.”

Do you see where this is going? If you wanted to, you could use these correlations to directly create weighted scores for everyone in your database. Just multiply each variable by its Pearson value, sum up the products, and bingo — there’s your score.

You could do that, but I don’t think you should.

At least two of the predictor variables we want to use in our score are very closely related to each other: Employer present and Position present. They aren’t exactly alike: For some constituents you will have one piece of information and not the other. But on the whole, if one is present in your database for any given constituent, chances are you’ve got the other as well.

In other words, if you include both variables in your score, you’re double-counting the effect of employment information in your model — despite the fact that each is properly weighted by Pearson score. The reason is that Pearson’s r treats only two variables at a time: X and Y. It does NOT account for any interactions between multiple Xs.

Employment variables are only an obvious example. All of your variables will interact with each other to some degree, some strongly, others more subtly. By “interact with each other” I mean “correlate with each other.” In fact, we can use Pearson’s r to show which combinations of predictor variables are strongly correlated. The table below lists three variable pairs, drawn from real data, that exhibit strong interactions — including the employment example we’ve just mentioned.

The Pearson value for the employment variables is very close to 1, which indicates a nearly perfect positive correlation. The other two are more subtle, but make sense as well: Younger alumni will tend to be coded as Single in your database, and if we have a job title for someone, chances are we’ll also have a business phone number as well.

This overlapping of the explanatory effect of various X’s on Y will interfere with our ability to properly weight our predictors. Pearson’s Product-Moment Correlation Coefficient is important for understanding our variables, but not quite up to the task of directly created predictive scores. What now?

Well — multiple regression! Only regression will account for interactions among our predictor Xs, recalculating the coefficients (weightings) on the fly each time we add a new predictor variable. Working from the Pearson list at the top of this post, we would add Class year, Single, and Employer present to our regression window one by one. Everything would be fine up to that point; our p-values will be very low for these variables. When we add Position present, however, the p-value will be too high (0.183, which exceeds the rule-of-thumb value of 0.05), and R squared will fail to improve. We would therefore leave Position present out of the regression because it isn’t adding any new predictive information to the model and might interfere with the effectiveness of other variables in subtle and strange ways.

Often when I use the word “regression” on someone, what I see reflected back in their eyes is fear. (I really need to reserve that word for people I don’t like.) I wish, though, that people could see that regression is a bit like an automobile: A complex machine with many moving parts, but familiar and approachable, with a simple and comprehensible purpose, and above all operable by just about anyone.

## 28 April 2010

### Pearson product-moment correlation coefficient

Filed under: Data Desk, Model building, regression, Statistics — Tags: , , , — kevinmacdonell @ 8:54 am

You can’t have a serious blog post related to statistics without tossing in the name of a dead white guy. How dead? Well, yesterday was the 74th anniversary of Karl Pearson‘s death in Surrey, England. How white? Pearson was a fan of eugenics, social Darwinism, and the “struggle of race with race” – with the supposedly best race winning. Charming! Like him or not, his contribution to statistics (and therefore science) was huge.

I’ve written a bit about multiple linear regression, but not a great deal about how to do it. After I’ve got my data file ready and before I open up a regression window in Data Desk, ranking my predictor variables using ol’ Pearson’s measure of correlation is Step One. The object is to rank your predictor variables (a.k.a. independent variables) according to the strength (either positive or negative) of their linear correlation with your predicted value (a.k.a. dependent variable). I do this in order to determine the order in which I will introduce my predictor variables into my regression analysis. (All of the following discussion assumes that your variables are numerical; either continuous variables such as ‘class year’, or 0/1 indicator variables you’ve created from your categorical variables.)

The tool I’ve been taught to use is Pearson’s Product-Moment Correlation Coefficient, also called Pearson’s r. This is a quantitative tool which yields a coefficient that describes the slope of the line (as can best be determined) between your Y variable (say, ‘giving’) and one of your X’s (say, ‘class year’). A value of 1 denotes a perfectly linear correlation in a positive direction, and a value of -1 is a perfect negative correlation. All possible values fall between 1 and minus 1. Values near zero denote the absence of a linear correlation; they might be correlated in some other way, but not linearly.

(Click for larger view.)

A scatterplot of the two variables ‘giving‘ and ‘class year’, shown here, will reveal a relationship visually: ‘giving’ tends to decrease as ‘class year’ increases. This negative linear relationship is indicated by the downward-sloping line. The Pearson Correlation reveals the same thing, but not visually: it puts an actual number to it, and that number describes the slope of the line. Why is this important? Because for many of your variables, a scatterplot is just going to look like a mess – the linear relationship is in there somewhere, but it’s not evident from the cloud of points. If you have a calculated value instead, you can easily decide which linear relationships demand priority attention.

It’s easy to do in Data Desk. Just select the icon for ‘giving’ as your Y and also select all of your predictor variables (x), then go to Calc in the menu. Select Correlations, then Pearson Product-Moment. If you have a lot of variables, the resulting table will be impressively large. It will look fearsome or beautiful, depending on how you feel about being faced with a wall of numbers. (I think it’s gorgeous.) To find the value that relates one variable to another, find the intersection of the row and column of the two variables. For example, in the table below, the Pearson correlation value for ‘giving’ and ‘class year’ is -0.460. (The correlation of any variable with itself is, of course, a perfect 1.)

Have a look around this table. Don’t be concerned about the actual values. Just see which values are higher than others. For instance, look at the intersection of “position present” and “employer present“. It’s a very high value: 0.812, which is very close to 1! This tells us that these two predictor variables are going to “interact” with each other when we bring them into the regression analysis. It makes sense: Job title and employer name are likely to be present or absent in tandem, although not perfectly. The practical result is that one of these variables will prove to be a significant predictor, while the other adds little or nothing new, and will be left out.

So how do I decide which variable gets added before the other? It’s simple.

The only part of the whole Pearson table that we’re interested in is the column of values under the heading ‘giving’. Data Desk allows us to copy the table as text and paste it into Excel. When I do this and strip out all the stuff I’m not interested in, the result looks like this. (I’ve resorted the variables alphabetically).

Next, I sort the variables according to their Pearson correlation with Giving. The variables with the highest values will head the list. But notice a small problem: The strongest NEGATIVE variables end up at the very bottom. Really, with its high correlation with ‘giving’, the class year variable should rank first. So I do one extra step, creating a column with an Excel formula for the absolute value of the Pearson coefficient (i.e. without the minus sign), and re-sort on that value.

This gives me a clear idea of the order in which I should add variables to the regression. For example, ’employer present’ seems to edge out ‘position present’. Due to variable interaction, though, the final roster of which variables will stay and which will go is NOT evident at this point. The proof is in the regression, where all sorts of interesting and unforeseeable interactions may crop up.

You don’t have to take this manual approach to adding your variables – your software probably offers an automated, or partially automated, method called stepwise regression. But after all the work of preparing my predictors, I enjoy watching the way they interact with each other as I work through training the model. The way I see it, the more hands-on you are with your analysis, the more you learn about your data.

Final note: The examples above actually use a transformed value of ‘giving’ – the log of giving. Transforming our dependent variable using a logarithmic function is a perfectly valid way to make the linear relationships among variables much more evident. (Why we transform variables is explained more fully here.) If we used ‘giving’ just as it is, the Pearson values would be very low, which would indicate only a very weak linear relationship. Even ‘class year’ would have a low value, which we know isn’t a good description of the reality, which is better represented by the scatterplot above.