# CoolData blog

## 3 September 2015

### Testing associations between two categorical variables

Over the coming months I will be adapting various bits from a work in progress, a new book on building predictive models using regression, aimed at people working in nonprofit and higher education advancement.

(If you’re interested, I suggest subscribing via email — see the box to the right — to have the inside track on this project.)

In my previous post, I wrote about “Exploring associations between variables.” Today I go deeper to describe exactly how to test for an association or relationship between two variables that are categorical.

These directions are specific to the statistics software Data Desk, but apply equally to any stats package. I make frequent reference to contingency tables, which you may know better as cross-tabs or R x C tables (row by column tables).

As well, I mention a data file from “Data University.” This is a file of anonymized sample data provided to me by Peter Wylie and John Sammis, which readers of my book will be able to download and use to learn the concepts.

In your data file for Data University, you’ve got a variable for “Business phone present” and “Job title present”. You’d like to know if these two factors are related to each other. If they are completely unrelated, they are said to be independent of each other. We can analyze this in Data Desk using contingency tables. (If you don’t know why we care about dependence or independence among variables, read my previous post.)

Click once on the icon for ‘BusPhone Present’ to select it as the first variable, then Shift-click on ‘JobTitle Present’ to select it as the second variable. Under the Calc menu at the top of the screen, select Contingency Tables. The displayed result will depend on what default settings are in effect, but it might look like the table below. (Just as with frequency tables, you can play with display options from a menu available by clicking on the little triangle in the top left corner of the window.)

A contingency table gives us counts of how many cases fall into the categories that result from the intersection of two categorical variables. In this example, we have two binary variables, therefore there are four possibilities, arranged in a two-by-two matrix. The two “levels” of ‘BusPhone Present’ are labeled vertically (down the rows), and the two ‘levels’ of ‘Job Title Present’ are labeled horizontally (across the columns).

The labeling of rows and columns in contingency tables can be a bit confusing when the category names are just ones and zeroes, so here’s the same table with more descriptive labels.

If these variables are properly coded, with no missing values, each and every case in our data falls into one of the four cells of the matrix. We see for example that 2,240 cases are ‘0’ for ‘BusPhone Present’ and ‘0’ for ‘JobTitle Present’ – a large number of cases have neither piece of information coded in the database.

Looking at counts of cases in a contingency table can sometimes show cells with unusually many or unusually few cases. In this example, it seems to be fairly common that whenever Data University has a business phone number, it tends to have a job title as well, and it’s also common for both to be missing. It’s much less common for one to be present and the other not. This suggests there’s a relationship.

We’ve made a judgment call, just looking at the numbers. Fortunately the field of statistics provides some tests for making decisions more consistently and easily. If you have some training in stats you will readily grasp how we can apply these tests. For our purposes, we will rely on rules of thumb and simplified explanations rather than delve into a deep understanding of the tests themselves.

Click on the options menu for the contingency table, and add “Expected value” to the table, then click OK. This adds a new set of numbers to the table. The section at the bottom of the table, “table contents,” lists these elements in the order they appear in each cell. The first element is the count of cases that we’ve already seen. The other element is Expected Values.

“Expected values” are the counts of cases that would be expected to result if ‘BusPhone Present’ and ‘JobTltle Present’ were statistically independent of each other. In other words, if the two variables were completely unrelated, the probability that any one case would end up in a particular cell depends only on the probability that the case falls in a specified row and the probability that it falls in a specified column. In other words, we would expect the number of people with a business phone number AND who have a job title in the database to depend solely on the number of each in the sample – one doesn’t depend on the other. By this logic, the probability of a person having a business phone in the database is about the same whether or not they have a job title in the database, if it is true that one is not related to the other.

Have a quick look to compare each pair of numbers. You’ll notice that two actual counts are higher than the expected value, and two counts are lower than the expected value. If the variables were not associated with each other (independent), the actual counts would be a lot closer to the expected counts. It appears that if one variable is a zero or one, the other variable is more likely than not to have the same value – that is, we are likely to have either both or neither. Therefore, the variables do seem to be related.

Being able to compare the expected values to the observed values is helpful in identifying a pattern or relationship. But again, we are still just eyeballing the numbers and making a guess. Let’s take another step forward.

An assertion that two variables are not related is based on a concept in statistics called the null hypothesis. The null hypothesis in this case would state, “There is no relationship between Business Phone Present and Job Title Present.” In a test for independence between Business Phone and Job Title, if there is no relationship, then the null hypothesis is true. The alternative hypothesis is that the variables are in fact related – they are dependent, rather than independent.

The test for determining whether the null hypothesis is true in this case is called the chi-square test for independence. For our purposes, it is not necessary to formally state any kind of hypothesis, nor is it necessary to understand how chi-square is calculated. (Chi is pronounced “kai”.) Data Desk takes care of the calculations, and a rule of thumb for interpreting the result will suffice. (1)

Go back to the options for the contingency table. Deselect “Expected value”, and choose “Chi-square value” instead. Data Desk will then calculate the chi-square statistic. If it’s large, that is enough to reject the null hypothesis. This number is not interpreted on its own. Rather, the important statistic to watch for appears directly underneath it: p. The p-value in the table below is less than or equal to 0.0001 – a very small value.

The lowercase p stands for “probability.” A p-value of 0.0001 is the same as a 0.01% chance of something happening. We ask this question: “If there were no relationship between ‘BusPhone Present’ and “JobTitle Present’, what is the probability that you would get a result for chi-square at least as high as 1,141?” That probability is your p-value.

By convention, you need a p-value of 0.05 (i.e. 5 percent) or less to consider the probability low enough to conclude that there must, in fact, be a relationship between ‘BusPhone Present’ and ‘JobTitle Present’. (In other words, you can reject the null hypothesis.) The probability of obtaining a value of chi-square this large if the variables were independent is extremely low: less than 0.01%. The p-value in this case is very small, therefore the variables are strongly related.

We will encounter p-values again when we do regression. (2)

Given the foregoing discussion, you might be thinking that exploring relationships among variables is a complicated and subtle business. Not really. You don’t have to study expected values or formulate a null hypothesis – I introduced those things to help you understand where chi-square comes from. You only need these steps to test for presence of a relationship:

1. Select the icons of the two variables.
2. Create a contingency table from the Calc menu.
3. In the table options menu, select “Chi-square value.”
4. Regardless of the chi-square value, if the p-value is less than 0.05, the variables are not independent – they are related somehow.

You can create a new contingency table whenever you want to test two new variables, or you can simply drag a new category variable into an existing table. Just drag a new categorical variable icon on top of the name of the variable you want to replace. These names are in bold face text near the top of the table window, and will highlight when touched by the cursor during the drag. You can replace both factors at once by selecting two new variables and dragging them both into the centre of the table.

Before we move on, try dragging other binary variables into the contingency table to quickly see which combinations yield some interesting associations.

A more relevant example

Dragging variables one by one into a contingency table is a useful exploration step while evaluating potential predictors for use in a model. This implies that one of the variables must be the target (outcome) variable.

Let’s say you plan to build a predictive model to identify which alumni are more likely to make a donation at a higher level than usual, and you want to make a preliminary assessment of potential predictors so you can toss out any that don’t look promising, while keeping the others for use in the regression analysis. You can create a target variable using a lifetime giving variable that will allow you to do this analysis using contingency tables.

Working with the Data University file, create a new derived variable called ‘Big donor’ with the expression:

This will create a binary variable that evaluates to ‘1’ for alumni with lifetime hard-credit giving greater than \$999, and ‘0’ for everyone else.

You can use whatever dollar value you like, but \$1,000 and up will work here. According to a frequency table of ‘Big donor’, 602 people have lifetime giving of \$1,000 or more.

Click on the icon for ‘Big donor’ to select it and create a contingency table from the Calc menu, and change the table options to include chi-square. Now the window is ready for testing variables one by one, by dragging variables into the space that is initially labeled “Drag Variable Here.”

Chi-square is great for indicating that two variables are related, but a little more information will help you understand the nature of the relationship. When you drag in “BusPhone Present,” for example, it seems apparent that the relationship is significant, to judge from the chi-square value. One tweak will tell you something more concrete: Go back to the table options, deselect “Count” and select “Percent of column total.”

This replaces cell counts with percent values – the percent breakdown for each column. The circled values in the table above are the ones to pay attention to – they are the cases that have a ‘1’ for ‘Big donor’. Here is what we can read from this:

• The first column of figures includes all cases for which there is no business phone in the database (0). Only 8.39% of people with no business phone are also top donors.
• The second column includes all cases with a business phone (1) – 18.4% of people with a business phone are top donors.

People with a business phone are more than twice as likely to be top donors, and the chi-square value indicates that the relationship is significant. It might be hasty to conclude that ‘BusPhone Present’ is truly a “predictor” of high-level giving, but the association is clearly there in the data.

(These percentages total on the columns. If ‘Big donor’ were your second variable instead of your first, the matrix would be ordered in the other direction, and you would choose “Percent of row total” from the table options instead.)

Try dragging other binary variables from the Data University set into the table, replacing ‘BusPhone Present’, and observe how the percent breakdowns differ from group to group, and whether that difference is significant.

In my book-in-progress, I go on to talk about exploring categorical variables with multiple categories – not just binary variables – but that’s all for today. Again, if you’re interested in knowing more as this project progress, please subscribe to this blog via email from the box in the right sidebar.

End notes

1) If you must know, the value of chi-square is the sum of the squared standardized residuals across all cells in the table. The standardized residual describes the extent to which the observed count differs from the expected count. You can display this statistic in Data Desk along with the others, but there is no need. Another abbreviation in the table is “df”, which stands for degrees of freedom. Probability values of chi-square depend upon the degrees of freedom, which in turn is related to the number of rows and columns in the table.

2) Use and misuse of p-values is a hot topic in statistics, especially in connection with academic publishing, where putting too much stock in p-values and the rather arbitrary 0.05 threshold causes a great deal of angst. As predictive modelers we need to keep these controversies in perspective: We are not testing the effects of new medical treatments nor are we seeking to publish in a scientific journal. We are making use of “good-enough” tools that will produce a useful result. That said, applying the 0.05 threshold without judgment and common sense means that we will detect “relationships” that are not real – potentially about 5% of the time!

## 7 January 2015

### New finds in old models

When you build a predictive model, you can never be sure it’s any good until it’s too late. Deploying a mediocre model isn’t the worst mistake you can make, though. The worst mistake would be to build a second mediocre model because you haven’t learned anything from the failure of the first.

Performance against a holdout data set for validation is not a reliable indicator of actual performance after deployment. Validation may help you decide which of two or more competing models to use, or it may provide reassurance that your one model isn’t total junk. It’s not proof of anything, though. Those lovely predictors, highly correlated with the outcome, could be fooling you. There are no guarantees they’re predictive of results over the year to come.

In the end, the only real evidence of a model’s worth is how it performs on real results. The problem is, those results happen in the future. So what is one to do?

I’ve long been fascinated with Planned Giving likelihood. Making a bequest seems like the ultimate gesture of institutional affinity (ultimate in every sense). On the plus side, that kind of affinity ought to be clearly evidenced in behaviours such as event attendance, giving, volunteering and so on. On the negative side, Planned Giving interest is uncommon enough that comparing expectancies with non-expectancies will sometimes lead to false predictors based on sparse data. For this reason, my goal of building a reliable model for predicting Planned Giving likelihood has been elusive.

Given that a validation data set taken from the same time period as the training data can produce misleading correlations, I wondered whether I could do one better: That is, be able to draw my holdout sample not from data of the same time period as that used to build the model, but from the future.

As it turned out, yes, I could.

Every year I save my regression analyses as Data Desk files. Although I assess the performance of the output scores, I don’t often go back to the model files themselves. However, they’re there as a document of how I approached modelling problems in the past. As a side benefit, each file is also a snapshot of the alumni population at that point in time. These data sets may consist of a hundred or more candidate predictor variables — a well-rounded picture.

My thinking went like this: Every old model file represents data from the past. If I pretend that this snapshot is really the present, then in order to have access to knowledge of the future, all I have to do is look at today’s data stored in the database.

For example, for this blog post, I reached back two years to a model I created in Data Desk for predicting likelihood to upgrade to the Leadership level in Annual Giving. I wasn’t interested in the model itself. Rather, I wanted to examine the underlying variables I had to work with at the time. This model had been an ambitious undertaking, with some 170 variables prepared for analysis. Many of course were transformations of variables or combinations of interacting variables. Among all those variables was one indicating whether a case was a current Planned Giving expectancy or not, at that point in time.

In this snapshot of the database from two years ago, some of the cases that were not expectancies would have become so since then. In other words, I now had the best of both worlds. I had a comprehensive set of potential predictors as they existed two years ago, AND access to the hitherto unknowable future: The identities of the people who had become expectancies after the predictors had been frozen in time.

As I said, my old model was not intended to predict Planned Giving inclination. So I built a new model, using “Is an Expectancy” (0/1) as the target variable. I trained the regression model on the two-year-old expectancy data — I didn’t even look at the new expectancies while building the model. No: I used those new expectancies as my validation data set.

“Validation” might be too strong a word, given that there were only 80 or so new cases. That’s a lot of bequest intentions, for sure, but in terms of data it’s a drop in the bucket compared with the number of cases being scored. Let’s call it a test data set. I used this test set to help me analyze the model, in a couple of ways.

First I looked at how new expectancies were scored by the model I had just built. The chart below shows their distribution by score decile. Slightly more than 50% of new expectancies were in the top decile. This looks pretty good — keeping in mind that this is what actual performance would have looked like had I really built this model two years ago (which I could have):

(Even better, looking at percentiles, most of the expectancies in that top 10% are concentrated nicely in the top few percentiles.)

But I didn’t stop there. It is also evident that almost half of new expectancies fell outside the top 10 percent of scores, so clearly there was room for improvement. My next step was to examine the individual predictors I had used in the model. These were of course the predictors most highly correlated with being an expectancy. They were roughly the following:
• Year person’s personal information in the database was last updated
• Number of events attended
• Age
• Number of alumni activities
• Indicated “likely to donate” on 2009 alumni survey
• Total giving in last five years (log transformed)
• Combined length of name Prefix + Suffix

I ranked the correlation of each of these with the 0/1 indicator meaning “new expectancy,” and found that most of the predictors were still fine, although they changed their order in the rank correlation. Donor likelihood (from survey) and recent giving were more important, and alumni activities and how recently a person’s record was updated were less important.

This was interesting and useful, but what was even more useful was looking at the correlations between ALL potential predictors and the state of being a new expectancy. A number of predictors that would have been too far down the ranked list to consider using two years ago were suddenly looking much better. In particular, many variables related to participation in alumni surveys bubbled closer to the top as potentially significant.

This exercise suggests a way to proceed with iterative, yearly improvements to some of your standard models:
• Dig up an old model from a year or more ago.
• Query the database for new cases that represent the target variable, and merge them with the old datafile.
• Assess how your model performed or, if you created more than one model, see which model would have performed best. (You should be doing this anyway.)
• Go a layer deeper, by studying the variables that went into those models — the data “as it was” — to see which variables had correlations that tricked you into believing they were predictive, and which variables truly held predictive power but may have been overlooked.
• Apply what you learn to the next iteration of the model. Leave out the variables with spurious correlations, and give special consideration to variables that may have been underestimated before.

## 22 September 2014

### What predictor variables should you avoid? Depends on who you ask

People who build predictive models will tell you that there are certain variables you should avoid using as predictors. I am one of those people. However, we disagree on WHICH variables one should avoid, and increasingly this conflicting advice is confusing those trying to learn predictive modeling.

The differences involve two points in particular. Assuming charitable giving is the behaviour we’re modelling for, those two things are:

1. Whether we should use past giving to predict future giving, and
2. Whether attributes such as marital status are really predictors of giving.

I will offer my opinions on both points. Note that they are opinions, not definitive answers.

1. Past giving as a predictor

I have always stressed that if you are trying to predict “giving” using a multiple linear regression model, you must avoid using “giving” as a predictor among your independent variables. That includes anything that is a proxy for “giving,” such as attendance at a donor-thanking event. This is how I’ve been taught and that is what I’ve adhered to in practice.

Examples that violate this practice keep popping up, however. I have an email from Atsuko Umeki, IT Coordinator in the Development Office of the University of Victoria in Victoria, British Columbia*. She poses this question about a post I wrote in July 2013:

“In this post you said, ‘In predictive models, giving and variables related to the activity of giving are usually excluded as variables (if ‘giving’ is what we are trying to predict). Using any aspect of the target variable as an input is bad practice in predictive modelling and is carefully avoided.’  However, in many articles and classes I read and took I was advised or instructed to include past giving history such as RFA*, Average gift, Past 3 or 5 year total giving, last gift etc. Theoretically I understand what you say because past giving is related to the target variable (giving likelihood); therefore, it will be biased. But in practice most practitioners include past giving as variables and especially RFA seems to be a good variable to include.”

(* RFA is a variation of the more familiar RFM score, based on giving history — Recency, Frequency, and Monetary value.)

So modellers-in-training are being told to go ahead and use ‘giving’ to predict ‘giving’, but that’s not all: Certain analytics vendors also routinely include variables based on past giving as predictors of future giving. Not long ago I sat in on a webinar hosted by a consultant, which referenced the work of one well-known analytics vendor (no need to name the vendor here) in which it seemed that giving behaviour was present on both sides of the regression equation. Not surprisingly, this vendor “achieved” a fantastic R-squared value of 86%. (Fantastic as in “like a fantasy,” perhaps?)

This is not as arcane or technical as it sounds. When you use giving to predict giving, you are essentially saying, “The people who will make big gifts in the future are the ones who have made big gifts in the past.” This is actually true! The thing is, you don’t need a predictive model to produce such a prospect list; all you need is a list of your top donors.

Now, this might be reassuring to whomever is paying a vendor big bucks to create the model. That person sees names they recognize, and they think, ah, good — we are not too far off the mark. And if you’re trying to convince your boss of the value of predictive modelling, he or she might like to see the upper ranks filled with familiar names.

I don’t find any of that “reassuring.” I find it a waste of time and effort — a fancy and expensive way to produce a list of the usual suspects.

If you want to know who has given you a lot of money, you make a list of everyone in your database and sort it in descending order by total amount given. If you want to predict who in your database is most likely to give you a lot of money in the future, build a predictive model using predictors that are associated with having given large amounts of money. Here is the key point … if you include “predictors” that mean the same thing as “has given a lot of money,” then the result of your model is not going to look like a list of future givers — it’s going to look more like your historical list of past givers.

Does that mean you should ignore giving history? No! Ideally you’d like to identify the donors who have made four-figure gifts who really have the capacity and affinity to make six-figure gifts. You won’t find them using past giving as a predictor, because your model will be blinded by the stars. The variables that represent giving history will cause all other affinity-related variables to pale in comparison. Many will be rejected from the model for being not significant or for adding nothing additional to the model’s ability to explain the variance in the outcome variable.

To sum up, here are the two big problems with using past giving to predict future giving:

1. The resulting insights are sensible but not very interesting: People who gave before tend to give again. Or, stated another way: “Donors will be donors.” Fundraisers don’t need data scientists to tell them that.
2. Giving-related independent variables will be so highly correlated with giving-related dependent variables that they will eclipse more subtle affinity-related variables. Weaker predictors will end up getting kicked out of our regression analysis because they can’t move the needle on R-squared, or because they don’t register as significant. Yet, it’s these weaker variables that we need in order to identify new prospects.

Let’s try a thought experiment. What if I told you that I had a secret predictor that, once introduced into a regression analysis, could explain 100% of the variance in the dependent variable ‘Lifetime Giving’? That’s right — the highest value for R-squared possible, all with a single predictor. Would you pay me a lot of money for that? What is this magic variable that perfectly models the variance in ‘Lifetime Giving’? Why, it is none other than ‘Lifetime Giving’ itself! Any variable is perfectly correlated with itself, so why look any farther?

This is an extreme example. In a real predictive model, a predictor based on giving history would be restricted to giving from the past, while the outcome variable would be calculated from a more recent period — the last year or whatever. There should be no overlap. R-squared would not be 100%, but it would be very high.

The R-squared statistic is useful for guiding you as you add variables to a regression analysis, or for comparing similar models in terms of fit with the data. It is not terribly useful for deciding whether any one model is good or bad. A model with an R-squared of 15% may be highly valuable, while one with R-squared of 75% may be garbage. If a vendor is trying to sell you on a model they built based on a high R-squared alone, they are misleading you.

The goal of predictive modeling for major gifts is not to maximize R-squared. It’s to identify new prospects.

2. Using “attributes” as predictors

Another thing about that webinar bugged me. The same vendor advised us to “select variables with caution, avoiding ‘descriptors’ and focusing on potential predictors.” Specifically, we were warned that a marital status of ‘married’ will emerge as correlated with giving. Don’t be fooled! That’s not a predictor, they said.

So let me get this straight. We carry out an analysis that reveals that married people are more likely to give large gifts, that donors with more than one degree are more likely to give large gifts, that donors who have email addresses and business phone numbers in the database are more likely to give large gifts … but we are supposed to ignore all that?

The problem might not be the use of “descriptors,” the problem might be with the terminology. Maybe we need to stop using the word “predictor”. One experienced practitioner, Alexander Oftelie, briefly touched on this nuance in a recent blog post. I quote, (emphasis added by me):

“Data that on its own may seem unimportant — the channel someone donates, declining to receive the mug or calendar, preferring email to direct mail, or making ‘white mail’ or unsolicited gifts beyond their sustaining-gift donation — can be very powerful when they are brought together to paint a picture of engagement and interaction. Knowing who someone is isn’t by itself predictive (at best it may be correlated). Knowing how constituents choose to engage or not engage with your organization are the most powerful ingredients we have, and its already in our own garden.”

I don’t intend to critique Alexander’s post, which isn’t even on this particular topic. (It’s a good one — please read it.) But since he’s written this, permit me scratch my head about it a bit.

In fact, I think I agree with him that there is a distinction between a behaviour and a descriptor/attribute. A behaviour, an action taken at a specific point in time (eg., attending an event), can be classified as a predictor. An attribute (“who someone is,” eg., whether they are married or single) is better described as a correlate. I would also be willing to bet that if we carefully compared behavioural variables to attribute variables, the behaviours would outperform, as Alexander says.

In practice, however, we don’t need to make that distinction. If we are using regression to build our models, we are concerned solely and completely with correlation. To say “at best it may be correlated” suggests that predictive modellers have something better at their disposal that they should be using instead of correlation. What is it? I don’t know, and Alexander doesn’t say.

If in a given data set, we can demonstrate that being married is associated with likelihood to make a donation, then it only makes sense to use that variable in our model. Choosing to exclude it based on our assumption that it’s an attribute and not a behaviour doesn’t make business sense. We are looking for practical results, after all, not chasing some notion of purity. And let’s not fool ourselves, or clients, that we are getting down to causation. We aren’t.

Consider that at least some “attributes” can be stated in terms of a behaviour. People get married — that’s a behaviour, although not related to our institution. People get married and also tell us about it (or allow it to be public knowledge so that we can record it) — that’s also a behaviour, and potentially an interaction with us. And on the other side of the coin, behaviours or interactions can be stated as attributes — a person can be an event attendee, a donor, a taker of surveys.

If my analysis informs me that widowed female alumni over the age of 60 are extremely good candidates for a conversation about Planned Giving, then are you really going to tell me I’m wrong to act on that information, just because sex, age and being widowed are not “behaviours” that a person voluntarily carries out? Mmmm — sorry!

Call it quibbling over semantics if you like, but don’t assume it’s so easy to draw a circle around true predictors. There is only one way to surface predictors, which is to take a snapshot of all potentially relevant variables at a point in time, then gather data on the outcome you wish to predict (eg., giving) after that point in time, and then assess each variable in terms of the strength of association with that outcome. The tools we use to make that assessment are nothing other than correlation and significance. Again, if there are other tools in common usage, then I don’t know about them.

Caveats and concessions

I don’t maintain that this or that practice is “wrong” in all cases, nor do I insist on rules that apply universally. There’s a lot of art in this science, after all.

Using giving history as a predictor:

• One may use some aspects of giving to predict outcomes that are not precisely the same as ‘Giving’, for example, likelihood to enter into a Planned Giving arrangement. The required degree of difference between predictors and outcome is a matter of judgement. I usually err on the side of scrupulously avoiding ANY leakage of the outcome side of the equation into the predictor side — but sure, rules can be bent.
• I’ve explored the use of very early giving (the existence and size of gifts made by donors before age 30) to predict significant giving late in life. (See Mine your donor data with this baseball-inspired analysis.) But even then, I don’t use that as a variable in a model; it’s more of a flag used to help select prospects, in addition to modeling.

Using descriptors/attributes as predictors:

• Some variables of this sort will appear to have subtly predictive effects in-model, effects that disappear when the model is deployed and new data starts coming in. That’s regrettable, but it’s something you can learn from — not a reason to toss all such variables into the trash, untested. The association between marital status and giving might be just a spurious correlation — or it might not be.
• Business knowledge mixed with common sense will help keep you out of trouble. A bit of reflection should lead you to consider using ‘Married’ or ‘Number of Degrees’, while ignoring ‘Birth Month’ or ‘Eye Colour’. (Or astrological sign!)

There are many approaches one can take with predictive modeling, and naturally one may feel that one’s chosen method is “best”. The only sure way to proceed is to take the time to define exactly what you want to predict, try more than one approach, and then evaluate the performance of the scores when you have actual results available — which could be a year after deployment. We can listen to what experts are telling us, but it’s more important to listen to what the data is telling us.

//////////

Note: When I originally posted this, I referred to Atsuko Umeki as “he”. I apologize for this careless error and for whatever erroneous assumption that must have prompted it.

## 16 January 2012

Filed under: Correlation, Predictor variables, skeptics — Tags: , , , , — kevinmacdonell @ 1:03 pm

Some of the best predictors in my models are related to the presence or absence of phone numbers and addresses. For example, the presence of a business phone is usually a highly significant predictor of giving. As well, a count of either phone or address updates present in the database is also highly correlated with giving.

Some people have difficulty accepting this as useful information. The most common objection I hear is that such updates can easily come from research and data appends, and are therefore not signals of affinity at all. And that would be true: Any data that exists solely because you bought it or looked it up doesn’t tell you how someone feels about your institution. (Aside from the fact that you had to go looking for them in the first place — which I’ve observed is negatively correlated with giving.)

Sometimes this objection comes from someone who is just learning data mining. Then I know I’m dealing with someone who’s perceptive. They obviously get it, to some degree — they understand there’s potentially a problem.

I’m less impressed when I hear it from knowledgeable people, who say they avoid contact information in their variable selection altogether. I think that’s a shame, and a signal that they aren’t willing to put in the work to a) understand the data they’re working with, or b) take steps to counteract the perceived taint in the data.

If you took the trouble to understand your data (and why wouldn’t you), you’d find out soon enough if the variables are useable:

• If the majority of phone numbers or business addresses or what-have-you are present in the database only because they came off donors’ cheques, then you’re right in not using it to predict giving. It’s not independent of giving and will harm your model. The telltale sign might be a correlation with the target variable that exceeds correlations for all your other variables.
• If the information could have come to you any number of ways (with gift transactions being only one of them), then use with caution. That is, be alert if the correlation looks too good to be true. This is the most likely scenario, which I will discuss in detail shortly.
• If the information could only have come from data appends or research, then you’ve got nothing much to worry about: The correlation with giving will be so weak that the variable probably won’t make it into your model at all. Or it may be a negative predictor, highlighting the people who allowed themselves to become lost in the first place. An exception to the “don’t worry” policy would be if research is conducted mainly to find past donors who have become lost — then there might be a strong correlation that will lead you astray.

An in-house predictive modeler will simply know what the case is, or will take the trouble to find out. A vendor hired to do the work may or may not bother — I don’t know. As far as my own models are concerned, I know that addresses and phone numbers come to us via a mix of voluntary and involuntary means: Via Phonathon, forms on the website, records research, and so on.

I’ve found that a simple count of all historical address updates for each alum is positively correlated with giving. But a line plot of the relationship between number of address updates and average lifetime giving suggests there’s more going on under the surface. Average lifetime giving goes up sharply for the first half-dozen or so updates, and then falls away just as sharply. This might indicate a couple of opposing forces: Alumni who keep us informed of their locations are more likely to be donors, but alumni who are perpetually lost and need to be found via research are less likely to be donors.

If you’re lucky, your database not only has a field in which to record the source of updates, but your records office is making good use of it. Our database happens to have almost 40 different codes for the source, applied to some 300,000 changes of address and/or phone number. Not surprisingly, some of these are not in regular use — some account for fewer than one-tenth of one percent of updates, and will have no significance in a model on their own.

For the most common source types, though, an analysis of their association with giving is very interesting. Some codes are positively correlated with giving, some negatively. In most cases, a variable is positive or negative depending on whether the update was triggered by the alum (positive), or by the institution (negative). On the other hand, address updates that come to us via Phonathon are negatively correlated with giving, possibly because by-mail donors tend not to need a phone call — if ‘giving’ were restricted to phone solicitation only, perhaps the association might flip toward the positive. Other variables that I thought should be positive were actually flat. But it’s all interesting stuff.

For every source code, a line plot of average LT giving and number of updates is useful, because the relationship is rarely linear. The relationship might be positive up to a point, then drop off sharply, or maybe the reverse will be true. Knowing this will suggest ways to re-express the variable. I’ve found that alumni who have a single update based on the National Change of Address database have given more than alumni who have no NCOA updates. However, average giving plummets for every additional NCOA update. If we have to keep going out there to find you, it probably means you don’t want to be found!

Classifying contact updates by source is more work, of course, and it won’t always pay off. But it’s worth exploring if your goal is to produce better, more accurate models.

## 24 March 2011

### Does your astrological sign predict whether you’ll give?

Filed under: Alumni, Correlation, Peter Wylie, Predictor variables, Statistics — Tags: , , — kevinmacdonell @ 7:20 am

Last weekend, with so many other pressing things I could have been doing, I got it in my head to analyze people’s astrological signs for potential association with propensity to give. I don’t know what came over me; perhaps it was the Supermoon. But when you’ve got a data set in front of you that contains giving history and good birth dates for nearly 85,000 alumni, why not?

Let me say first that I put no stock in astrology, but I know a few people who think being a Libra or a Gemini makes some sort of difference. I imagine there are many more who are into Chinese astrology, who think the same about being a Rat or a Monkey. And even I have to admit that an irrational aspect of me embraces my Taurus/Rooster nature.

If one’s sign implies anything about personality or fortune, I should think it would be reflected in one’s generosity. Ever in pursuit of the truth, I spent a rather tedious hour parsing 85,000 birth dates into the signs of the zodiac and the animal signs of Chinese astrology. As you will see, there are in fact some interesting patterns associated with birth date, on the surface at least.

Because human beings mate at any time of year, the alumni in the sample are roughly equally distributed among the 12 signs of the zodiac. There seem to be slightly more births in the warmer months than in the period of December to February: Cancer (June 21 to July 22) represents 8.9% of the sample while at the lower end, Capricorn (Dec 22 to Jan 20) represents 7.6% of the sample — a spread of less than two percentage points.

What we want to know is if any one sign is particularly likely to give to alma mater. I coded anyone who had any giving in their lifetime as ‘1’ and all never-donors as ‘0’. At the high end, Taurus natives have a donor rate of 30.7% and at the low end, Aries natives have a donor rate of 29.0%. All the other signs fall between those two rates, a range of a little more than one and a half percentage points.

That’s a very narrow range of variance. If I were seriously evaluating the variable ‘Astrological sign’ as a predictor, I would probably stop right there, seeing nothing exciting enough to make me continue.

But have a look at this bar chart. I’ve arranged the signs in their calendar order, which immediately suggests that there’s a pattern in the data: A peak at Taurus, gradually falling to Scorpio, peaking again at Sagittarius, then falling again until Taurus comes around once more.

The problem with the bar chart is that the differences in giving rates are exaggerated visually, because the range of variance is so limited. What appears to be a pattern may be nothing of the sort.

In fact, the next chart tells a conflicting tale. The Tauruses may have the highest participation rate, but among donors they and three other signs have the lowest median level of lifetime giving (\$150), and Aries have the highest median (\$172.50). The calendar-order effect we saw above has vanished.

These two charts fail to tell the same tale, which indicates to me that although we may observe some variance in giving between astrological signs, the variance might well be due to mere chance. Is there a way to demonstrate this statistically? I was discussing this recently with Peter Wylie, who helped me sort this out. Peter told me that the supposed pattern in the first chart reminded him of the opening of Malcolm Gladwell’s book, Outliers, in which the author examines why a hugely disproportionate number of professional hockey and soccer players are born in January, February and March. (I won’t go farther than that — read the book for that discussion.)

In the case of professional hockey players, birth date and a player’s development (and career progress) are definitely associated. It’s not due to a random effect. In the case of birth date and giving, however, there is room for doubt. Peter took me through the use of chi-square, a statistic I hadn’t encountered since high school. I’m not going into detail about chi-square — there is plenty out there online to read — but briefly, chi-square is used to determine if a distribution of observed frequencies of a value for a categorical or ordinal variable differs from the theoretical expected frequencies for that variable, and from there, if the discrepancy is statistically significant.

Figuring out the statistical significance part used to involve looking up the calculated value for chi-square in a table based on something called degrees of freedom, but nowadays your stats software will automatically provide you with a statistic telling you whether the result is significant or not: the p statistic, which will be familiar to you if you’ve used linear regression. The rule of thumb for significance is a p-value of 0.05 or less.

As it turns out, the observed differences in the frequency of donors for each astrological sign has a significance value of p = 0.3715. This is way above the 0.05 confidence level, and therefore we cannot rule out the possibility that these variations are due to mere chance. So astrology is a bust for fundraisers.

Now for something completely different. We haven’t looked at the Chinese animal signs yet. Here is a table showing a breakdown by Chinese astrological sign by the percentage of alumni with at least some giving, and median lifetime giving. The table is sorted by donor participation rate, lowest to highest.

Hmm, it would seem that being a Horse is associated with a higher level of generosity than the norm. And here’s the biggest surprise: A Chi-square test reveals the differences in donor frequencies between animals to be significant! (p-value < 0.0001).

What’s going on here? Shall we conclude that the Chinese astrologers have it all figured out?

Let’s go back to the data. First of all, how were alumni assigned an animal sign in the first place? You may be familiar with the paper placemats in Chinese restaurants that list birth years and their corresponding animal signs. Anyone born in the years 1900, 1912, 1924, 1936, 1948, 1960, 1972, 1984, 1996 or 2008 is a Rat. Anyone born in 1901, 1913, 1925, etc. etc. is an Ox, and so on, until all the years are accounted for. Because the alumni in each animal category are drawn from birth years with an equal span of years between them, we might assume that each sign has roughly the same average age. This is key, because if the signs differ on average age, then age might be an underlying cause of variations in giving.

My data set does not include anyone born before 1930, and goes up to 1993 (a single precocious alum who graduated at a very young age). Tigers, with the lowest participation rate, are drawn from the birth years 1938, 1950, 1962, 1974 and 1986. Horses, with the highest participation rate, are drawn from the birth years 1930, 1942, 1954, 1966 and 1978, plus only a handful of young alumni from 1990. For Tigers, 77% were born in 1974 or earlier, but for Horses, 99% of alumni were born in 1978 or earlier.

The bottom line is that the Horses in my data set are older than the Tigers, as a group. The Horses have a median age of 45, while the Tigers have a median age of 37. And we all know by now that older alumni are more likely to be donors.

Again, my conversation with Peter Wylie helped me figure this out statistically. The short answer is: After you’ve accounted for the age of alumni, variations in giving by animal sign are no longer significant.

(The longer answer is: If you perform a linear regression of Age on Lifetime Giving (log-transformed) and compute residuals, then run an Analysis of Variance (ANOVA) for the residuals and Animal Sign, the variance is NOT significant, p = 0.1118. The residuals can be thought of as Lifetime Giving with the explanatory effect of Age “washed out,” leaving only the unexplained variance. Animal Sign fails to account for any significant amount of the remaining variance in LT Giving, which is an indication that Animal Sign is just a proxy for Age.)

Does any of this matter? Mostly no. First of all, a little common sense can keep you out of trouble. Sure, some significant predictors will be non-intuitive, but it doesn’t hurt to be skeptical. Second, if you do happen to prepare some predictors based on astrological sign, their non-significance will be evident as soon as you add them to your regression analysis, particularly if you’ve already added Age or Class Year as a predictor in the case of the Chinese signs. Altogether, then, the risk that your models will be harmed by such meaningless variables is very low.

## 14 March 2011

### Correlation and you

Filed under: Correlation, Predictor variables, regression, Statistics — Tags: , , — kevinmacdonell @ 7:25 am

If you read books and blogs on statistics, eventually your understanding of even the most basic concepts will start to smear. Things we think ought to be well-established by now are matters of controversy. Ranking high on the list of most slippery concepts is correlation. In today’s post, I’m going to make the concept very complicated, and then I’m going to dismiss the complexity and make it all simple again. I’m telling you this in advance so I don’t lose you partway through.

Correlation is the foundation of predictive modeling. The degree to which one variable x, changes its value in relation to a second variable y, either positively or negatively, is the very definition of x‘s usefulness in predicting the value of y. The tool I use to quantify the strength of that relationship is Pearson Product-Moment Correlation, or Pearson’s r, which some statistics texts simply call “correlation”. (It would help if you read the blog post on Pearson before reading this one.)

Before we go any farther, I need to explain why I want to quantify the strength of relationships. At the start of any modeling project, I have a hundred or so potential predictor variables in my data file. Some are going to be excellent predictors, most will be only so-so, and others will have little or no association with the outcome variable. I want to introduce variables into the regression analysis in an order that makes sense, so that the best predictors are added first. Due to the complexity of interactions among variables, there is no telling in advance what will actually happen as variables are added, so any list of variables ordered by strength of correlation is merely a rough guide.

What I DON’T use Pearson’s r for much is exploring variables. At the exploration stage, before modeling begins, I will look at a variable a number of ways in order to get a sense of how valuable the variable will be, and how I might want to transform it or re-express it to make it better. I will look at how the variable is distributed, and compare average and median giving between groups (for example, Home Phone Present, Y/N). These and other techniques, as described in Peter Wylie’s book “Data Mining for Fundraisers,” are simpler, more direct, and often more helpful than abstract measures of correlation.

It’s only after I’ve done the exploration work and tweaked the variables for maximum effect that I’m ready to rank them in order by their correlation values. So, with that out of the way, let’s look at Pearson’s r in more detail.

The textbooks make it abundantly clear: Pearson’s r quantifies the relationship between two continuous variables that are linearly related. Right away, we’ve got a problem: Very few of the variables I work with are continuous, and most of the relationships I see do not meet the definition of “linear”. Yet, I use Pearson’s r exclusively. Does this mean I’ve been misusing and abusing the method?

I don’t see it that way, but it is interesting to read how these things are discussed in the literature.

You can assess whether a relationship between two variables is roughly linear by looking at a scatterplot of the variables. Below is a scatterplot of ‘Giving’ (log-transformed) and ‘Age’, created in Data Desk. It’s a big, messy cloud of points (some 80,000 of them!). A lot of the relationship is hidden by overplotting (the overlapping of points) along the bottom line — that row of points at the zero giving mark represents non-donors, and there are many more of them near the young end of that line than there are near the older end. Still, at least you can see a vague linear relationship in the upward fanning of the data: As age increases, so does lifetime giving. A best-fit line through the data would slope upward from left to right, and therefore the Pearson correlation value is high.

We’ve got two continuous variables, and a linear relationship, so we get the Statistician’s Seal of Approval: It’s okay to use Pearson’s r to measure strength of correlation between these two variables.

Unfortunately, as I said, most of the variables we use in predictive modeling are not continuous, and they don’t look like much of anything in a scatterplot. Here’s a scatterplot of a Likert-scale survey response. The survey question asked alumni how likely they are to donate to alma mater, and the scale runs from 1 to 5, with 5 being “very likely.” This is not a continuous variable, because there are no possible intermediate values among the five levels. It’s ordinal. The plot with Lifetime Giving is difficult to interpret, but it sure isn’t linear:

Yes, the line of points for the highest response, 5, does extend higher than any other line in terms of lifetime giving. But due to overplotting, there is no way to tell how many nondonors lurk in the single dots that appear at the foot of every line and which indicate zero dollars given. This in no way resembles a cloud of points through which one can imagine a best-fit line being drawn. I can TELL you that a positive response to this question is strongly associated with high levels of lifetime giving, and it is, but you could be forgiven for remaining unconvinced by this “evidence”.

Even worse are the most common predictor variables: indicator variables, in binary form (0/1). For example, let’s say I express the condition of being ‘Married’ as a binary variable. A scatterplot of ‘Giving’ and ‘Married’ is even less useful than the one for the survey question:

Yuck! We’ve got 80,000 data points all jammed up in two solid lines at zero (not married) and 1 (married). We can’t tell from this plot, but it just so happens that the not-married line contains a lot of points sitting at lifetime giving of zero, far more than the married line. Being married is associated with giving, but who could tell from this? There’s no way this relationship is linear.

One of the tests for the appropriateness of using Pearson’s r is whether a scatterplot of the variable looks like a “straight enough” line. Another test is that both variables are quantitative and continuous — not categorical (or ordinal) and discrete. The fact that I can take a categorical variable (Married) and re-express it as a number (0/1) makes no difference. Turning it into a number makes it possible to calculate Pearson’s r, but that doesn’t make it okay to do so.

So the textbooks tell us. What else do they have to say? Read on.

There are methods other than Pearson’s r which we can use to measure the degree of association between two variables, which do not require the presence of a linear relationship. One of these is Spearman’s Rho, which is the correlation between the ranks of two variables. Rho replaces the data values themselves with their ranks within each variable — so the lowest value in each variable becomes ‘1’, the next lowest becomes ‘2’, and so on — and then calculates to what degree those values are related between the two variables, either negatively or positively.

Spearman’s Rho is sometimes called Spearman Rank Correlation, which muddies the waters a bit as it implies that it’s a measure of correlation. It is, but the correlation is between the ranks, not the data values themselves. The bottom line is that a statistician would tell us that Spearman is the appropriate calculation for putting a value to the strength of association between two variables that are not linearly related but which may show a consistently increasing or decreasing trend. Unlike Pearson’s r, it is a nonparametric method — it is free of any requirement that the distribution of the variables look a certain way.

Spearman’s Rho has some special properties which I won’t get into, but overall it looks a lot like Pearson’s r. It can take on values between -1 and 1 (a perfect negative relationship to a perfect positive relationship, both called monotone relationships); values near zero indicate the absence of a relationship — just like Pearson’s r. And your stats software makes it equally easy to compute.

So, great. We’ve got Spearman’s Rho, which we are told is just the thing for analyzing the Likert scale variable I showed you earlier. What about the indicator variable (for ‘Married’)? Well, no, we’re told: You can’t use Pearson or Spearman’s to calculate correlation for categorical variables. For that, you need the point-biserial correlation coefficient.

Huh?

That’s right, another measure of correlation. In fact, there are many types of measures out there. I’ve got a sheet in front of me right now that lists more than eight different measures, the choice of which depends on the combination of variables you’re analyzing (two continuous variables, one continuous and one ordinal variable, two binary variables, one binary and one ordinal, etc. etc.). And that list is not exhaustive.

Another wrinkle is that some texts don’t call these relationships “correlation” at all. By a strict definition, if a relationship between variables it isn’t appropriate for Pearson’s r, then it ain’t correlation. We are supposed to call it by the more vague term “association.”

Hey, I’m cool with that. I’ll call it whatever you want. But what are the practical implications of all this?

As near as I can tell, ZERO.

Uh-huh, I’ve just made you read more than a thousand words on how complicated correlation is. Now I’m going to dismiss all of that with an imperial wave of my hand. I need to you ignore everything I’ve just said, for two reasons which I will elaborate on in a moment:

1. As I stated at the outset, the only reason I calculate correlations is to explore which variables are most likely to figure prominently in a regression analysis. For a rough ranking of variables, Pearson’s r is a “good enough” tool.
2. The correlation r is the basis of linear regression, which is our end-goal. Not Spearman, not point-biserial, nor any other measure.

Regarding point number one: We are not concerned about the precise value of the calculated association, only the approximate ranking of our variables. All we want to know is: which variables are probably most valuable for our model and should be added to the regression first? As it turns out, rankings using other measures of correlation (sorry — association) hardly vary from a Pearson’s r ranking. It’s extra bother for nothing.

And to point number two: There’s a real disconnect between what the textbooks say about correlation analysis and what they say about regression. It seems to me that if Pearson’s r is inappropriate for all but continuous, linearly-related variables, then we would also be told that only continuous, linearly-related variables can be used in regression. That’s not the case: Social-science researchers and modelers toss ordinal and binary variables into regressions with wild abandon. If we didn’t we’d have almost nothing left to work with.

The disconnect is bridged with an explanation I found in one university stats textbook. It’s touched on only briefly, and towards the end of the book. The gist is this: For 0/1 indicator variables added to a linear regression, the coefficient of correlation is not the slope of a line, as we are always told to understand it. The indicator acts to vertically shift the line, so that instead of one regression slope, we have two: The unshifted line if the indicator variable is equal to zero, and another line shifted vertically up or down (depending on the sign of the coefficient), if the value is 1. This seems like essential information, but hardly rates any discussion at all. That’s stats for you.

To summarize:

1. Don’t be sidetracked by warnings regarding measures of association/correlation that have specific uses but do not relate to your end goal: Building a regression model for the pragmatic purpose of making predictions.
2. Most of the time, scatterplots can’t tell you what you need to know, because most of our data is categorical.
3. Indicator variables (and ordinal variables) are materially different from the pretty, linearly-related variables, and they are absolutely OK for use in regression.
4. SAY “association”, but DO “correlation”.
5. Don’t feel bad if you’re having difficulty learning predictive modeling from a stats textbook. I can’t see how anyone could.
Older Posts »