# CoolData blog

## 3 September 2015

### Testing associations between two categorical variables

Over the coming months I will be adapting various bits from a work in progress, a new book on building predictive models using regression, aimed at people working in nonprofit and higher education advancement.

(If you’re interested, I suggest subscribing via email — see the box to the right — to have the inside track on this project.)

In my previous post, I wrote about “Exploring associations between variables.” Today I go deeper to describe exactly how to test for an association or relationship between two variables that are categorical.

These directions are specific to the statistics software Data Desk, but apply equally to any stats package. I make frequent reference to contingency tables, which you may know better as cross-tabs or R x C tables (row by column tables).

As well, I mention a data file from “Data University.” This is a file of anonymized sample data provided to me by Peter Wylie and John Sammis, which readers of my book will be able to download and use to learn the concepts.

In your data file for Data University, you’ve got a variable for “Business phone present” and “Job title present”. You’d like to know if these two factors are related to each other. If they are completely unrelated, they are said to be independent of each other. We can analyze this in Data Desk using contingency tables. (If you don’t know why we care about dependence or independence among variables, read my previous post.)

Click once on the icon for ‘BusPhone Present’ to select it as the first variable, then Shift-click on ‘JobTitle Present’ to select it as the second variable. Under the Calc menu at the top of the screen, select Contingency Tables. The displayed result will depend on what default settings are in effect, but it might look like the table below. (Just as with frequency tables, you can play with display options from a menu available by clicking on the little triangle in the top left corner of the window.)

A contingency table gives us counts of how many cases fall into the categories that result from the intersection of two categorical variables. In this example, we have two binary variables, therefore there are four possibilities, arranged in a two-by-two matrix. The two “levels” of ‘BusPhone Present’ are labeled vertically (down the rows), and the two ‘levels’ of ‘Job Title Present’ are labeled horizontally (across the columns).

The labeling of rows and columns in contingency tables can be a bit confusing when the category names are just ones and zeroes, so here’s the same table with more descriptive labels.

If these variables are properly coded, with no missing values, each and every case in our data falls into one of the four cells of the matrix. We see for example that 2,240 cases are ‘0’ for ‘BusPhone Present’ and ‘0’ for ‘JobTitle Present’ – a large number of cases have neither piece of information coded in the database.

Looking at counts of cases in a contingency table can sometimes show cells with unusually many or unusually few cases. In this example, it seems to be fairly common that whenever Data University has a business phone number, it tends to have a job title as well, and it’s also common for both to be missing. It’s much less common for one to be present and the other not. This suggests there’s a relationship.

We’ve made a judgment call, just looking at the numbers. Fortunately the field of statistics provides some tests for making decisions more consistently and easily. If you have some training in stats you will readily grasp how we can apply these tests. For our purposes, we will rely on rules of thumb and simplified explanations rather than delve into a deep understanding of the tests themselves.

Click on the options menu for the contingency table, and add “Expected value” to the table, then click OK. This adds a new set of numbers to the table. The section at the bottom of the table, “table contents,” lists these elements in the order they appear in each cell. The first element is the count of cases that we’ve already seen. The other element is Expected Values.

“Expected values” are the counts of cases that would be expected to result if ‘BusPhone Present’ and ‘JobTltle Present’ were statistically independent of each other. In other words, if the two variables were completely unrelated, the probability that any one case would end up in a particular cell depends only on the probability that the case falls in a specified row and the probability that it falls in a specified column. In other words, we would expect the number of people with a business phone number AND who have a job title in the database to depend solely on the number of each in the sample – one doesn’t depend on the other. By this logic, the probability of a person having a business phone in the database is about the same whether or not they have a job title in the database, if it is true that one is not related to the other.

Have a quick look to compare each pair of numbers. You’ll notice that two actual counts are higher than the expected value, and two counts are lower than the expected value. If the variables were not associated with each other (independent), the actual counts would be a lot closer to the expected counts. It appears that if one variable is a zero or one, the other variable is more likely than not to have the same value – that is, we are likely to have either both or neither. Therefore, the variables do seem to be related.

Being able to compare the expected values to the observed values is helpful in identifying a pattern or relationship. But again, we are still just eyeballing the numbers and making a guess. Let’s take another step forward.

An assertion that two variables are not related is based on a concept in statistics called the null hypothesis. The null hypothesis in this case would state, “There is no relationship between Business Phone Present and Job Title Present.” In a test for independence between Business Phone and Job Title, if there is no relationship, then the null hypothesis is true. The alternative hypothesis is that the variables are in fact related – they are dependent, rather than independent.

The test for determining whether the null hypothesis is true in this case is called the chi-square test for independence. For our purposes, it is not necessary to formally state any kind of hypothesis, nor is it necessary to understand how chi-square is calculated. (Chi is pronounced “kai”.) Data Desk takes care of the calculations, and a rule of thumb for interpreting the result will suffice. (1)

Go back to the options for the contingency table. Deselect “Expected value”, and choose “Chi-square value” instead. Data Desk will then calculate the chi-square statistic. If it’s large, that is enough to reject the null hypothesis. This number is not interpreted on its own. Rather, the important statistic to watch for appears directly underneath it: p. The p-value in the table below is less than or equal to 0.0001 – a very small value.

The lowercase p stands for “probability.” A p-value of 0.0001 is the same as a 0.01% chance of something happening. We ask this question: “If there were no relationship between ‘BusPhone Present’ and “JobTitle Present’, what is the probability that you would get a result for chi-square at least as high as 1,141?” That probability is your p-value.

By convention, you need a p-value of 0.05 (i.e. 5 percent) or less to consider the probability low enough to conclude that there must, in fact, be a relationship between ‘BusPhone Present’ and ‘JobTitle Present’. (In other words, you can reject the null hypothesis.) The probability of obtaining a value of chi-square this large if the variables were independent is extremely low: less than 0.01%. The p-value in this case is very small, therefore the variables are strongly related.

We will encounter p-values again when we do regression. (2)

Given the foregoing discussion, you might be thinking that exploring relationships among variables is a complicated and subtle business. Not really. You don’t have to study expected values or formulate a null hypothesis – I introduced those things to help you understand where chi-square comes from. You only need these steps to test for presence of a relationship:

1. Select the icons of the two variables.
2. Create a contingency table from the Calc menu.
3. In the table options menu, select “Chi-square value.”
4. Regardless of the chi-square value, if the p-value is less than 0.05, the variables are not independent – they are related somehow.

You can create a new contingency table whenever you want to test two new variables, or you can simply drag a new category variable into an existing table. Just drag a new categorical variable icon on top of the name of the variable you want to replace. These names are in bold face text near the top of the table window, and will highlight when touched by the cursor during the drag. You can replace both factors at once by selecting two new variables and dragging them both into the centre of the table.

Before we move on, try dragging other binary variables into the contingency table to quickly see which combinations yield some interesting associations.

A more relevant example

Dragging variables one by one into a contingency table is a useful exploration step while evaluating potential predictors for use in a model. This implies that one of the variables must be the target (outcome) variable.

Let’s say you plan to build a predictive model to identify which alumni are more likely to make a donation at a higher level than usual, and you want to make a preliminary assessment of potential predictors so you can toss out any that don’t look promising, while keeping the others for use in the regression analysis. You can create a target variable using a lifetime giving variable that will allow you to do this analysis using contingency tables.

Working with the Data University file, create a new derived variable called ‘Big donor’ with the expression:

‘Lifetime HC’ > 999

This will create a binary variable that evaluates to ‘1’ for alumni with lifetime hard-credit giving greater than \$999, and ‘0’ for everyone else.

You can use whatever dollar value you like, but \$1,000 and up will work here. According to a frequency table of ‘Big donor’, 602 people have lifetime giving of \$1,000 or more.

Click on the icon for ‘Big donor’ to select it and create a contingency table from the Calc menu, and change the table options to include chi-square. Now the window is ready for testing variables one by one, by dragging variables into the space that is initially labeled “Drag Variable Here.”

Chi-square is great for indicating that two variables are related, but a little more information will help you understand the nature of the relationship. When you drag in “BusPhone Present,” for example, it seems apparent that the relationship is significant, to judge from the chi-square value. One tweak will tell you something more concrete: Go back to the table options, deselect “Count” and select “Percent of column total.”

This replaces cell counts with percent values – the percent breakdown for each column. The circled values in the table above are the ones to pay attention to – they are the cases that have a ‘1’ for ‘Big donor’. Here is what we can read from this:

• The first column of figures includes all cases for which there is no business phone in the database (0). Only 8.39% of people with no business phone are also top donors.
• The second column includes all cases with a business phone (1) – 18.4% of people with a business phone are top donors.

People with a business phone are more than twice as likely to be top donors, and the chi-square value indicates that the relationship is significant. It might be hasty to conclude that ‘BusPhone Present’ is truly a “predictor” of high-level giving, but the association is clearly there in the data.

(These percentages total on the columns. If ‘Big donor’ were your second variable instead of your first, the matrix would be ordered in the other direction, and you would choose “Percent of row total” from the table options instead.)

Try dragging other binary variables from the Data University set into the table, replacing ‘BusPhone Present’, and observe how the percent breakdowns differ from group to group, and whether that difference is significant.

In my book-in-progress, I go on to talk about exploring categorical variables with multiple categories – not just binary variables – but that’s all for today. Again, if you’re interested in knowing more as this project progress, please subscribe to this blog via email from the box in the right sidebar.

End notes

1) If you must know, the value of chi-square is the sum of the squared standardized residuals across all cells in the table. The standardized residual describes the extent to which the observed count differs from the expected count. You can display this statistic in Data Desk along with the others, but there is no need. Another abbreviation in the table is “df”, which stands for degrees of freedom. Probability values of chi-square depend upon the degrees of freedom, which in turn is related to the number of rows and columns in the table.

2) Use and misuse of p-values is a hot topic in statistics, especially in connection with academic publishing, where putting too much stock in p-values and the rather arbitrary 0.05 threshold causes a great deal of angst. As predictive modelers we need to keep these controversies in perspective: We are not testing the effects of new medical treatments nor are we seeking to publish in a scientific journal. We are making use of “good-enough” tools that will produce a useful result. That said, applying the 0.05 threshold without judgment and common sense means that we will detect “relationships” that are not real – potentially about 5% of the time!

## 26 August 2015

### Exploring associations between variables

Filed under: Book, CoolData, Predictor variables — Tags: , , , — kevinmacdonell @ 6:57 pm

CoolData has been quiet over the summer, mainly because I’ve been busy writing another book. (Fine weather has a bit to do with it, too.) The book will be for nonprofit and higher education advancement professionals interested in learning how to use multiple regression to build predictive models. Over the next few months, I will adapt various bits from the work-in-progress as individual posts here on CoolData.

I’ll have more to say about the book later, so if you’re interested, I suggest subscribing via email (see the box to the right) to have the inside track on this project. (And if you aren’t familiar with the previous book, co-written with Peter Wylie, then have a look here.)

A generous chunk of the book is about the specifics of getting your hands dirty with cleaning up your messy data, transforming it to make it suitable for regression analysis, and exploring it for interesting patterns that can strengthen a predictive model.

When you import a data set into Data Desk or other statistics package, you are looking at more than just a jumble of variables. All these variables are in a relation; they are linked by patterns. Some variables are strongly associated with each other, others have weaker associations, and some are hardly related to each other at all.

What is meant by “association”? A classic example is a data set of children’s weights, heights, and ages. Older children tend to weigh more and be taller than younger children. Heavier children tend to be older and taller than younger children. We say that there is an association between age and weight, between height and weight, and between age and height.

Another example: Alumni who are bigger donors tend to attend more reunion events than alumni who give more modestly or don’t give at all. Or put the other way, alumni who attend more events tend to give more than alumni who attend fewer or no events. There is an association between giving and attending events.

This sounds simple enough — even obvious. The powerful consequence of these truths is that if we know the value of one variable, we can make a guess at the value of another, as long as the association is valid. So if we know a child’s weight and height, we can make a good guess of his or her age. If we know a child’s height, we can guess weight. If we know how many reunions an alumna has attended, we can make a guess about her level of giving. If we know how much she has given, we can guess whether she’s attended more or fewer reunions than other alumni.

We are guessing an unknown value (say, giving) based on a known value (number of events attended). But note that “giving” is not really an unknown. We’ve got everyone’s giving recorded in the database. What is really unknown is an alum’s or a donor’s potential for future giving. With predictive modeling, we are making a guess at what the value of a variable will be in the (near) future, based on the current value of other variables, and the type and degree of association they have had historically.

These guesses will be far from perfect. We aren’t going to be bang-on in our guesses of children’s ages based on weight and height, and we certainly aren’t going to be very accurate with our estimates of giving based on event attendance. Even trickier, projecting into the future — estimating potential — is going to be very approximate.

Still, our guesses will be informed guesses, as long as the associations we detect are real and not due to random variation in our data. Can we predict exactly how much each donor is going to give over this coming year? No, that would be putting too much confidence in our powers. But we can expect to have plenty of success in ranking our constituents in order by how likely they are to engage in whatever behaviour we are interested in, and that knowledge will be of great value to the business.

Looking for potentially useful associations is part of data exploration, which is best done in full hands-on mode! In a future post I will talk about specific techniques for exploring different types of variables.

## 28 June 2015

### Data mining in the archives

Filed under: Data, Predictor variables — Tags: , , , — kevinmacdonell @ 6:24 pm

When I was a student, I worked in a university archives to earn a little money. I spent many hours penciling consecutive index numbers onto acid-free paper folders, on the ultra-quiet top floor of the library. It was as dull a job as one can imagine.

Today’s post is not about that kind of archive. I’m talking about database archive views, also called snapshots. They’re useful for reporting and business intelligence, but they can also play a role in predictive modelling.

What is an archive view?

Think of a basic stat such as “number of living alumni”. This number changes constantly as new alumni join the fold and others are identified as deceased. A straightforward query will tell you how many living alumni there are, but that number will be out of date tomorrow. What if someone asks you how many living alumni you had a year ago? Then it’s necessary to take grad dates and death dates into account in order to generate an estimate. Or, you look the number up in previously-reported statistics.

A database archive view makes such reporting relatively easy by preserving the exact status of a record at regular points in time. The ideal archive is a materialized view in a data warehouse. On a given schedule (yearly, quarterly, or even monthly), an automated process adds fresh rows to an archive table that keeps getting longer and longer. You’re likely reliant on central IT services to set it up.

“Number of living alumni” is an important denominator for such key ratios as the percentage of alumni for whom you have contact information (mail, phone, email) and participation rates (the proportion of alumni who give). Every gift is entered as an individual transaction record with a specific date, which enables reporting on historical giving activity. This tends not to be true of contact information. Even though mailing addresses may be added one after another, without overwriting older addresses, the key piece of information is whether the address is coded ‘valid’ or ‘invalid’. This changes all the time, and your database may not preserve a history of those changes. Contact information records may have “To” and “From” dates associated with them, but your query will need to do a lot of relative-date calculations to determine if someone was both alive and had a valid address for any given point in time in the past.

An archive table obviates the need for this complex logic, and ours looks like the example below. There’s the unique ID of each individual, the archive date, and a series of binary indicator variables — ‘1’ for “yes, this data is present” or ‘0’ for “this data is absent”.

Here we see three individuals and how their data has changed over three months in 2015. This is sorted by ID rather than by the order in which the records were added to the archive, so that you can see the journey each person has taken in that time:

• A00001 had no valid email in the database in February and March, but we obtained it in time for the April 1 snapshot.
• A00002 had no contact information at all until just before March, when a phone append supplied us with a new number. The number proved to be invalid, however, and when we coded it as such in the database, the indicator reverted to zero.
• A00003 appeared in our data in February and March, but that person was coded deceased in the database before April 1, and was excluded from the April snapshot.

That last bullet point is important. Once someone has died, continuing to include them as a row in the archive every month would be a waste of resources. In your reporting software, a simple count of records by archive date will give you the number of living alumni. A simple count of ‘Address Indicator’ will give you the number of alumni with valid addresses. Dividing the number of valid addresses by the number of living alumni (and multiplying by 100) will give you the percentage of living alumni that are addressable for that month. (Reporting software such as Tableau will make very quick work of all this.)

Because an archive view preserves changing statuses over time at the level of the individual constituent, it can be used for reporting trends along any slice you choose (age bracket, geography, school, etc.), and can play a role in staff activity/performance reporting and alumni engagement scoring.

But enough about archive views themselves. Let’s talk about using them for predictive modelling.

In the archive example above, you see a bunch of 0/1 indicator variables. Indicator variables are common in predictive modelling. For example, “Mailing address present” can have one of two states: Present or not present. It’s binary. A frequency breakdown of my data at this point in time looks like this (in Data Desk):

About 78% of living alumni have a valid address in the database today — the records with an address indicator of ‘1’. As you might expect, alumni with a good address are more likely to have given than alumni without, and they have much higher lifetime giving on average. In the models I build to predict likelihood to give (and give at higher levels), I almost always make use of this association between contact information and giving.

But what about using the archive view data instead? The ‘Address Indicator’ variable breakdown above shows me the current situation, but the archive view adds depth by going back in time. Our own archive has been taking monthly snapshots since December of last year — seven distinct points in time. Summing on “Address Indicator” for each ID shows that large numbers of alumni have either never had a valid address during that time (0 out of 7 months), or always did (7 out of 7). The rest had a change of status during the period, and therefore fall between 0 and 7:

A few hundred alumni (387) had a valid address in one out of seven months, 143 had a valid address in two out of seven — and so on. Our archive is still very young; only about 1% of alumni have a count that is not 0 or 7. A year from now, we can expect to see far more constituents populating the middle ground.

What is most interesting to me is an apparent relationship between “number of months with valid address” (x-axis) and average lifetime giving (y-axis), even with the relative scarcity of data:

My real question, of course, is whether these summed, continuous indicators really make much of a difference in a model over simply using the more familiar binary variables. The answer is “not yet — but someday.” As I noted earlier, only about 1% of living alumni have changed status in the past seven months, so even though this relationship seems linear, the numbers aren’t there to influence the strength of correlation. The Pearson correlation for “Address Indicator” (0/1) and “Lifetime Giving” is 0.186, which is identical to the Pearson correlation for “Address Count” (0 to 7) and “Lifetime Giving.” For all other variables except one, the archive counts have only very slightly higher correlations with Lifetime Giving than the straight indicator variables. (Email is slightly lower.)

It’s early days yet. All I can say is that there is potential. Have a look at this pair of regression analyses, both using Lifetime Giving (log-transformed) as the dependent variable. (Click on image for larger view.) In the window on the left, all the independent variables are the regular binary indicator variables. On the right, the independent variables are counts from our archive view. The difference in R-squared from one model to the other is very slight, but headed in the right direction: From 12.7% to 13.0%.

Looking back on my student days, I cannot deny that I enjoyed even the quiet, dull hours spent in the university archives. Fortunately, though, and due in no small part to cool data like this, my work since then has been a lot more interesting. Stay tuned for more from our archives.

## 7 January 2015

### New finds in old models

When you build a predictive model, you can never be sure it’s any good until it’s too late. Deploying a mediocre model isn’t the worst mistake you can make, though. The worst mistake would be to build a second mediocre model because you haven’t learned anything from the failure of the first.

Performance against a holdout data set for validation is not a reliable indicator of actual performance after deployment. Validation may help you decide which of two or more competing models to use, or it may provide reassurance that your one model isn’t total junk. It’s not proof of anything, though. Those lovely predictors, highly correlated with the outcome, could be fooling you. There are no guarantees they’re predictive of results over the year to come.

In the end, the only real evidence of a model’s worth is how it performs on real results. The problem is, those results happen in the future. So what is one to do?

I’ve long been fascinated with Planned Giving likelihood. Making a bequest seems like the ultimate gesture of institutional affinity (ultimate in every sense). On the plus side, that kind of affinity ought to be clearly evidenced in behaviours such as event attendance, giving, volunteering and so on. On the negative side, Planned Giving interest is uncommon enough that comparing expectancies with non-expectancies will sometimes lead to false predictors based on sparse data. For this reason, my goal of building a reliable model for predicting Planned Giving likelihood has been elusive.

Given that a validation data set taken from the same time period as the training data can produce misleading correlations, I wondered whether I could do one better: That is, be able to draw my holdout sample not from data of the same time period as that used to build the model, but from the future.

As it turned out, yes, I could.

Every year I save my regression analyses as Data Desk files. Although I assess the performance of the output scores, I don’t often go back to the model files themselves. However, they’re there as a document of how I approached modelling problems in the past. As a side benefit, each file is also a snapshot of the alumni population at that point in time. These data sets may consist of a hundred or more candidate predictor variables — a well-rounded picture.

My thinking went like this: Every old model file represents data from the past. If I pretend that this snapshot is really the present, then in order to have access to knowledge of the future, all I have to do is look at today’s data stored in the database.

For example, for this blog post, I reached back two years to a model I created in Data Desk for predicting likelihood to upgrade to the Leadership level in Annual Giving. I wasn’t interested in the model itself. Rather, I wanted to examine the underlying variables I had to work with at the time. This model had been an ambitious undertaking, with some 170 variables prepared for analysis. Many of course were transformations of variables or combinations of interacting variables. Among all those variables was one indicating whether a case was a current Planned Giving expectancy or not, at that point in time.

In this snapshot of the database from two years ago, some of the cases that were not expectancies would have become so since then. In other words, I now had the best of both worlds. I had a comprehensive set of potential predictors as they existed two years ago, AND access to the hitherto unknowable future: The identities of the people who had become expectancies after the predictors had been frozen in time.

As I said, my old model was not intended to predict Planned Giving inclination. So I built a new model, using “Is an Expectancy” (0/1) as the target variable. I trained the regression model on the two-year-old expectancy data — I didn’t even look at the new expectancies while building the model. No: I used those new expectancies as my validation data set.

“Validation” might be too strong a word, given that there were only 80 or so new cases. That’s a lot of bequest intentions, for sure, but in terms of data it’s a drop in the bucket compared with the number of cases being scored. Let’s call it a test data set. I used this test set to help me analyze the model, in a couple of ways.

First I looked at how new expectancies were scored by the model I had just built. The chart below shows their distribution by score decile. Slightly more than 50% of new expectancies were in the top decile. This looks pretty good — keeping in mind that this is what actual performance would have looked like had I really built this model two years ago (which I could have):

(Even better, looking at percentiles, most of the expectancies in that top 10% are concentrated nicely in the top few percentiles.)

But I didn’t stop there. It is also evident that almost half of new expectancies fell outside the top 10 percent of scores, so clearly there was room for improvement. My next step was to examine the individual predictors I had used in the model. These were of course the predictors most highly correlated with being an expectancy. They were roughly the following:
• Year person’s personal information in the database was last updated
• Number of events attended
• Age
• Year of first gift
• Number of alumni activities
• Indicated “likely to donate” on 2009 alumni survey
• Total giving in last five years (log transformed)
• Combined length of name Prefix + Suffix

I ranked the correlation of each of these with the 0/1 indicator meaning “new expectancy,” and found that most of the predictors were still fine, although they changed their order in the rank correlation. Donor likelihood (from survey) and recent giving were more important, and alumni activities and how recently a person’s record was updated were less important.

This was interesting and useful, but what was even more useful was looking at the correlations between ALL potential predictors and the state of being a new expectancy. A number of predictors that would have been too far down the ranked list to consider using two years ago were suddenly looking much better. In particular, many variables related to participation in alumni surveys bubbled closer to the top as potentially significant.

This exercise suggests a way to proceed with iterative, yearly improvements to some of your standard models:
• Dig up an old model from a year or more ago.
• Query the database for new cases that represent the target variable, and merge them with the old datafile.
• Assess how your model performed or, if you created more than one model, see which model would have performed best. (You should be doing this anyway.)
• Go a layer deeper, by studying the variables that went into those models — the data “as it was” — to see which variables had correlations that tricked you into believing they were predictive, and which variables truly held predictive power but may have been overlooked.
• Apply what you learn to the next iteration of the model. Leave out the variables with spurious correlations, and give special consideration to variables that may have been underestimated before.

## 16 January 2012

### Address updates and affinity: Consider the source

Filed under: Correlation, Predictor variables, skeptics — Tags: , , , , — kevinmacdonell @ 1:03 pm

Some of the best predictors in my models are related to the presence or absence of phone numbers and addresses. For example, the presence of a business phone is usually a highly significant predictor of giving. As well, a count of either phone or address updates present in the database is also highly correlated with giving.

Some people have difficulty accepting this as useful information. The most common objection I hear is that such updates can easily come from research and data appends, and are therefore not signals of affinity at all. And that would be true: Any data that exists solely because you bought it or looked it up doesn’t tell you how someone feels about your institution. (Aside from the fact that you had to go looking for them in the first place — which I’ve observed is negatively correlated with giving.)

Sometimes this objection comes from someone who is just learning data mining. Then I know I’m dealing with someone who’s perceptive. They obviously get it, to some degree — they understand there’s potentially a problem.

I’m less impressed when I hear it from knowledgeable people, who say they avoid contact information in their variable selection altogether. I think that’s a shame, and a signal that they aren’t willing to put in the work to a) understand the data they’re working with, or b) take steps to counteract the perceived taint in the data.

If you took the trouble to understand your data (and why wouldn’t you), you’d find out soon enough if the variables are useable:

• If the majority of phone numbers or business addresses or what-have-you are present in the database only because they came off donors’ cheques, then you’re right in not using it to predict giving. It’s not independent of giving and will harm your model. The telltale sign might be a correlation with the target variable that exceeds correlations for all your other variables.
• If the information could have come to you any number of ways (with gift transactions being only one of them), then use with caution. That is, be alert if the correlation looks too good to be true. This is the most likely scenario, which I will discuss in detail shortly.
• If the information could only have come from data appends or research, then you’ve got nothing much to worry about: The correlation with giving will be so weak that the variable probably won’t make it into your model at all. Or it may be a negative predictor, highlighting the people who allowed themselves to become lost in the first place. An exception to the “don’t worry” policy would be if research is conducted mainly to find past donors who have become lost — then there might be a strong correlation that will lead you astray.

An in-house predictive modeler will simply know what the case is, or will take the trouble to find out. A vendor hired to do the work may or may not bother — I don’t know. As far as my own models are concerned, I know that addresses and phone numbers come to us via a mix of voluntary and involuntary means: Via Phonathon, forms on the website, records research, and so on.

I’ve found that a simple count of all historical address updates for each alum is positively correlated with giving. But a line plot of the relationship between number of address updates and average lifetime giving suggests there’s more going on under the surface. Average lifetime giving goes up sharply for the first half-dozen or so updates, and then falls away just as sharply. This might indicate a couple of opposing forces: Alumni who keep us informed of their locations are more likely to be donors, but alumni who are perpetually lost and need to be found via research are less likely to be donors.

If you’re lucky, your database not only has a field in which to record the source of updates, but your records office is making good use of it. Our database happens to have almost 40 different codes for the source, applied to some 300,000 changes of address and/or phone number. Not surprisingly, some of these are not in regular use — some account for fewer than one-tenth of one percent of updates, and will have no significance in a model on their own.

For the most common source types, though, an analysis of their association with giving is very interesting. Some codes are positively correlated with giving, some negatively. In most cases, a variable is positive or negative depending on whether the update was triggered by the alum (positive), or by the institution (negative). On the other hand, address updates that come to us via Phonathon are negatively correlated with giving, possibly because by-mail donors tend not to need a phone call — if ‘giving’ were restricted to phone solicitation only, perhaps the association might flip toward the positive. Other variables that I thought should be positive were actually flat. But it’s all interesting stuff.

For every source code, a line plot of average LT giving and number of updates is useful, because the relationship is rarely linear. The relationship might be positive up to a point, then drop off sharply, or maybe the reverse will be true. Knowing this will suggest ways to re-express the variable. I’ve found that alumni who have a single update based on the National Change of Address database have given more than alumni who have no NCOA updates. However, average giving plummets for every additional NCOA update. If we have to keep going out there to find you, it probably means you don’t want to be found!

Classifying contact updates by source is more work, of course, and it won’t always pay off. But it’s worth exploring if your goal is to produce better, more accurate models.

## 3 September 2010

### More on making hay from variables that interact

A short while ago I wrote about pairs of predictor variables that are highly correlated with each other, i.e. that have strong interactions in regression analysis. (Making hay when predictor variables interact.) The example I used was Position Present and Employer Name Present. Instead of using one and throwing the other away as redundant, you can combine them to form a new variable with more predictive power than either of the original two on their own.

In this post, I’ll show you how to identify other likely pairs of variables from which you can try to make similar “combination variables.”

When independent variables interact in regression, it’s called multicollinearity. (You can read a good discussion of multicollinearity on the Stats Make Me Cry blog: Top Ten Confusing Stats Terms Explained in Plain English, #9: Multicollinearity.) Position Present and Employer Name Present is an obvious example, but all kinds of subtle combinations are possible and difficult to foresee. We need to call on some help in detecting the interaction. That help is provided by Pearson’s Product-Moment Correlation, also known as Pearson’s r. I’ve written about Pearson’s r before.

In a nutshell, Pearson’s r calculates a number that describes the strength of linear correlation between any two variables. Your stats software makes this easy. In DataDesk, I select all the icons of the variables I want to calculate correlations for, then find Pearson’s r in the menu. The result is a new window containing a table full of values. If you include many variables at once, this table will be massive. Click on the image below for a full-size version of a Pearson table based on some real data from a university. (You might have to enlarge it in your browser.) The table works exactly like those distance tables you find on old tourist highway maps (they don’t seem to make those anymore — wonder why); to find the distance from, say Albuquerque to Santa Fe, you’d find the number at the intersection of the Albuquerque column with the Santa Fe row, and that would be the number of miles to travel. In the table below, the cities are variables and the mileage is Pearson’s r.

Don’t be intimidated by all the numbers! Just let your eye wander over them. Notice that some are positive, some negative. The negative sign simply means that the correlation between the two variables is negative. Notice also that most of the numbers are small, less than 0.1 (or minus 0.1). As far as multicollinearity is concerned, we’re most interested in these bigger values, i.e. values that are furthest from zero (no correlation) and closest to 1 (perfect linear correlation).

I realize some of the variable names will be a bit mysterious, but you might be able to guess that “Number deg” is Number of Degrees and that “Grad Y” means Graduated. Their Pearson’s r value (0.26) is one of the higher correlations, which makes sense, right?

Noticing certain correlations can teach you things about the data. ‘Female’ is correlated with ‘Class (Year)’ — because at this university, males outnumbered females years ago, but since the 1980s, females have outnumbered males by an ever-increasing factor. On the other hand, ‘Number HC’ (campus reunions attended) is negatively correlated with ‘Class Year’ — older alumni have attended more events (no big surprise), but also young alumni are not big on reunions at this institution.

Look at ‘Business Phone Present’ and ‘Employer Present’. Their r value is relatively high (0.376). I would test some variations of those two. You could add them together, so that the variable could range from 0 to 2. Or you could multiply them, to give you a binary variable that would have a value of 1 only if both of the original variables was 1. You might end up with a predictor that is more highly correlated with ‘Giving’ than either of the original two variables.

With non-binary variables such as ‘Class Year’ and ‘Number Events Attended’, the results of combining will be even more varied and interesting. What you do is up to you; there’s no harm in trying. When you’re done playing, just rank all your old and new variables in order by the absolute value of their strength of linear correlation with your predicted value (say, Giving), and see how the new variables fare.

Older Posts »