CoolData blog

13 June 2016

Nifty SQL regression to calculate donors’ giving trends

Filed under: Coolness, Predictor variables, regression, SQL — Tags: , , , — kevinmacdonell @ 8:28 pm

Here’s a nifty bit of SQL that calculates a best-fit line through a donor’s years of cash-in giving by fiscal year (ignoring years with no giving), and classifies that donor in terms of how steeply they are “rising” or “falling”.

I’ll show you the sample code, which you will obviously have to modify for your own database, and then talk a little bit about how I tested it. (I know this works in Oracle version 11g. Not sure about earlier versions, or other database systems.)

```with sums AS (
select t1.id, t1.fiscal_year, log(10, sum(t1.amount)) AS yr_sum
group by t1.id, t1.fiscal_year),

slopes AS (
select distinct
sums.id,
regr_slope(sums.yr_sum,sums.fiscal_year) OVER (partition by sums.id) AS slope

from sums
)

select
slopes.id,
slopes.slope,
CASE
when slopes.slope is null then 'Null'
when slopes.slope >=0.1 then 'Steeply Rising'
when slopes.slope >=0.05 then 'Moderately Rising'
when slopes.slope >=0.01 then 'Slightly Rising'
when slopes.slope >-0.01 then 'Flat'
when slopes.slope >-0.05 then 'Slightly Falling'
when slopes.slope >-0.1 then 'Moderately Falling'
else 'Steeply Falling' end AS description

from slopes```
That’s it. Not a lot of SQL, and it runs very quickly (for me). But does it actually tell us anything?

I devised a simple test. Adapting this query, I calculated the “slope of giving” for all donors over a five-year period in the past: FY 2007 to FY 2011. I wanted to see if this slope could predict whether, and by how much, a donor’s giving would rise or fall in the next five-year period: FY 2012 to FY 2016. (Note that the sum of a donor’s giving in each year is log-transformed, in order to better handle outlier donors with very large giving totals.)

I assembled a data file with each donor’s sum of cash giving for the first five-year period, the slope of their giving in that period, and the sum of their cash giving for the five-year period after that.

The first test was to see how the categories of slope, from Steeply Rising to Steeply Falling, translated into subsequent rises and falls. In Data Desk, I compared the two five-year periods. If the second period’s giving was greater than the first, I called that a “rise.” If it was less, I called it a “fall.” And if it was exactly the same, I called it “Same.”

The table below summarizes the results. Note that these numbers are all percentages, summed horizontally. (I will explain the colour highlighting later on.)

For Steeply Rising, 60.6% of donors actually FELL from the first period to the next. Only 37.8 percent rose, and just 1.6% stayed exactly the same. Not terribly impressive. Look at Steeply Falling, though: More than three-quarters actually did fall. That’s a better result, but then again, “Falling” dominates for every category; in the whole file, close to 70% of all donors reduced their giving in the next period. If a donor has no giving in the second period of five years, that’s zero dollars given, and this is called a “Fall” — more on that aspect in just a sec.

(I’ve left out donors with a FY2007-11 slope of Null — they’re the ones who gave in only one year and therefore don’t have a “slope”.)

Let’s not give up just yet, however. The colour highlighting indicates how high each percentage value is in relation to those above and below it. For example, the highest percentages in the Falling column are found in the Slightly, Moderately, and especially Steeply Falling slope categories. The highest percentages in the Rising column are in the Slightly, Moderately, and Steeply Rising slope categories. And in the Same column, the Flat slope wins hands-down — as we would hope.

So a rising slope “sort of” predicts increased giving, a falling slope “sort of” predicts decreased giving. Unfortunately, many donors are not retained into the second five-year period, so there’s not a lot to be confident about.

But what if a donor IS retained? What if we exclude the lapsed donors entirely? Let’s do that:

Excluding non-donors seems to lead to an improvement … The slope does a better job sorting between the risers and fallers when a donor is actually retained. Again, the colour highlighting is referencing columns, not rows. But notice now that, across the rows, Rising has a slight majority for the Rising slope categories, and Falling has a slight majority for the Falling slope categories. (The bar is set too high for Flat, however, given that a donor’s giving in the first five years has to be exactly equal to her giving in the second five years to be called Same.)

Admittedly, these majorities are not generous. If I calculated a donor’s slope of giving as Steeply Rising and that donor was retained, I have only a 56.4% chance of actually being right. And of course there’s no guarantee that donor won’t lapse.

(Note that these are donors of all types — alumni, non-alumni individuals, and entities such as corporations and foundations. Non-alumni donors tend not to have patterns in their giving that are repeated, not to the extent that alumni do. However, when I limit the data file to alumni donors only, the improvement in this method is only very slight.)

Pressing on … I did a regression analysis using total giving in the second five-year period as the dependent variable, then entered total giving in the prior five-year period as an independent variable. (Naturally, R-squared was very high.) This allowed me to see if Slope provides any explanatory power when it is added as the second independent variable — the effect of giving in the first five-year period already being accounted for.

And the answer is, yes, it does. But only under specific conditions: Both five-year giving totals were log-transformed and, most significantly, donors who did not give in the second period were excluded from the regression.

There are other way to assess the usefulness of “slope” which might lead to an application, and I encourage you to give this a try with your own data. From past experience I know that donors who make big upgrades in giving don’t have any neat universal pattern such as an upward slope in their giving history. (The concept of volatility is explored here and here.) “Slope” is probably too simple a characteristic to employ on its own.

But as I’ve said before, if it were easy, obvious, or intuitive, it wouldn’t be data analysis.

26 January 2012

More mistakes I’ve made

Filed under: Best practices, Peter Wylie, Pitfalls, Validation — Tags: , , , — kevinmacdonell @ 1:38 pm

A while back I wrote a couple of posts about mistakes I’ve made in data mining and predictive modelling. (See Four mistakes I have made and When your predictive model sucks.) Today I’m pleased to point out a brand new one.

The last days of work leading up to Christmas had me evaluating my new-donor acquisition models to see how well they’ve been working. Unfortunately, they were not working well. I had hoped — I had expected — to see newly-acquired donors clustered in the upper ranges of the decile scores I had created. Instead they were scattered all along the whole range. A solicitation conducted at random would have performed nearly as well.

Our mailing was restricted by score (roughly the top two deciles only), but our phone solicitation was more broad, so donors came from the whole range of deciles:

Very disappointing. To tell the truth, I had seen this before: A model that does well predicting overall participation, but which fails to identify which non-donors are most likely to convert. I am well past the point of being impressed by a model that tells me what everyone already knows, i.e. that loyal donors are most likely to give again. I want to have confidence that acquisition mail dollars are spent wisely.

So it was back to the drawing board. I considered whether my model was suffering from overfit, whether perhaps I had too many variables, too much random noise, multicolinearity. I studied and rejected one possibility after another. After so much effort, I came rather close to concluding that new-donor acquisition is not just difficult — it might be darn near impossible.

Dire possibility indeed. If you can’t predict conversion, then why bother with any of this?

It was during a phone conversation with Peter Wylie that things suddenly became clear. He asked me one question: How did I define my dependent variable? I checked, and found that my DV was named “Recent Donors.” That’s all it took to find where I had gone wrong.

As the name of the DV suggested, it turned out that the model was trained on a binary variable that flagged anyone who had made a gift in the past two years. The problem was that included everybody: long-time donors and newly-acquired donors alike. The model was highly influenced by the regular donors, and the new donors were lost in the shuffle.

It was a classic case of failing to properly define the question. If my goal was to identify the patterns and characteristics of newly-acquired donors, then I should have limited my DV strictly to non-donors who had recently converted to donors!

So I rebuilt the model, using the same data file and variables I had used to build the original model. This time, however, I pared the sample down to alumni who had never given a cent before fiscal 2009. They were the only alumni I needed to have scores for. Then I redefined my dependent variable so that non-donors who converted, i.e., who made a gift in either fiscal 2009 or 2010, were coded ‘1’, and all others were coded ‘0’. (I used two years of giving data instead of just one in order to have a little more data available for defining the DV.) Finally, I output a new set of decile scores from a binary logistic regression.

A test of the new scores showed that the new model was a vast improvement over the original. How did I test this? Recall that I reused the same data file from the original model. Therefore, it contained no giving data from the current fiscal year; the model was innocent of any knowledge of the future. Compare this breakdown of new donors with the one above:

Much better. Not fan-flippin-tastic, but better.

My error was a basic one — I’ve even cautioned about it in previous posts. Maybe I’m stupid, or maybe I’m just human. But like anyone who works with data, I can figure out when I’m wrong. That’s a huge advantage.

• Be skeptical about the quality of your work.
• Evaluate the results of your decisions.
• Document your mistakes and learn from them.
• Stay humble.

22 March 2011

Thoughts on model validation

Filed under: Model building, Pitfalls, regression, Validation — Tags: , — kevinmacdonell @ 11:46 am

I have written very little about model validation on CoolData, probably because I’ve always had conflicting thoughts about it.

Model validation is important for avoiding serious error when predicting behaviours that are relatively rare, such as major or planned giving. But if the event is rare, the data is sparse. When I need to train the data on a relatively small number of cases, I am loathe to rob my model of half the cases I need for training.

In an annual fund model it’s no big deal. If the predicted value is participation in the fund, that’s not a rare event and there is plenty of data. There’s no angst about going the prescribed route: splitting the file into two random halves before modeling begins — a training sample used to calculate the scores, and a validation sample on which to test the validity of those scores. The irony is, most annual fund models are robust and hardly require validation.

It’s exactly when doing the tricky modeling projects (major giving, planned giving) that validation is most important. If a major-giving model proves valid, that’s great — but it might have been even stronger had I kept the data file intact. If it proves faulty, how much of that can be blamed on the fact that half my cases are unavailable?

Validation of a model has very little to do with R squared, by the way. That statistic measures only how well a multiple linear regression model fits the data set you’re working with. It doesn’t tell you how well it will perform for prediction. A very high (60% or more) or very low (10% or less) value for R squared signals trouble with the design of your model; most models will fall into that safe zone, regardless of their usefulness for prediction.

My thinking is this. Validating a model has one of two results: Either you accept the model, or you throw it out. Unless you’ve got other scoring or segmentation tricks up your sleeve, the alternative to having a predictive model is using nothing at all. How often does implementing a model prove worse than having no model? I am willing to wager: almost never. Even a score set that contains a certain amount of random noise is an improvement over not using any score. If you’re going to build a single model and use it come what may, flaws and all, why bother to validate it? Not to mention that there is no clear dividing line between a valid model and an invalid one.

The key is that I do not create one model, but several. Creating more than one model gives me options. I no longer have to compare the validity of my model against some arbitrary rule of thumb — I can compare multiple models against each other and choose the best one. Instead of splitting my data file in half and losing half of my precious training cases, I set aside a small but reasonably representative random sample of cases. This sample is held out of the model, but also scored, allowing me to compare the relative success of each model.

The benefits are twofold: One, I don’t lose nearly as many cases to the validation set, and two, I can still get some idea about how good the model is. If all the models are lousy, it’s going to show up in how the holdout cases are distributed by score.

Here is an example. Recently I needed to develop a model to predict propensity to give at higher levels. The table below summarizes how I set out to build two very different models.

Without going into too much detail, you can see that I could have chosen to build a number of different models by mixing and matching the criteria available to me:

1. binary logistic regression or multiple linear regression?
2. include everyone in the model (even non-donors), or a narrowed-down selection of individuals?
3. binary dependent variable (set to represent any level of giving), or a continuous dependent variable (LT giving)?
4. giving-related variables included or excluded?

However, in the interest of time and other practical concerns I chose only two contrasting options. As well, some choices are made for me: If I want to score non-donors in my model, I am obliged to leave out all giving-related variables, for example. (I’ve written plenty about these scenarios recently, so I won’t get into that again.)

The one thing held constant between the models I built was the holdout sample. I chose 10 current major donors at random and held them out of both models. The models were innocent of these donors’ status, but used their characteristics to assign scores to them anyway. So whichever model did a better job of giving high scores to this sample was the one I chose to use in our program.

Only ten individuals? Yes, that’s a tiny holdout sample, but when you’re trying to model for propensity to give at an exclusive level, you need to conserve your precious training cases. As it turned out, ten holdout cases was enough to reveal differences in reliability. I was surprised at which model ended up winning the trophy — but that’s a post for another time.

14 March 2011

Correlation and you

Filed under: Correlation, Predictor variables, regression, Statistics — Tags: , , — kevinmacdonell @ 7:25 am

If you read books and blogs on statistics, eventually your understanding of even the most basic concepts will start to smear. Things we think ought to be well-established by now are matters of controversy. Ranking high on the list of most slippery concepts is correlation. In today’s post, I’m going to make the concept very complicated, and then I’m going to dismiss the complexity and make it all simple again. I’m telling you this in advance so I don’t lose you partway through.

Correlation is the foundation of predictive modeling. The degree to which one variable x, changes its value in relation to a second variable y, either positively or negatively, is the very definition of x‘s usefulness in predicting the value of y. The tool I use to quantify the strength of that relationship is Pearson Product-Moment Correlation, or Pearson’s r, which some statistics texts simply call “correlation”. (It would help if you read the blog post on Pearson before reading this one.)

Before we go any farther, I need to explain why I want to quantify the strength of relationships. At the start of any modeling project, I have a hundred or so potential predictor variables in my data file. Some are going to be excellent predictors, most will be only so-so, and others will have little or no association with the outcome variable. I want to introduce variables into the regression analysis in an order that makes sense, so that the best predictors are added first. Due to the complexity of interactions among variables, there is no telling in advance what will actually happen as variables are added, so any list of variables ordered by strength of correlation is merely a rough guide.

What I DON’T use Pearson’s r for much is exploring variables. At the exploration stage, before modeling begins, I will look at a variable a number of ways in order to get a sense of how valuable the variable will be, and how I might want to transform it or re-express it to make it better. I will look at how the variable is distributed, and compare average and median giving between groups (for example, Home Phone Present, Y/N). These and other techniques, as described in Peter Wylie’s book “Data Mining for Fundraisers,” are simpler, more direct, and often more helpful than abstract measures of correlation.

It’s only after I’ve done the exploration work and tweaked the variables for maximum effect that I’m ready to rank them in order by their correlation values. So, with that out of the way, let’s look at Pearson’s r in more detail.

The textbooks make it abundantly clear: Pearson’s r quantifies the relationship between two continuous variables that are linearly related. Right away, we’ve got a problem: Very few of the variables I work with are continuous, and most of the relationships I see do not meet the definition of “linear”. Yet, I use Pearson’s r exclusively. Does this mean I’ve been misusing and abusing the method?

I don’t see it that way, but it is interesting to read how these things are discussed in the literature.

You can assess whether a relationship between two variables is roughly linear by looking at a scatterplot of the variables. Below is a scatterplot of ‘Giving’ (log-transformed) and ‘Age’, created in Data Desk. It’s a big, messy cloud of points (some 80,000 of them!). A lot of the relationship is hidden by overplotting (the overlapping of points) along the bottom line — that row of points at the zero giving mark represents non-donors, and there are many more of them near the young end of that line than there are near the older end. Still, at least you can see a vague linear relationship in the upward fanning of the data: As age increases, so does lifetime giving. A best-fit line through the data would slope upward from left to right, and therefore the Pearson correlation value is high.

We’ve got two continuous variables, and a linear relationship, so we get the Statistician’s Seal of Approval: It’s okay to use Pearson’s r to measure strength of correlation between these two variables.

Unfortunately, as I said, most of the variables we use in predictive modeling are not continuous, and they don’t look like much of anything in a scatterplot. Here’s a scatterplot of a Likert-scale survey response. The survey question asked alumni how likely they are to donate to alma mater, and the scale runs from 1 to 5, with 5 being “very likely.” This is not a continuous variable, because there are no possible intermediate values among the five levels. It’s ordinal. The plot with Lifetime Giving is difficult to interpret, but it sure isn’t linear:

Yes, the line of points for the highest response, 5, does extend higher than any other line in terms of lifetime giving. But due to overplotting, there is no way to tell how many nondonors lurk in the single dots that appear at the foot of every line and which indicate zero dollars given. This in no way resembles a cloud of points through which one can imagine a best-fit line being drawn. I can TELL you that a positive response to this question is strongly associated with high levels of lifetime giving, and it is, but you could be forgiven for remaining unconvinced by this “evidence”.

Even worse are the most common predictor variables: indicator variables, in binary form (0/1). For example, let’s say I express the condition of being ‘Married’ as a binary variable. A scatterplot of ‘Giving’ and ‘Married’ is even less useful than the one for the survey question:

Yuck! We’ve got 80,000 data points all jammed up in two solid lines at zero (not married) and 1 (married). We can’t tell from this plot, but it just so happens that the not-married line contains a lot of points sitting at lifetime giving of zero, far more than the married line. Being married is associated with giving, but who could tell from this? There’s no way this relationship is linear.

One of the tests for the appropriateness of using Pearson’s r is whether a scatterplot of the variable looks like a “straight enough” line. Another test is that both variables are quantitative and continuous — not categorical (or ordinal) and discrete. The fact that I can take a categorical variable (Married) and re-express it as a number (0/1) makes no difference. Turning it into a number makes it possible to calculate Pearson’s r, but that doesn’t make it okay to do so.

So the textbooks tell us. What else do they have to say? Read on.

There are methods other than Pearson’s r which we can use to measure the degree of association between two variables, which do not require the presence of a linear relationship. One of these is Spearman’s Rho, which is the correlation between the ranks of two variables. Rho replaces the data values themselves with their ranks within each variable — so the lowest value in each variable becomes ‘1’, the next lowest becomes ‘2’, and so on — and then calculates to what degree those values are related between the two variables, either negatively or positively.

Spearman’s Rho is sometimes called Spearman Rank Correlation, which muddies the waters a bit as it implies that it’s a measure of correlation. It is, but the correlation is between the ranks, not the data values themselves. The bottom line is that a statistician would tell us that Spearman is the appropriate calculation for putting a value to the strength of association between two variables that are not linearly related but which may show a consistently increasing or decreasing trend. Unlike Pearson’s r, it is a nonparametric method — it is free of any requirement that the distribution of the variables look a certain way.

Spearman’s Rho has some special properties which I won’t get into, but overall it looks a lot like Pearson’s r. It can take on values between -1 and 1 (a perfect negative relationship to a perfect positive relationship, both called monotone relationships); values near zero indicate the absence of a relationship — just like Pearson’s r. And your stats software makes it equally easy to compute.

So, great. We’ve got Spearman’s Rho, which we are told is just the thing for analyzing the Likert scale variable I showed you earlier. What about the indicator variable (for ‘Married’)? Well, no, we’re told: You can’t use Pearson or Spearman’s to calculate correlation for categorical variables. For that, you need the point-biserial correlation coefficient.

Huh?

That’s right, another measure of correlation. In fact, there are many types of measures out there. I’ve got a sheet in front of me right now that lists more than eight different measures, the choice of which depends on the combination of variables you’re analyzing (two continuous variables, one continuous and one ordinal variable, two binary variables, one binary and one ordinal, etc. etc.). And that list is not exhaustive.

Another wrinkle is that some texts don’t call these relationships “correlation” at all. By a strict definition, if a relationship between variables it isn’t appropriate for Pearson’s r, then it ain’t correlation. We are supposed to call it by the more vague term “association.”

Hey, I’m cool with that. I’ll call it whatever you want. But what are the practical implications of all this?

As near as I can tell, ZERO.

Uh-huh, I’ve just made you read more than a thousand words on how complicated correlation is. Now I’m going to dismiss all of that with an imperial wave of my hand. I need to you ignore everything I’ve just said, for two reasons which I will elaborate on in a moment:

1. As I stated at the outset, the only reason I calculate correlations is to explore which variables are most likely to figure prominently in a regression analysis. For a rough ranking of variables, Pearson’s r is a “good enough” tool.
2. The correlation r is the basis of linear regression, which is our end-goal. Not Spearman, not point-biserial, nor any other measure.

Regarding point number one: We are not concerned about the precise value of the calculated association, only the approximate ranking of our variables. All we want to know is: which variables are probably most valuable for our model and should be added to the regression first? As it turns out, rankings using other measures of correlation (sorry — association) hardly vary from a Pearson’s r ranking. It’s extra bother for nothing.

And to point number two: There’s a real disconnect between what the textbooks say about correlation analysis and what they say about regression. It seems to me that if Pearson’s r is inappropriate for all but continuous, linearly-related variables, then we would also be told that only continuous, linearly-related variables can be used in regression. That’s not the case: Social-science researchers and modelers toss ordinal and binary variables into regressions with wild abandon. If we didn’t we’d have almost nothing left to work with.

The disconnect is bridged with an explanation I found in one university stats textbook. It’s touched on only briefly, and towards the end of the book. The gist is this: For 0/1 indicator variables added to a linear regression, the coefficient of correlation is not the slope of a line, as we are always told to understand it. The indicator acts to vertically shift the line, so that instead of one regression slope, we have two: The unshifted line if the indicator variable is equal to zero, and another line shifted vertically up or down (depending on the sign of the coefficient), if the value is 1. This seems like essential information, but hardly rates any discussion at all. That’s stats for you.

To summarize:

1. Don’t be sidetracked by warnings regarding measures of association/correlation that have specific uses but do not relate to your end goal: Building a regression model for the pragmatic purpose of making predictions.
2. Most of the time, scatterplots can’t tell you what you need to know, because most of our data is categorical.
3. Indicator variables (and ordinal variables) are materially different from the pretty, linearly-related variables, and they are absolutely OK for use in regression.
4. SAY “association”, but DO “correlation”.
5. Don’t feel bad if you’re having difficulty learning predictive modeling from a stats textbook. I can’t see how anyone could.

29 June 2010

Choosing the right flavour of regression

Filed under: Model building, regression, Statistics, Terminology, Uncategorized — Tags: , , — kevinmacdonell @ 5:50 am

(Creative Commons license. Click image for source.)

I use two types of regression analysis to build predictive models: multiple linear regression and binary logistic regression. Both are called “regression”, but they are very different animals. You can use either one to build a model, but which one is best for fundraising models?

The answer is that there is no best option that applies across the board. It depends on what you’re trying to predict, certainly, but even more so it depends on the data itself. The best option will not be obvious and will be revealed only in testing. I don’t mean to sound careless about proper statistical practice, but we work in the real world: It’s not so much a question of “which tool is most appropriate?” as “Which tool WORKS?”

One of the primary differences between the two types of regression is the definition of the dependent variable. In logistic regression, this outcome variable is either 1 or 0. (There are other forms of logistic regression with multiple nominal outcomes, but I’ll stick to binary outcomes for now.) An example might be “Is a donor / Is not a donor,” or “Is a Planned Giving expectancy/ Is not a planned giving expectancy.”

In multiple regression, the dependent variable is typically a continuous value, like giving expressed in real dollars (or log-transformed dollars). But the DV can also be a 0/1 value, just as in logistic regression. Technically using a binary variable violates one of the assumptions underlying multiple regression (a normal probability distribution of the DV), but that doesn’t necessarily invalidate the model as a powerful predictive tool. Again, what works?

Another difference, less important to my mind, is that the output of a multiple regression analysis is a predicted value that reflects the units (say, dollars) of the DV and may be interpretable as such (predicted lifetime giving, predicted gift amount, etc.), while the output of a logistic regression is a probability value. My practice is to transform both sorts of outputs into scores (deciles and percentiles) for all individuals under study; this allows me to refer to both model outputs simply as “likelihood” and compare them directly.

So which to use? I say, use both! If you want some extra confidence in the worth of your model, it isn’t that much trouble to prepare both score sets and see how they compare. The key is having a set of holdout cases that represent the behaviour of interest. If your model is predicting likelihood to become a Planned Giving expectancy, you first set aside some portion of existing PG expectancies, build the model without them, then see how well the model performed at assigning scores to that holdout set.

You can use this method of validation when you create only one model, too. But where do you set the bar for confidence in the model if you test only one? Having a rival model to compare with is very useful.

In my next post I will show you a real-world example, and explain how I decided which model worked best.

9 June 2010

Why multiple regression?

Filed under: Model building, regression, Statistics — Tags: , , , — kevinmacdonell @ 6:11 am

Not long ago I wrote about Pearson’s r, also known as Pearson’s Product-Moment Correlation Coefficient. This is a convenient statistical tool available in any stats software program (Excel can calculate it too) that yields a numerical measure of the strength of the correlation (linear dependence) between any two variables, X and Y.

I will show you how you can easily create a predictive score using only Pearson’s r — and why you probably shouldn’t!

Pearson’s r points the way toward weighting your predictor variables, according to how strongly correlated with your predicted value they actually are. If you assume all your predictor variables are worth the same (“1” for positive predictors and “-1” for negative predictors) you are imposing a subjective weighting on your variables. (I wrote about this limitation of the simple score in my most recent post, Beyond the simple score.)

Is Homecoming attendance more or less predictive than presence of an email address? Pearson’s r will tell you. Look at the example in the table above. Ten common predictor variables are listed in order of their strength of correlation with Lifetime Giving (log-transformed). It just so happens that the highest two correlations are negative; even though they are negative, according to their absolute value they are stronger correlations than any of the others in the list, so I put them at the top.  According to these values, “Email present” is relatively weak compared with “Number of Homecomings attended.”

Do you see where this is going? If you wanted to, you could use these correlations to directly create weighted scores for everyone in your database. Just multiply each variable by its Pearson value, sum up the products, and bingo — there’s your score.

You could do that, but I don’t think you should.

At least two of the predictor variables we want to use in our score are very closely related to each other: Employer present and Position present. They aren’t exactly alike: For some constituents you will have one piece of information and not the other. But on the whole, if one is present in your database for any given constituent, chances are you’ve got the other as well.

In other words, if you include both variables in your score, you’re double-counting the effect of employment information in your model — despite the fact that each is properly weighted by Pearson score. The reason is that Pearson’s r treats only two variables at a time: X and Y. It does NOT account for any interactions between multiple Xs.

Employment variables are only an obvious example. All of your variables will interact with each other to some degree, some strongly, others more subtly. By “interact with each other” I mean “correlate with each other.” In fact, we can use Pearson’s r to show which combinations of predictor variables are strongly correlated. The table below lists three variable pairs, drawn from real data, that exhibit strong interactions — including the employment example we’ve just mentioned.

The Pearson value for the employment variables is very close to 1, which indicates a nearly perfect positive correlation. The other two are more subtle, but make sense as well: Younger alumni will tend to be coded as Single in your database, and if we have a job title for someone, chances are we’ll also have a business phone number as well.

This overlapping of the explanatory effect of various X’s on Y will interfere with our ability to properly weight our predictors. Pearson’s Product-Moment Correlation Coefficient is important for understanding our variables, but not quite up to the task of directly created predictive scores. What now?

Well — multiple regression! Only regression will account for interactions among our predictor Xs, recalculating the coefficients (weightings) on the fly each time we add a new predictor variable. Working from the Pearson list at the top of this post, we would add Class year, Single, and Employer present to our regression window one by one. Everything would be fine up to that point; our p-values will be very low for these variables. When we add Position present, however, the p-value will be too high (0.183, which exceeds the rule-of-thumb value of 0.05), and R squared will fail to improve. We would therefore leave Position present out of the regression because it isn’t adding any new predictive information to the model and might interfere with the effectiveness of other variables in subtle and strange ways.

Often when I use the word “regression” on someone, what I see reflected back in their eyes is fear. (I really need to reserve that word for people I don’t like.) I wish, though, that people could see that regression is a bit like an automobile: A complex machine with many moving parts, but familiar and approachable, with a simple and comprehensible purpose, and above all operable by just about anyone.

Older Posts »