# CoolData blog

## 26 August 2010

### What the heck IS “Y”, anyway?

Filed under: Model building, Predictive scores, regression, Statistics — Tags: — kevinmacdonell @ 8:46 am

At the APRA Conference in Anaheim a month ago, a session attendee was troubled by something he saw during a presentation given by David Robertson of Syracuse. The attendee was focused on the “constant” value in David’s example of a multiple linear regression model for propensity to give. This constant, which I will talk about shortly, was some significant figure, say “50”. Because the Y value (i.e., the dependent variable, or outcome variable, or predicted value) was expressed in dollars (of giving), then this seemed to indicate that the “floor” for giving, the minimum value someone could be predicted to give, was \$50.

How do you figure that?, this attendee wanted to know. It’s a reasonable question, for which I will try to provide my own answer. (More knowledgeable stats people may wish to weigh in; it would be appreciated.) There are implications here for how we interpret the predicted value of “Y”.

When you do a regression analysis, your software will automatically calculate this “constant,” which is simply the first term (“a”) in the regression equation: In other words, if all your predictor variables (X’s) calculate out to zero, then Y will equal ‘a’. The part of this that the attendee found hard to swallow was that the minimum possible amount an alum could donate, as predicted by the model, was something greater than zero dollars. It seemed nonsensical.

Well, yes and no. First of all, the constant is no such thing. If you were to plot a regression line, that straight line has to cross the Y axis somewhere. The value of Y when the sum of X’s is zero is that crossing point (a.k.a. the Y-intercept). But that doesn’t mean it’s the minimum. Y does equal zero at a point: When the sum of predictors is negative — that is, when the regression line passes to the left of the Y axis and down, and crosses the X axis (the X-intercept).

So, really, you’re not learning much by looking at the constant. It’s a mathematical necessity — it describes an important aspect of what any line looks like when plotted — but that’s all. While the constant is always present in our regression analysis for predictive modeling, we tend to ignore it.

But all this is leading me to an even more fundamental question, the one posed in the headline: What is Y?

In David’s example, the one that so perplexed my fellow conference attendee, Y was expressed in real dollars. This is valid modeling practice. However, I have never looked at Y in real units (i.e., dollars), due to difficulty in interpreting the result. For example, the output of multiple linear regression can be negative: Does that mean the prospect is going to take money from us? As well, when we work with a transformed version of the DV (such as the natural log, which is very common), the output will need to be transformed back in order to make sense.

I sidestep issues of interpretation by simply assuming that the predicted value is meaningless in itself. What I am primarily interested in is relative probability, and where a value ranks in comparison with the values predicted for other individuals in the sample. In other words, is a prospect in the top 10% of alumni? Or the top 0.5%? Or the bottom 20%? The closer an individual is to the top of the heap, the more likely he or she is to give, and at higher levels.

I rank everyone in the sample by their predicted values, and then chop the sample up into deciles and percentiles. Percentiles, I am careful to explain, are not the same thing as probabilities: Someone in the 99th percentile is not 99% likely to make a gift. They might be 60% likely — it depends. The important thing is that someone in the 98th percentile will be slightly less likely to give, and someone in the 50th percentile will be MUCH less likely to give.

This highlights an important difference between multiple linear regression, which I’m talking about here, and binary logistic regression. The output of the latter form of regression is “probability”; very useful, and not so difficult to interpret. Not so with multiple linear regression — the output in this case is something different, which we may interpret in various ways but which will not directly give us values for probability.

Fortunately, fundraisers are already very familiar with the idea of ranking prospects in descending order by likelihood (or capacity, or inclination, or preferably some intelligent combination of these). Most people can readily understand what a percentile score means. For us data modelers, though, getting from “raw Y” to a neat score takes a little extra work.