In a previous post I mentioned in passing that for a particular predictive model using multiple regression, I re-expressed the dependent variable (‘Giving’) as a logarithmic function of the value. A reader commented, “I’m hoping that you will some day address the reasons for re-expressing the DV as a log. I’ve been searching for a good explanation in this context.” I said I’d have to get back to him on that.
Well, that was two months ago. I had to do some research, because I didn’t have the right words to express the reasoning behind transforming the dependent variable. I consulted a variety of texts and synthesized the bits I found to produce the summary below, using examples you’re likely to see in a fundraising database. Some of this will be tough chewing for the stats-innocent; don’t feel you need to know it in order to use multiple regression. (For the keeners out there, just be aware that this discussion barely scratches the surface, as is typical of every topic in statistics and modeling!)
Multiple regression works most reliably when the variables that go into it have a well-known form. The “form” we’re talking about is the distribution of the data. If the distribution of your data approximates that of a theoretical probability distribution, we can perform calculations on the data that are based on assumptions we can make about that well-known theoretical distribution. (Got that?)
The best-known example of a ‘theoretical distribution’ is the normal distribution, a.k.a. the famous “bell curve.” It looks just like a bell. (Duh.) The properties and characteristics of the normal probability distribution are well understood, which matters for the validity of the results we see in our regression analysis (the p-values, for example, which inform us whether our predictive variables are significant or not).
Let’s say our dependent variable is ‘Lifetime Giving’. When we create a histogram of this variable, we can see that it isn’t distributed normally at all. There’s a whole pile of very small values at one end, and the larger values aren’t visible at all.
In order to make the variable better fit the assumptions underlying regression, we need to transform it. There are a number of ways to do this, but the most common for our purposes is to take the log of ‘Giving’. (This is easily done in Data Desk using a derived variable and the ‘log’ statement; just remember to take the log of ‘Giving’ plus a nominal value of 1, because you can’t take a log of zero.) When we call up a histogram of ‘Log of Lifetime Giving’, we can see that the distribution is significantly closer to the normal probability distribution. It’s a bit skewed to one side, but it’s a big improvement.
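If you’re not a Data Desk user, here is a minimal sketch of the same idea in Python. The giving values are simulated (a skewed lognormal sample stands in for real donor data), so treat it as an illustration of the transformation, not a recipe tied to any particular database:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated lifetime giving: heavily right-skewed, like most fundraising data.
rng = np.random.default_rng(42)
giving = np.round(rng.lognormal(mean=4.0, sigma=1.5, size=5000), 2)

# Take the log of (Giving + 1); the +1 keeps any zero-dollar records defined,
# since the log of zero is undefined.
log_giving = np.log(giving + 1)

# Compare the raw and transformed distributions side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(giving, bins=50)
ax1.set_title("Lifetime Giving (raw)")
ax2.hist(log_giving, bins=50)
ax2.set_title("Log of Lifetime Giving")
plt.tight_layout()
plt.show()
```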
For the sake of this demonstration, I have left out all the individuals who have no giving. All those zero values would mess with the distribution, and the effect of the transformation would not be as evident in my chart. In the real world, of course, we include the non-donors. The resulting DV is far from ideal, but again, it’s a big improvement over the untransformed variable.
Our goal in transforming variables is not to make them prettier and more symmetrical, but to make the relationship between variables more linear. Ultimately we want to produce a regression equation that “both characterizes the data and meets the conditions required for accurate statistical inference” (to quote Jacob Cohen et al., from the excellent text “Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences,” page 233).
Linear relationships that are not evident using an untransformed form of ‘Lifetime Giving’ may be rendered detectable after transformation. So, in short, we transform variables in hopes of improving the overall model, which after all is a linear model.
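To make “more linear” a little more concrete, here is a small sketch with invented data: a hypothetical predictor (years of engagement) is made to drive giving multiplicatively, so it correlates only weakly with raw giving but much more strongly with log-transformed giving. The variable names and the relationship are assumptions for demonstration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictor: years of engagement with the institution.
years = rng.uniform(0, 30, size=5000)

# Hypothetical giving that grows multiplicatively with engagement, plus noise.
giving = np.exp(0.15 * years + rng.normal(0, 1.0, size=5000))

# Pearson correlation with the raw DV versus the log-transformed DV.
r_raw = np.corrcoef(years, giving)[0, 1]
r_log = np.corrcoef(years, np.log(giving + 1))[0, 1]

print(f"Pearson r, raw giving: {r_raw:.2f}")
print(f"Pearson r, log giving: {r_log:.2f}")
```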
Excellent, thank you for this! This is a really useful explanation.
Comment by John — 4 March 2010 @ 1:36 pm
If you’re trying to use the model for predictive purposes, do you have any recommendations on how to deal with the non-normally distributed error?
I came across the problem quite a while ago when forecasting demand for a category of goods with a long-tailed sales pattern: some would sell exceedingly well, but most would sell in small volumes. Transforming the dependent variable created a very nice linear relationship with the independents, but using the same model for predictive purposes gave forecasts which, in practice, tended to have much wider upper bounds of error than lower bounds. Mathematically that made sense, but interpretatively it was fairly hard to communicate.
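To illustrate the asymmetry, here’s a minimal sketch assuming a model fit on the log scale with roughly normal errors; the predicted value and standard error below are invented purely for demonstration:

```python
import numpy as np

# Suppose a regression fit on log(sales) gives a predicted value and a
# residual standard error on the log scale (both numbers invented).
log_pred = 3.0   # predicted log(sales)
log_se = 0.5     # residual standard error on the log scale

# A symmetric ~95% interval on the log scale...
log_low, log_high = log_pred - 1.96 * log_se, log_pred + 1.96 * log_se

# ...becomes asymmetric once back-transformed to the original units.
pred = np.exp(log_pred)
low, high = np.exp(log_low), np.exp(log_high)

print(f"point forecast: {pred:.1f}")
print(f"lower bound:    {low:.1f}  (distance below: {pred - low:.1f})")
print(f"upper bound:    {high:.1f}  (distance above: {high - pred:.1f})")
```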
Are there any techniques you’ve found that help?
Comment by Evan Stubbs — 8 March 2010 @ 12:14 am
Evan, the short answer to your question is “no.” I am somewhat aware of the issues caused by distributions violating the assumptions underlying regression analysis, such as your Poisson-like long-tail pattern, which is partly why I avoid modeling for rare events. (I know better than to pretend to be able to predict million-dollar gifts from the alumni of our small school.)
On one occasion I did examine the distribution of the error but it yielded no useful insights, perhaps because a lack of knowledge meant I couldn’t interpret what I was looking at! I could toss out ideas (that another flavour of regression is required, based on different assumptions about the distribution of your DV, or that the correlations may not be linear to begin with), but I suspect your insights will be superior to mine. (I invite others to comment, of course.)
My predicted values, it might be relevant to add, are used only to rank individuals by likelihood to engage in the behaviour of interest. The results delivered to end users are fairly broad categories (e.g. deciles containing thousands of individuals, or percentiles containing hundreds of individuals), rather than precise predicted values (e.g. predicted gift dollars) or strict Yes/No results.
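For what it’s worth, a minimal sketch of that ranking step, assuming pandas and an invented set of predicted scores (nothing here comes from a real model):

```python
import numpy as np
import pandas as pd

# Invented predicted scores from a regression model, one per constituent.
rng = np.random.default_rng(1)
scores = pd.Series(rng.normal(size=10_000), name="predicted_score")

# Rank everyone into deciles: 10 = most likely to engage in the behaviour
# of interest, 1 = least likely. Ranking first avoids ties at bin edges.
deciles = pd.qcut(scores.rank(method="first"), q=10, labels=list(range(1, 11)))

print(deciles.value_counts().sort_index())  # 1,000 constituents per decile
```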
In other words, I deliver results that are an improvement on throwing darts at a board – sometimes a big improvement, sometimes not as big. I avoid truly problematic predictive tasks, and I do not get too deep into some of the finer points of regression.
I presume that answer is inadequate, but that’s all I’ve got.
Comment by kevinmacdonell — 8 March 2010 @ 8:02 am
No, that’s actually a good answer in my books … I was curious to know whether you’d come up with anything, mainly because I didn’t manage to!
In retrospect, I think you’re probably right: an alternative approach may have been better. We still achieved everything we needed to; I just think we may have done more work than was necessary to satisfy the requirements of normality, homogeneity, etc. Part of the problem was that we were really interested in the differentiation between the high-performing goods and the average performers. A binary or binned target would have been nice, but we needed a fairly granular prediction, as the numbers were feeding into a preliminary business case for the first round of an engineering project …
Fun stuff though, and interesting. Write more! 😉
Comment by Evan Stubbs — 10 March 2010 @ 9:57 pm
[…] to make the linear relationships among variables much more evident. (Why we transform variables is explained more fully here.) If we used ‘giving’ just as it was, the Pearson values would be very low, which would […]
Pingback by Pearson product-moment correlation coefficient « CoolData blog — 28 April 2010 @ 8:54 am