In a previous post I mentioned in passing that for a particular predictive model using multiple regression, I re-expressed the dependent variable (‘Giving’) as a logarithmic function of the value. A reader commented, “I’m hoping that you will some day address the reasons for re-expressing the DV as a log. I’ve been searching for a good explanation in this context.” I said I’d have to get back to him on that.
Well, that was two months ago. I had to do some research, because I didn’t have the right words to express the reasoning behind transforming the dependent variable. I consulted a variety of texts and synthesized the bits I found to produce the summary below, using examples you’re likely to see in a fundraising database. Some of this will be tough chewing for the stats-innocent; do not feel you need to know this in order to use multiple regression. (For the keeners out there, just be aware that this discussion barely scratches the surface. That’s typical of all topics in statistics and modeling!)
Multiple regression works most reliably when the inputs come in a form that is well-known. The “form” we’re talking about is the distribution of the data. If the distribution of your data approximates that of a theoretical probability distribution, we can perform calculations on the data that are based on assumptions we can make about the well-known theoretical distribution. (Got that?)
The best-known theoretical distribution is the normal distribution, a.k.a. the famous “bell curve.” It looks just like a bell. (Duh.) The properties of the normal probability distribution are well understood, which is important for the validity of the results we see in our regression analysis: the p-values, for example, which tell us whether our predictor variables are significant or not.
Let’s say our dependent variable is ‘Lifetime Giving’. When we create a histogram of this variable, we can see that it isn’t distributed normally at all. There’s a whole pile of very small values at one end, and the larger values are so few and so spread out that their bars aren’t visible at all.
In order to make the variable better fit the assumptions underlying regression, we need to transform it. There are a number of ways to do this, but the most common for our purposes is to take the log of ‘Giving’. (This is easily done in Data Desk using a derived variable and the ‘log’ statement; just remember to take the log of ‘Giving’ plus a nominal value of 1, because the log of zero is undefined.) When we call up a histogram of ‘Log of Lifetime Giving’, we can see that the distribution is much closer to the normal probability distribution. It’s still a bit skewed to one side, but it’s a big improvement.
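For readers who work outside Data Desk, here is a minimal sketch of the same transformation in Python with NumPy. The giving totals are made up for illustration, the function names are my own, and I use the natural log; a different base only rescales the result. Skewness stands in for eyeballing the histogram: values near zero mean a roughly symmetric distribution, and large positive values mean a long tail to the right.

```python
import numpy as np

def log_giving(giving):
    """Return log(giving + 1); the +1 keeps zero-dollar records defined."""
    giving = np.asarray(giving, dtype=float)
    return np.log1p(giving)  # natural log of (1 + giving)

def skewness(x):
    """Sample skewness: about 0 if symmetric, positive for a long right tail."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Made-up lifetime giving totals (illustrative only, not real donor data)
rng = np.random.default_rng(0)
lifetime_giving = np.round(rng.lognormal(mean=5.0, sigma=1.5, size=5000), 2)

raw_skew = skewness(lifetime_giving)
log_skew = skewness(log_giving(lifetime_giving))
print(f"skewness of raw giving: {raw_skew:.2f}")
print(f"skewness of log giving: {log_skew:.2f}")
```

On simulated data like this, the raw variable should come out heavily right-skewed and the logged variable close to symmetric. Real giving data will rarely be this tidy.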
For the sake of this demonstration, I have left out all the individuals who have no giving. All those zero values would mess with the distribution, and the effect of the transformation would not be as evident in my chart. In the real world, of course, we include the non-donors. The resulting DV is far from ideal, but again, it’s a big improvement over the untransformed variable.
Our goal in transforming variables is not to make them prettier and more symmetrical, but to make the relationship between variables more linear. Ultimately we want to produce a regression equation which “both characterizes the data and meets the conditions required for accurate statistical inference” (to quote Jacob Cohen et al., from the excellent text “Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences,” page 233).
Linear relationships that are not evident when we use the untransformed ‘Lifetime Giving’ may become detectable after transformation. So, in short, we transform variables in hopes of improving the overall model, which after all is a linear model.
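To make the point about linearity concrete, here is a small simulated sketch in Python. Everything in it is an assumption for illustration: the predictor (‘years since graduation’), the coefficients, and the data are invented, and the data are deliberately built so that giving grows multiplicatively with the predictor. The Pearson correlation measures the strength of a linear relationship, so comparing it before and after the transformation shows how much of the relationship a linear model can pick up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical predictor: years since graduation, between 0 and 40
years_since_grad = rng.uniform(0, 40, size=n)

# Simulated giving that grows multiplicatively with the predictor
# (the coefficients here are arbitrary, chosen only for illustration)
noise = rng.normal(0.0, 1.0, size=n)
lifetime_giving = np.exp(2.0 + 0.1 * years_since_grad + noise)

# Pearson correlation with the raw DV and with the logged DV
r_raw = np.corrcoef(years_since_grad, lifetime_giving)[0, 1]
r_log = np.corrcoef(years_since_grad, np.log1p(lifetime_giving))[0, 1]

print(f"correlation with raw giving: {r_raw:.2f}")
print(f"correlation with log giving: {r_log:.2f}")
```

In this setup the correlation with the logged variable should be noticeably stronger than with the raw one. The predictor has not changed; only the form of the DV has, and that is enough to make the relationship look more like a straight line.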