I have written very little about model validation on CoolData, probably because I’ve always had conflicting thoughts about it.
Model validation is important for avoiding serious error when predicting behaviours that are relatively rare, such as major or planned giving. But if the event is rare, the data is sparse. When I need to train the data on a relatively small number of cases, I am loathe to rob my model of half the cases I need for training.
In an annual fund model it’s no big deal. If the predicted value is participation in the fund, that’s not a rare event and there is plenty of data. There’s no angst about going the prescribed route: splitting the file into two random halves before modeling begins — a training sample used to calculate the scores, and a validation sample on which to test the validity of those scores. The irony is, most annual fund models are robust and hardly require validation.
It’s exactly when doing the tricky modeling projects (major giving, planned giving) that validation is most important. If a major-giving model proves valid, that’s great — but it might have been even stronger had I kept the data file intact. If it proves faulty, how much of that can be blamed on the fact that half my cases are unavailable?
Validation of a model has very little to do with R squared, by the way. That statistic measures only how well a multiple linear regression model fits the data set you’re working with. It doesn’t tell you how well it will perform for prediction. A very high (60% or more) or very low (10% or less) value for R squared signals trouble with the design of your model; most models will fall into that safe zone, regardless of their usefulness for prediction.
My thinking is this. Validating a model has one of two results: Either you accept the model, or you throw it out. Unless you’ve got other scoring or segmentation tricks up your sleeve, the alternative to having a predictive model is using nothing at all. How often does implementing a model prove worse than having no model? I am willing to wager: almost never. Even a score set that contains a certain amount of random noise is an improvement over not using any score. If you’re going to build a single model and use it come what may, flaws and all, why bother to validate it? Not to mention that there is no clear dividing line between a valid model and an invalid one.
The key is that I do not create one model, but several. Creating more than one model gives me options. I no longer have to compare the validity of my model against some arbitrary rule of thumb — I can compare multiple models against each other and choose the best one. Instead of splitting my data file in half and losing half of my precious training cases, I set aside a small but reasonably representative random sample of cases. This sample is held out of the model, but also scored, allowing me to compare the relative success of each model.
The benefits are twofold: One, I don’t lose nearly as many cases to the validation set, and two, I can still get some idea about how good the model is. If all the models are lousy, it’s going to show up in how the holdout cases are distributed by score.
Here is an example. Recently I needed to develop a model to predict propensity to give at higher levels. The table below summarizes how I set out to build two very different models.
Without going into too much detail, you can see that I could have chosen to build a number of different models by mixing and matching the criteria available to me:
- binary logistic regression or multiple linear regression?
- include everyone in the model (even non-donors), or a narrowed-down selection of individuals?
- binary dependent variable (set to represent any level of giving), or a continuous dependent variable (LT giving)?
- giving-related variables included or excluded?
However, in the interest of time and other practical concerns I chose only two contrasting options. As well, some choices are made for me: If I want to score non-donors in my model, I am obliged to leave out all giving-related variables, for example. (I’ve written plenty about these scenarios recently, so I won’t get into that again.)
The one thing held constant between the models I built was the holdout sample. I chose 10 current major donors at random and held them out of both models. The models were innocent of these donors’ status, but used their characteristics to assign scores to them anyway. So whichever model did a better job of giving high scores to this sample was the one I chose to use in our program.
Only ten individuals? Yes, that’s a tiny holdout sample, but when you’re trying to model for propensity to give at an exclusive level, you need to conserve your precious training cases. As it turned out, ten holdout cases was enough to reveal differences in reliability. I was surprised at which model ended up winning the trophy — but that’s a post for another time.