A few months ago I blogged about survey data (specifically Likert scale data), the inevitable problem of missing data, and a few simple ideas for smoothing over that problem before using the data in a regression model (“Surveys and missing data”). I recently had a great email from Tom Knapp, Professor Emeritus, University of Rochester and The Ohio State University (bio), who pointed out that I left out “one popular but controversial option”: substituting for a missing response the mean of that person’s other responses.
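To make the idea concrete, here is a minimal sketch of same-person mean substitution in plain Python. The function name impute_person_mean and the sample answers are my own inventions for illustration, and I am assuming skipped items are coded as None.

```python
from statistics import mean

def impute_person_mean(responses):
    """Replace each None in one respondent's answers with the mean
    of that same respondent's answered items."""
    answered = [r for r in responses if r is not None]
    if not answered:
        # Nothing to base a mean on, so leave the row untouched.
        return list(responses)
    person_mean = mean(answered)
    return [person_mean if r is None else r for r in responses]

# Hypothetical respondent who skipped two of five Likert items.
respondent = [4, 5, None, 4, None]
print(impute_person_mean(respondent))
```

The filled-in values will usually not be whole numbers. Whether to round them back to a valid scale point is a judgment call; for a purely predictive model I would be inclined to leave them unrounded.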
He writes: “The main reason it’s controversial is that it tends to artificially increase the internal-consistency reliability of the measuring instrument — it makes the items look like they hang together better than they do.”
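If you want to see how much this matters for your own instrument, one rough check is to compute Cronbach's alpha (a common internal-consistency measure) on the complete cases only, and again after same-person mean substitution. The sketch below uses the standard alpha formula; the function names and the tiny data set are made up for illustration, and it assumes every respondent answered at least one item.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for complete rows (one list of item scores per person)."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def person_mean_fill(row):
    """Fill None values with the mean of the person's answered items."""
    answered = [v for v in row if v is not None]
    m = sum(answered) / len(answered)
    return [m if v is None else v for v in row]

# Invented responses to a four-item scale; None marks a skipped item.
survey = [
    [5, 4, 5, 4],
    [2, 1, 2, 2],
    [4, 4, None, 5],
    [1, None, 2, 1],
    [3, 3, 4, 3],
    [5, None, None, 4],
]

complete = [r for r in survey if None not in r]
filled = [person_mean_fill(r) for r in survey]

print("alpha, complete cases only:", cronbach_alpha(complete))
print("alpha, after person-mean fill:", cronbach_alpha(filled))
```

The two figures are not strictly comparable, since they are computed on different sets of respondents, but a noticeable jump after filling would be consistent with the inflation Prof. Knapp describes.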
(For our purposes, which are purely predictive, I think this approach is justified. When the alternative is dropping the case from the analysis, we should not feel we’re messing with the data if we look for other methods that let us keep the case and the potentially significant predictors it carries.)
Professor Knapp is not a fan of any imputation methods that are based on “missing at random” or “missing completely at random” assumptions. “People don’t flip coins in order to decide whether or not to respond to an item,” he writes.
(I wonder, though, if there’s any harm in using these methods if the number of missing answers is low, and there is no apparent pattern to the missingness?)
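A quick way to look at how much is missing, and where, is to tabulate the share of skipped answers per question and the number of skips per respondent. This is only a descriptive sketch with hypothetical function names and invented data; a flat-looking table does not prove the answers are missing at random.

```python
def missing_rate_by_item(rows):
    """Fraction of respondents who skipped each item (None = skipped)."""
    n = len(rows)
    k = len(rows[0])
    return [sum(1 for row in rows if row[i] is None) / n for i in range(k)]

def missing_count_by_person(rows):
    """Number of skipped items for each respondent."""
    return [sum(1 for v in row if v is None) for row in rows]

# Invented responses to a three-item survey.
rows = [
    [4, None, 5],
    [3, 3, 3],
    [None, None, 2],
    [5, 4, 4],
]

print(missing_rate_by_item(rows))
print(missing_count_by_person(rows))
```

If one question accounts for most of the skips, or a handful of respondents account for most of them, that is the kind of pattern that should give us pause.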
And finally, Prof. Knapp offered this very helpful paper written by Tom Breur, Missing Data and What to Do About It, available as a free download here. It’s a couple of shades more technical than my previous discussion but it’s an excellent overview.
Postscript: Today (18 May 2010), as I am replacing missing values in Likert scale questions in some survey data I am analyzing, I realize one strength of substituting the mean of a person’s other responses. If we filled the gaps with the average response of all respondents instead, people who skip a lot of questions and give negative responses to the ones they do answer would end up resembling the rest of the survey population, who may have given very positive responses. Same-person substitution prevents that. It seems more reasonable to assume that a person’s responses, had they bothered to offer them, would have been consistent with the responses they did offer than to assume they would have been consistent with how the survey population in general responded. So perhaps incorporating survey data into a regression model requires two stages of missing-value replacement: 1) same-person mean response substitution (à la Prof. Knapp, above) for respondents who failed to answer all the questions, and 2) mean-value substitution for the rest of the population, who did not participate in the survey or were not invited. If that sounds like a lot of bother, yes, it is. But knowing how darned predictive survey data is, I would be willing to go to the extra trouble to get it right.
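The two stages might be sketched like this in plain Python. Everything here is an assumption for illustration: the function name two_stage_fill, the IDs, the use of None for both skipped items and non-participants, and my choice to compute the stage-two item means from the stage-one filled data rather than from raw answers only. It also assumes at least one person answered at least one question.

```python
from statistics import mean

def two_stage_fill(population):
    """population maps a person ID to a list of item scores (None = skipped),
    or to None for people who never took the survey."""
    n_items = next(len(r) for r in population.values() if r is not None)

    # Stage 1: same-person mean for anyone who answered at least one item.
    stage1 = {}
    for pid, resp in population.items():
        answered = [v for v in (resp or []) if v is not None]
        if answered:
            m = mean(answered)
            stage1[pid] = [m if v is None else v for v in resp]

    # Stage 2: item means across stage-1 respondents for everyone else
    # (non-participants, plus anyone who skipped every question).
    item_means = [mean(r[i] for r in stage1.values()) for i in range(n_items)]
    return {pid: stage1.get(pid, list(item_means)) for pid in population}

# Invented example: two respondents and one person who was never surveyed.
population = {
    "A001": [4, None, 5],
    "A002": [2, 2, 1],
    "A003": None,
}
for pid, scores in two_stage_fill(population).items():
    print(pid, scores)
```

In a real model I would probably also add a flag variable recording who actually participated, so the regression can tell real answers from filled-in ones.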