CoolData blog

7 May 2010

More on surveys and missing data

Filed under: Statistics, Surveying — kevinmacdonell @ 8:35 am

A few months ago I blogged about survey data (specifically Likert scale data), the inevitable problem of missing data, and a few simple ideas for smoothing over that problem before using the data in a regression model (see “Surveys and missing data”). I recently had a great email from Tom Knapp, Professor Emeritus, University of Rochester and The Ohio State University, who pointed out that I left out “one popular but controversial option”: substitute for a missing response the mean of that person’s other responses.

He writes: “The main reason it’s controversial is that it tends to artificially increase the internal-consistency reliability of the measuring instrument — it makes the items look like they hang together better than they do.”

(For our purposes, which are purely predictive, I think this approach is justified. When the alternative is dropping the case from the analysis altogether, we should not feel we’re tampering with the data by using a substitution method that keeps potentially significant predictors in play.)
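To make the idea concrete, here is a minimal sketch of same-person mean substitution in plain Python. The function name and the sample responses are my own illustration, not anything from Prof. Knapp's email, and I am assuming skipped items are recorded as None.

```python
from statistics import mean

def fill_with_person_mean(responses):
    """Replace skipped items (None) with the mean of this person's answered items."""
    answered = [r for r in responses if r is not None]
    if not answered:
        # Nothing to base a substitute on; leave the row untouched.
        return list(responses)
    person_mean = mean(answered)
    return [person_mean if r is None else r for r in responses]

# Hypothetical respondent on a 1-5 Likert scale who skipped the second item.
respondent = [4, None, 5, 3]
filled = fill_with_person_mean(respondent)
print(filled)
```

Note that the substituted value can be fractional even though real Likert answers are whole numbers. For a predictive model that is usually harmless, but rounding is an option if the scores need to look like real responses.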

Professor Knapp is not a fan of any imputation methods that are based on “missing at random” or “missing completely at random” assumptions. “People don’t flip coins in order to decide whether or not to respond to an item,” he writes.

(I wonder, though, if there’s any harm in using these methods if the number of missing answers is low, and there is no apparent pattern to the missingness?)
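One rough way to look for an "apparent pattern" is to check how often each item is skipped and whether people who skip items answer differently from people who don't. The sketch below uses made-up data and hypothetical function names. It is an eyeball check under those assumptions, not a formal test of MAR or MCAR.

```python
from statistics import mean

# Hypothetical Likert responses (1-5); None marks a skipped item.
survey = {
    "A": [5, 4, 5, 4],
    "B": [4, 5, None, 4],
    "C": [2, None, 1, None],
    "D": [4, 4, 5, 5],
}

def missing_rate_by_item(rows):
    """Share of respondents who skipped each item."""
    n_items = len(next(iter(rows.values())))
    return [
        sum(r[i] is None for r in rows.values()) / len(rows)
        for i in range(n_items)
    ]

def mean_answer_by_skipping(rows):
    """Average answered score for people who skipped something vs. those who did not."""
    skippers, completers = [], []
    for r in rows.values():
        answered = [x for x in r if x is not None]
        if not answered:
            continue  # no answers at all, so nothing to average
        group = skippers if len(answered) < len(r) else completers
        group.append(mean(answered))
    return (
        mean(skippers) if skippers else None,
        mean(completers) if completers else None,
    )

rates = missing_rate_by_item(survey)
skip_avg, complete_avg = mean_answer_by_skipping(survey)
print(rates, skip_avg, complete_avg)
```

If the skippers' average is well below (or above) the completers' average, as it is in this toy data, that hints the missingness is not random and a population-wide mean would misrepresent them.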

And finally, Prof. Knapp offered this very helpful paper written by Tom Breur, “Missing Data and What to Do About It,” available as a free download. It’s a couple of shades more technical than my previous discussion, but it’s an excellent overview.

Postscript: Today (18 May 2010), as I am replacing missing values in Likert scale questions in some survey data I am analyzing, I realize one strength of substituting the mean of a person’s other responses: it ensures that people who skip a lot of questions and give negative responses to the ones they do answer won’t end up resembling the rest of the survey population, who may have given very positive responses. It seems more reasonable to assume that a person’s missing responses, had they bothered to offer them, would have been consistent with the responses they did offer than to assume they would be consistent with how the survey population in general responded.

So perhaps incorporating survey data into a regression model requires two stages of missing-value replacement: 1) same-person mean response substitution (à la Prof. Knapp, above) for respondents who failed to answer all the questions, and 2) mean-value substitution for all the rest of the population who did not participate in the survey or were not invited. If that sounds like a lot of bother, yes, it is. But knowing how darned predictive survey data is, I would be willing to go to the extra trouble to get it right.
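Here is a minimal sketch of that two-stage replacement in plain Python. The function name, the data layout (a dict of ID to responses, with None for non-participants and for skipped items), and the choice to compute the stage-two item means from the stage-one filled rows are all my assumptions for illustration. It also assumes at least one person answered at least one question.

```python
from statistics import mean

def two_stage_fill(population, n_items):
    """Stage 1: same-person mean for partial respondents.
    Stage 2: per-item respondent mean for everyone with no survey data."""
    stage1 = {}
    for pid, responses in population.items():
        if responses is None:
            continue  # did not participate; handled in stage 2
        answered = [x for x in responses if x is not None]
        if not answered:
            continue  # answered nothing; treated like a non-participant
        person_mean = mean(answered)
        stage1[pid] = [person_mean if x is None else x for x in responses]

    # Item means across respondents, computed after stage 1 (an assumption).
    item_means = [
        mean([row[i] for row in stage1.values()]) for i in range(n_items)
    ]
    return {pid: stage1.get(pid, list(item_means)) for pid in population}

# Hypothetical population: two partial respondents and one non-participant.
population = {
    "p1": [5, 4, None],
    "p2": [1, None, 2],
    "p3": None,
}
filled = two_stage_fill(population, n_items=3)
print(filled)
```

In this toy example the negative respondent p2 keeps a low substituted score (1.5) instead of being pulled up toward the group, while the non-participant p3 gets the neutral item means.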


  1. Great post Kevin! Are you in agreement with Dr. Knapp that things like Multiple Imputation are the wrong way to go?

    Comment by Stats Make Me Cry Guy (Jeremy) — 7 May 2010 @ 9:22 am

    • Jeremy, I can’t say I fall on one side or the other. I think it very much depends on the characteristics of the data set you’re working with, and the size and nature of the missing data. Plus, I have little to say about multiple imputation because up to this point I have not used those techniques. I focus more on the methods that are fairly straightforward to do and explain, and which I have actually used. At some future time I will know enough to have an opinion.

      Comment by kevinmacdonell — 7 May 2010 @ 9:45 am

  2. Hi,
    I favour (strongly) multiple imputation, as it’s the only way I know to properly estimate what you want to estimate, without making the very strong MAR and MCAR assumptions. As you rightly say, real people do not toss coins to decide whether or not to answer a question.
    It’s tougher to do, but worth the effort, in my experience.

    Comment by Anthony Staines — 10 June 2010 @ 5:57 pm

  3. […] was talking about survey data, but the idea is exactly the same. (See Surveys and missing data and More on surveys and missing data.) Essentially, the simpler techniques for imputing missing data involve substituting average values […]

    Pingback by New tricks for old data « CoolData blog — 30 August 2010 @ 7:38 am
