CoolData blog

7 May 2010

More on surveys and missing data

Filed under: Statistics, Surveying | kevinmacdonell @ 8:35 am

A few months ago I blogged about survey data (specifically Likert scale data), the inevitable problem of missing data, and a few simple ideas for how to smooth over that problem prior to using the data in a regression model. (“Surveys and missing data.”) I recently had a great email from Tom Knapp, Professor Emeritus, University of Rochester and The Ohio State University, who pointed out that I left out “one popular but controversial option”: Substitute for a missing response the mean of that person’s other responses.

He writes: “The main reason it’s controversial is that it tends to artificially increase the internal-consistency reliability of the measuring instrument — it makes the items look like they hang together better than they do.”

(For our purposes, which are purely predictive, I think this approach is justified. When the alternative is dropping the case from the analysis, we should not feel we’re messing with the data if we seek other methods that offer significant predictors.)

Professor Knapp is not a fan of any imputation methods that are based on “missing at random” or “missing completely at random” assumptions. “People don’t flip coins in order to decide whether or not to respond to an item,” he writes.

(I wonder, though, if there’s any harm in using these methods if the number of missing answers is low, and there is no apparent pattern to the missingness?)

And finally, Prof. Knapp offered this very helpful paper written by Tom Breur, “Missing Data and What to Do About It,” available as a free download. It’s a couple of shades more technical than my previous discussion but it’s an excellent overview.

Postscript: Today (18 May 2010) as I am engaged in replacing missing values in Likert scale questions in some survey data I am analyzing, I realize one strength of substituting the mean of a person’s other responses: It ensures that people who skip a lot of questions and give negative responses to the ones they do answer won’t end up resembling the rest of the survey population, who may have given very positive responses. It might be more reasonable to assume that a person’s responses, had they bothered to offer them, would have been consistent with the responses they did offer – more reasonable, surely, than assuming they would be consistent with how the survey population in general responded. So perhaps incorporating survey data into a regression model requires two stages of missing-value replacement: 1) same-person mean response substitution (a la Prof. Knapp, above) for respondents who failed to answer all the questions, and 2) mean-value substitution for all the rest of the population who did not participate in the survey or were not invited. If that sounds like a lot of bother, yes it is. But knowing how darned predictive survey data is, I would be willing to go to the extra trouble to get it right.
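The two stages described in the postscript can be sketched roughly as follows. This is only an illustration in plain Python: the people, the answers, and the variable names are all invented, and two choices are assumptions rather than anything stated above. First, a respondent who skipped every question is treated like a non-participant. Second, the stage-two means are computed after stage one has filled in each respondent's gaps.

```python
from statistics import mean

# Hypothetical Likert answers per person. None inside a list marks a skipped
# question; a person mapped to None did not take the survey or was not invited.
survey = {
    "A": [5, 6, 5],
    "B": [2, None, 1],   # skipped one question, answered the rest negatively
    "C": [None, 4, 5],
    "D": None,
    "E": None,
}
N_QUESTIONS = 3

# Stage 1: fill each respondent's skipped questions with that person's own mean.
stage1 = {}
for person, answers in survey.items():
    # Assumption: someone with no answers at all is left for stage 2.
    if answers is None or all(a is None for a in answers):
        continue
    own_mean = mean(a for a in answers if a is not None)
    stage1[person] = [own_mean if a is None else a for a in answers]

# Stage 2: give everyone else the per-question mean across respondents.
# Assumption: the means are taken over the stage-1 (already filled) answers.
question_means = [
    mean(row[q] for row in stage1.values()) for q in range(N_QUESTIONS)
]
completed = {
    person: stage1.get(person, list(question_means)) for person in survey
}

for person, row in completed.items():
    print(person, row)
```

In this made-up data, person B's gap is filled with B's own low average rather than the higher average of the other respondents, which is the point of doing stage one first.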

1 February 2010

Surveys and missing data

Filed under: Model building, Pitfalls, Surveying | kevinmacdonell @ 9:24 pm

Survey data is valuable for predicting mass behaviours. Inevitably, though, some pieces will be missing. (Photo by Photo Monkey, used under a Creative Commons license.)

In a previous post I talked about the great predictive power of survey responses. Today I’ll explain what to do about one of the roadblocks you’ll encounter – missing data.

The problem of missing data is a big issue in statistics, and a number of techniques are available for dealing with it. The ideas I offer here may or may not meet the high standards of a statistician, but they do offer some more or less reasonable solutions.

Let’s say you’re interested in creating only one variable from the survey, an indicator variable which records whether a person participated or not. The mere presence or absence of this data point will probably be predictive. You could code all responders as ‘1’, and everyone else as ‘0’. This would work well if a large portion of your sample received an invite.

Alternatively, you could code responders as ‘1’, non-responders as ‘-1’, and put the “uninvited” (everyone who didn’t receive an invitation) into the neutral middle zone by giving them a zero. To avoid negative numbers, just add one to each of these values, so that non-responders become ‘0’, the uninvited ‘1’, and responders ‘2’. It may seem strange to award the uninvited a ‘1’, but what you’re trying to do here is see if they differ from the people who actually had a chance to participate and chose not to: an action taken, in a negative direction.

Use only one or the other of these two variables, whichever one ‘works’. Test both against mean and median lifetime giving (or whatever your dependent variable is). If the three-level variable shows a nice linear relationship with your DV – with low giving for the non-responders, higher giving for the uninvited, and highest giving for the responders – then use that variable in your regression.
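As a rough illustration of the two codings and the comparison against giving, here is a minimal Python sketch. The status labels, the record layout, and every dollar figure are invented for the example; none of it comes from real data.

```python
from statistics import mean, median

# Hypothetical records: (survey status, lifetime giving). All values are made up.
records = [
    ("responded", 500.0), ("responded", 1200.0), ("responded", 250.0),
    ("uninvited", 100.0), ("uninvited", 400.0),
    ("no_response", 0.0), ("no_response", 50.0), ("no_response", 25.0),
]

# Option 1: two-level indicator (responder = 1, everyone else = 0).
TWO_LEVEL = {"responded": 1, "uninvited": 0, "no_response": 0}

# Option 2: three-level code (-1, 0, 1), shifted up by one to avoid negatives.
THREE_LEVEL = {"no_response": 0, "uninvited": 1, "responded": 2}


def giving_by_level(rows, coding):
    """Group lifetime giving by coded level and summarise each group."""
    groups = {}
    for status, giving in rows:
        groups.setdefault(coding[status], []).append(giving)
    return {
        level: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for level, vals in sorted(groups.items())
    }


two_level_summary = giving_by_level(records, TWO_LEVEL)
three_level_summary = giving_by_level(records, THREE_LEVEL)

for level, stats in three_level_summary.items():
    print(level, stats)
```

If mean and median giving climb steadily from level 0 to level 2, as they happen to in this invented data, the three-level variable is the one worth testing in the regression.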

Statistically sound? Perhaps not. But if the alternative is tossing out a potentially valuable predictor, I don’t see the harm.

That covers missing data for the simple fact of participation / non-participation in a survey. You can go much deeper than that. A typical survey of alumni will yield many potential predictor variables. If your survey is getting at attitudes about your institution, or about giving or volunteering, or attending events, responses to individual questions can be powerfully predictive. Again, if you’re using regression, all you need to do is find a logical way to re-express the response as a number.

For example, you can recode yes/no questions as 1/0 indicator variables. If the responses to a question are categorical in nature (for a question such as, “What is your mother’s nationality?”), you may wish to test indicator variables for the various responses, and along with that, an indicator variable for “Did not answer the question.” In such a case, missing data may have its own underlying pattern (i.e. it is non-random), and may correlate with the value you’re trying to predict.
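Here is one way the indicator variables for a categorical question might be built, including the extra indicator for a skipped question. It is a sketch only: the answers and the `did_not_answer` label are hypothetical.

```python
# Hypothetical answers to one categorical question; None marks a skipped question.
answers = ["Canadian", "Scottish", None, "Canadian", "Irish", None]


def indicator_columns(values, missing_label="did_not_answer"):
    """Return one 0/1 list per category, plus one for missing answers."""
    labels = [missing_label if v is None else v for v in values]
    categories = sorted(set(labels))
    return {
        cat: [1 if lab == cat else 0 for lab in labels] for cat in categories
    }


columns = indicator_columns(answers)
for name, column in columns.items():
    print(name, column)
```

Each person gets a 1 in exactly one column, so the missing-answer indicator can be tested against the dependent variable like any other.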

Sometimes the data is already expressed as a number. Surveys often use Likert scale questions, in which responders are asked to rate their level of agreement with a statement. Typical responses might range from 1 (“strongly disagree”) to 5 or 6 (“strongly agree”). Likert scales are, strictly speaking, ordinal in nature, not continuous: There’s no reason to believe that the “distance” between ‘1’ and ‘2’ is the same as the “distance” between ‘2’ and ‘3’. However, I accept them as a logical ranking, perfectly suitable for use as an independent variable in a regression.

(Incidentally, a survey designer will use an even-numbered scale if he wants to disallow neutral responses, a practice which gets an answer but sometimes causes frustrated survey-takers to skip questions.)

So: rich numerical data, ready to plug into our analysis, but only one problem: Missing data. Some people skip questions, some fail to take the survey, others were never invited in the first place. This time, you can’t just plug a zero into every empty space. Innocent non-responders would come off as a very negative bunch, completely throwing off your predictor. But again, if you don’t have some type of number present for that variable for all cases in your dataset, they’ll be excluded from the regression analysis. At the risk of oversimplifying, I would say you’ve got three options, with three levels of sophistication:

  1. Substitution of a neutral value.
  2. Substitution of a mean value.
  3. Imputation.

1. Neutral-value substitution. This is the method I used the first time I incorporated a lot of Likert-scale type data into a predictive model. It was very simple. Every person with a missing value for a given question received a value falling perfectly halfway between “strongly disagree” and “strongly agree.” For a scale of 1 to 6, that value is 3.5. Of course, ‘3.5’ was not a possible choice on the survey itself, which forced respondents to commit to a slightly negative or positive response, but that didn’t mean I couldn’t use the middle value to replace our unknown values.

There is one problem with this method, though … if you think about it, what’s so ‘neutral’ about 3.5? If you took all the actual responses and calculated the average, it might be significantly higher or lower than 3.5. Let’s say actual respondents had an average response of 5 for a particular question. If we code everyone else as 3.5, that’s characterizing them as negative, in relation to the respondents. We may have no basis for doing so.
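A small Python sketch makes the gap concrete. The scale endpoints match the 1-to-6 example; the individual answers are invented so that the respondents average 5.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 6  # the 1-to-6 scale from the example above

# Hypothetical answers to one question; None marks a missing answer.
responses = [5, 6, None, 4, 5, None, 5]

# Neutral-value substitution: the midpoint of the scale.
neutral = (SCALE_MIN + SCALE_MAX) / 2
filled = [neutral if r is None else r for r in responses]

# Compare the fill value with what actual respondents said on average.
observed_mean = mean(r for r in responses if r is not None)
gap = observed_mean - neutral
print(neutral, observed_mean, gap)
```

A positive `gap` means the "neutral" fill value sits below the typical real answer, so everyone with a missing value is being painted as more negative than the respondents.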

2. Mean-value substitution. The problem I describe above can be addressed by mean-value substitution, which is the method I perhaps should have used. If the average response for your actual respondents is 2.67, then substitute 2.67 for all your missing values. If it’s 5, use 5. (If your response data is not Likert-scale in nature, but rather contains extreme values, use the median value for the variable rather than the average value.)
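Mean-value substitution, with the median as the fallback for variables that contain extreme values, might look something like this in Python. The function name `fill_missing` and both sample lists are hypothetical.

```python
from statistics import mean, median


def fill_missing(values, use_median=False):
    """Replace None with the mean (or median) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if use_median else mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical Likert answers (1 to 6); None marks a missing answer.
likert = [5, 6, None, 4, 5, None, 5]
likert_filled = fill_missing(likert)

# Hypothetical non-Likert variable with one extreme value.
event_counts = [1, 2, None, 3, 40]
counts_filled = fill_missing(event_counts, use_median=True)

print(likert_filled)
print(counts_filled)
```

In the second list the single value of 40 would drag the mean far above what most people reported, which is why the median is the safer fill there.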

3. Imputation. This term is used to describe a variety of related methods for guessing the “most likely” missing value based on the values found in other variables. These methods include some advanced options made available in software such as SAS and SPSS.

The third option may be regarded as the best from a statistical point of view. Alas, I have not used these more advanced techniques. I can only speak from my experience with the first two. For now at least, I accept the drawbacks of substituting the population mean for missing data (one drawback being a gross underestimation of variance), in order for me to quickly and easily tap the power of survey data in my models.
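The variance drawback is easy to see with a few made-up numbers. In this sketch, the answers and the count of missing cases are both invented; the point is only the direction of the change.

```python
from statistics import mean, pvariance

# Hypothetical answers from the people who did respond.
observed = [2, 3, 4, 5, 6]
n_missing = 5  # hypothetical number of people with no answer

# Mean-value substitution: every missing case gets the respondents' mean.
fill = mean(observed)
after_fill = observed + [fill] * n_missing

var_before = pvariance(observed)
var_after = pvariance(after_fill)
print(var_before, var_after)
```

Every filled-in case sits exactly on the mean and so adds nothing to the spread, while the number of cases grows. The more missing values you replace, the more the variance shrinks.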

What would YOU do?
