Survey data is valuable for predicting mass behaviours. Inevitably, though, some pieces will be missing. (Photo by Photo Monkey, used by Creative Commons license. Click photo for more.)

In a previous post I talked about the great predictive power of survey responses. Today I’ll explain what to do about one of the roadblocks you’ll encounter – **missing data**.

The problem of missing data is a big issue in statistics, and a number of techniques are available for dealing with it. The ideas I offer here may or may not meet the high standards of a statistician, but they do offer some more or less reasonable solutions.

Let’s say you’re interested in creating only one variable from the survey, an **indicator variable** which records whether a person participated or not. The mere presence or absence of this data point will probably be predictive. You could code all **responders **as ‘1’, and **everyone else** as ‘0’. This would work well if a large portion of your sample received an invite.

Alternately, you could code responders as ‘1’, non-responders as ‘-1’, and put the **“uninvited,”** everyone who didn’t receive an invitation, into the neutral middle zone by giving them a zero. To avoid negative numbers, just add one to each of these values; it may seem strange to reward the uninvited a ‘1’, but what you’re trying to do here is see if they differ from the people who actually had a chance to participate and chose not to: An action taken, in a negative direction.

Use only one or the other of these two variables, whichever one ‘works’. Test both against mean and median lifetime giving (or whatever your dependent variable is). If the three-level variable shows a nice** linear relationship** with your DV – with low giving for the non-responders, higher giving for the uninvited, and highest giving for the responders – then use that variable in your regression.

Statistically sound? Perhaps not. But if the alternative is tossing out a potentially valuable predictor, I don’t see the harm.

That covers missing data for the simple fact of participation / non-participation in a survey. You can go much deeper than that. A typical survey of alumni will yield many potential predictor variables. If your survey is getting at attitudes about your institution, or about giving or volunteering, or attending events, responses to **individual questions** can be powerfully predictive. Again, if you’re using regression, all you need to do is find a logical way to re-express the response as a **number**.

For example, you can recode yes/no questions as 1/0 indicator variables. If the responses to a question are **categorical **in nature (for a question such as, “What is your mother’s nationality?”), you may wish to test indicator variables for the various responses, and along with that, an indicator variable for “Did not answer the question.” In such a case, missing data may have its own underlying pattern (i.e. it is non-random), and may correlate with the value you’re trying to predict.

Sometimes the data is already expressed as a number. Surveys often use **Likert scale** questions, in which responders are asked to rate their level of agreement to a statement. Typical responses might range from 1 (“strongly disagree”) to 5 or 6 (“strongly agree”). Likert scales are, strictly speaking, ordinal in nature, not continuous: There’s no reason to believe that the “distance” between ‘1’ and ‘2’ is the same as the “distance” between ‘2’ and ‘3’. However, I accept them as a logical ranking, perfectly suitable for a regression independent variable.

(Incidentally, a survey designer will use an even-numbered scale if he wants to disallow neutral responses, a practice which gets an answer but sometimes causes frustrated survey-takers to skip questions.)

So: rich numerical data, ready to plug into our analysis, but only one problem: **Missing data**. Some people skip questions, some fail to take the survey, others were never invited in the first place. This time, you can’t just plug a zero into every empty space. Innocent non-responders would come off as a very negative bunch, completely throwing off your predictor. But again, if you don’t have some type of number present for that variable for all cases in your dataset, they’ll be excluded from the regression analysis. At the risk of oversimplifying, I would say you’ve got **three options**, with three levels of sophistication:

- Substitution of a neutral value.
- Substitution of a mean value.
- Imputation.

**1. Neutral-value substitution**. This is the method I used the first time I incorporated a lot of Likert-scale type data into a predictive model. It was very simple. Every person with a missing value for a given question received a value falling perfectly halfway between “strongly disagree” and “strongly agree.” For a scale of 1 to 6, that value is 3.5. Of course, ‘3.5’ was not a possible choice on the survey itself, which forced respondents to commit to a slightly negative or positive response, but that didn’t mean I couldn’t use the middle value to replace our unknown values.

There is one problem with this method, though … if you think about it, **what’s so ‘neutral’ about 3.5**? If you took all the actual responses and calculated the average, it might be significantly higher or lower than 3.5. Let’s say actual respondents had an average response of 5 for a particular question. If we code everyone else as 3.5, that’s characterizing them as **negative**, in relation to the respondents. We may have no basis for doing so.

**2. Mean-value substitution**. The problem I describe above can be addressed by mean-value substitution, which is the method I perhaps **should **have used. If the average response for your actual respondents is 2.67, then substitute 2.67 for all your missing values. If it’s 5, use 5. (If your response data is not Likert-scale in nature, but rather contains extreme values, use the median value for the variable rather than the average value.)

**3. Imputation**. This term is used to describe a variety of related methods for guessing the “most likely” missing value based on the values found in other variables. These methods include some advanced options made available in software such as SAS and SPSS.

The third option may be regarded as the best from a statistical point of view. Alas, I have not used these more advanced techniques. I can only speak from my experience with the first two. For now at least, I accept the **drawbacks **of substituting the population mean for missing data (one drawback being a gross underestimation of variance), in order for me to quickly and easily tap the power of survey data in my models.

What would **YOU **do?