CoolData blog

31 January 2010

Using survey data in regression models

Filed under: Predictor variables, Surveying. Posted by kevinmacdonell @ 9:36 pm

Surveys can provide a rich load of fresh data you can incorporate into your models. The very act of agreeing to participate in a survey is a trait likely to be highly predictive, regardless of the model you’re building.

If you work at a university, be attuned to people at your institution who might be surveying large numbers of alumni, and encourage them to make their surveys non-anonymous. They’ll get much richer possibilities for analysis if they can relate responses to demographic information in the database (class year, for example). Remind them that people aren’t necessarily put off by non-anonymous surveys; if they were, restaurants, retailers and other private-sector corporations wouldn’t bother with all the customer-satisfaction surveying that they do. Non-anonymity is a basic requirement for data mining: If you don’t know who’s giving the answers, you’ve got nothing.
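As a minimal sketch of what non-anonymity buys you, the Python below joins survey responses to database records on a shared ID and derives a yes/no participation flag of the kind you could later test as a predictor. The field names (`id`, `class_year`, `satisfaction`) and the sample rows are hypothetical, made up purely for illustration.

```python
# Hypothetical alumni records keyed by database ID (illustrative data only).
alumni = [
    {"id": 1001, "class_year": 1990},
    {"id": 1002, "class_year": 1985},
    {"id": 1003, "class_year": 2001},
]

# Hypothetical survey responses; each one carries the respondent's ID.
responses = [
    {"id": 1001, "satisfaction": 4},
    {"id": 1003, "satisfaction": 5},
]

def attach_responses(alumni, responses):
    """Return one row per alum with a participation flag and any answer given."""
    by_id = {r["id"]: r for r in responses}
    merged = []
    for person in alumni:
        answer = by_id.get(person["id"])
        row = dict(person)
        # 1 if this person answered the survey, 0 otherwise.
        row["responded"] = 1 if answer is not None else 0
        # Non-respondents get None: there is no answer to carry over.
        row["satisfaction"] = answer["satisfaction"] if answer is not None else None
        merged.append(row)
    return merged

merged = attach_responses(alumni, responses)
for row in merged:
    print(row)
```

With an anonymous survey, the `by_id` lookup has nothing to match on, and neither the flag nor the answers can be tied back to class year or anything else in the database.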

Your database provides the ideal key to uniquely identify respondents. It doesn’t even have to be a student ID. The unique ID of each person’s database record (if you use Banner, the PIDM) is perfect: It’s unique to the individual, but otherwise it’s meaningless outside of the database. No one outside your institution can link it to other data, so there is no privacy issue if you incorporate it in a mail-merged letter or email inviting people to participate. It can even be added to the printed label of an alumni magazine.

If you’ve got good email addresses for a sizable chunk of your alumni, you’ve got what you need: email each person a unique ID to use when logging into an online survey, without requiring them to provide their name or any other information you’ve already got in your database. (A cheap software plug-in for Outlook does a fine job of automating the mail-merge.)
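One way to prepare such a mailing, sketched in Python under stated assumptions: the contact rows, the `survey.example.edu` address and the `?id=` link format are all placeholders, not features of any particular survey tool or mail-merge plug-in. The script simply writes one row per person for a merge tool to consume.

```python
import csv
import io

# Hypothetical contact rows pulled from the database (illustrative only).
contacts = [
    {"id": 1001, "email": "alum1@example.edu"},
    {"id": 1002, "email": "alum2@example.edu"},
]

# Placeholder survey address; substitute whatever your survey tool expects.
SURVEY_URL = "https://survey.example.edu/alumni"

def build_merge_rows(contacts, base_url=SURVEY_URL):
    """One mail-merge row per person: email address, ID, and login link."""
    return [
        {
            "email": c["email"],
            "survey_id": c["id"],
            "link": f"{base_url}?id={c['id']}",
        }
        for c in contacts
    ]

def to_csv(rows):
    """Serialise the rows as CSV text that a mail-merge tool can read."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["email", "survey_id", "link"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = build_merge_rows(contacts)
print(to_csv(rows))
```

The only thing the recipient ever sees is the ID, which means nothing outside your database.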

I said, “surveying large numbers of alumni”. That’s important. A survey directed solely at the Class of 1990, or only at the attendees of the past Homecoming, is of limited use for modeling. A broad cross-section of your sample should have had at least the opportunity to participate. Otherwise, your variable or variables will be nothing more than proxies for “graduated in 1990” and “attended Homecoming.”
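A quick check for this problem, sketched in Python with hypothetical data: tabulate the share of each class year that was invited. The `invited` flag and the sample rows are assumptions for illustration; the point is only that a survey aimed at one class shows up as a share of 1.0 for that year and 0.0 everywhere else.

```python
from collections import Counter

# Hypothetical rows: class year and whether the person was invited.
alumni = [
    {"class_year": 1990, "invited": 1},
    {"class_year": 1990, "invited": 1},
    {"class_year": 1985, "invited": 0},
    {"class_year": 2001, "invited": 0},
]

def invited_share_by_year(alumni):
    """Fraction of each class year that had the opportunity to participate."""
    totals = Counter(a["class_year"] for a in alumni)
    invited = Counter(a["class_year"] for a in alumni if a["invited"])
    # Counter returns 0 for years with no invitees.
    return {year: invited[year] / totals[year] for year in sorted(totals)}

shares = invited_share_by_year(alumni)
print(shares)
```

If the shares look like this, any variable built from the survey is really just standing in for class year.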

But probably you’re not inviting every living alumnus/na to participate. And even if you did, most of them wouldn’t take part. This creates a problem with your subsequent model building: missing data. If you use multiple regression, your software will toss out all the cases that have missing data for any of the predictor variables you pull in. You’ve got to put something in there, but what?
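To see how much data this can cost you, here is a small Python sketch that mimics the behaviour described above, usually called listwise deletion: any case missing a value for any predictor is dropped. It does not call a statistics package; the rows and field names are hypothetical, with `None` marking an alum who gave no survey answer.

```python
# Hypothetical modelling rows: None marks an alum with no survey answer.
rows = [
    {"id": 1001, "class_year": 1990, "satisfaction": 4},
    {"id": 1002, "class_year": 1985, "satisfaction": None},
    {"id": 1003, "class_year": 2001, "satisfaction": 5},
    {"id": 1004, "class_year": 1978, "satisfaction": None},
]

def complete_cases(rows, predictors):
    """Keep only rows with a value for every predictor (listwise deletion)."""
    return [r for r in rows if all(r[p] is not None for p in predictors)]

without_survey = complete_cases(rows, ["class_year"])
with_survey = complete_cases(rows, ["class_year", "satisfaction"])

print(len(without_survey), "cases usable with class_year alone")
print(len(with_survey), "cases usable once satisfaction is added")
```

Adding the survey variable shrinks the usable sample to the respondents alone, which is exactly the problem: everyone who did not answer vanishes from the model.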

I’ll tell you what, in the next post!


