CoolData blog

11 May 2015

A new way to look at alumni web survey data

Filed under: Alumni, Surveying, Vendors — Tags: , , , , — kevinmacdonell @ 7:38 pm

Guest post by Peter B. Wylie, with John Sammis


Click to download the PDF file of this discussion paper: A New Way to Look at Survey Data


Web-based surveys of alumni are useful for all sorts of reasons. If you go to the extra trouble of doing some analysis — or push your survey vendor to supply it — you can derive useful insights that could add huge value to your investment in surveying.


This discussion paper by Peter B. Wylie and John Sammis demonstrates a few of the insights that emerge by matching up survey data with some of the plentiful data you have on alums who respond to your survey, as well as those who don’t.


Neither alumni survey vendors nor their higher education clients are doing much work in this area. But as Peter writes, “None of us in advancement can do too much of this kind of analysis.”


Download: A New Way to Look at Survey Data



18 November 2010

Survey says … beware, beware!

Filed under: Alumni, skeptics, Surveying — Tags: , , — kevinmacdonell @ 4:45 pm

I love survey data. But sometimes we get confused about what it’s really telling us. I don’t claim to be an expert on surveying, but today I want to talk about one of the main ways I think we’re led astray. In brief: Surveys would seem to give us facts, or “the truth”. They don’t. Surveys reveal attitudes.

In higher education, surveying is of prime importance in benchmarking constituent engagement in order to identify programmatic areas that are underperforming, as well as areas that are doing well and for which making changes therefore entails risk. Making intelligent, data-driven decisions in these areas can strengthen programming, enhance engagement, and finally increase giving. And there’s no doubt that the act of responding to a survey, the engagement score that might result, and the responses to individual questions or groups of questions, are all predictive of giving. I have found so myself in my own predictive modeling at two universities.

But let’s not get carried away. Survey data can be a valuable source of predictor variables, but it’s a huge leap from making that admission to saying that survey data trumps everything.

I know of at least one vendor working in the survey world who does make that leap. This vendor believes surveying is THE singular best way to predict giving, and that survey responses have it all over the regular practice of predictive modeling using variables mined from a database. Such “archival” data provides “mere correlates” of engagement. Survey data provides the real goods.

I see the allure. Why would we put any stock in some weak correlation between the presence of an email address and giving, when we can just ask them how they feel about giving to XYZ University?


I have incorporated survey data in my own models, data that came from two wide-ranging, professionally-designed, Likert-type surveys of alumni engagement. Survey data is great because it’s fresh, independent of giving, and revealing of attitudes. It is also extremely biased in favour of highly-engaged alumni, and is completely disconnected from reality when it comes to gathering facts as opposed to attitudinal data.

Let me demonstrate the unreliability of survey data with regard to facts. Here are a few examples of statements and responses (one non-Likert), gathered from surveys of two institutions:

  • “I try to donate every year” — 946 individuals answered “agree” or “strongly agree” — but 12.3% of those 946 had no lifetime giving.
  • “I support XYZ University regularly” — 1,001 individuals answered “agree” or “strongly agree” — but 18.7% of them had no lifetime giving.
  • “Have you ever made a charitable gift to XYZ University (Y/N)?” — 1,690 individuals said “Yes” — but 8.1% of them had no lifetime giving.
  • “I support XYZ University to the best of my capacity” — 1,498 individuals answered “agree” or “strongly agree” — but 39.6% of them had no lifetime giving!

And, even stranger:

  • “I try to donate every year” — 1,371 answered “disagree” or “strongly disagree” — but 27.7% of those respondents were in fact donors!

Frankly, if I asked survey-takers how many children they have, I wouldn’t trust the answers.

This disconnect from reality actually works in my favour when I am creating predictive models, because I have some assurance that the responses to these questions is not just a proxy for ‘giving’, but rather a far more complicated thing that has to do with attitude, not facts. But in no model I’ve created has survey data (even carefully-selected survey data strongly correlated with giving) EVER been more predictive than the types of data most commonly used in predictive models — notably age/class year, the presence/absence of certain contact information, marital status, employment information, and so on.

For the purposes of identifying weaknesses or strengths in constituent engagement, survey data is king. For predicting giving in its various forms, survey data and engagement scores are just more variables to test and work into the model — nothing more, nothing less — and certainly not something magical or superior to the data that institutions already have in their databases waiting to be mined. I respect the work that people are doing to investigate causation in connection with giving. But when they criticize the work of data miners as “merely” dealing in correlation, well that I have a problem with.

1 February 2010

Surveys and missing data

Filed under: Model building, Pitfalls, Surveying — Tags: , , , , — kevinmacdonell @ 9:24 pm

Survey data is valuable for predicting mass behaviours. Inevitably, though, some pieces will be missing. (Photo by Photo Monkey, used by Creative Commons license. Click photo for more.)

In a previous post I talked about the great predictive power of survey responses. Today I’ll explain what to do about one of the roadblocks you’ll encounter – missing data.

The problem of missing data is a big issue in statistics, and a number of techniques are available for dealing with it. The ideas I offer here may or may not meet the high standards of a statistician, but they do offer some more or less reasonable solutions.

Let’s say you’re interested in creating only one variable from the survey, an indicator variable which records whether a person participated or not. The mere presence or absence of this data point will probably be predictive. You could code all responders as ‘1’, and everyone else as ‘0’. This would work well if a large portion of your sample received an invite.

Alternately, you could code responders as ‘1’, non-responders as ‘-1’, and put the “uninvited,” everyone who didn’t receive an invitation, into the neutral middle zone by giving them a zero. To avoid negative numbers, just add one to each of these values; it may seem strange to reward the uninvited a ‘1’, but what you’re trying to do here is see if they differ from the people who actually had a chance to participate and chose not to: An action taken, in a negative direction.

Use only one or the other of these two variables, whichever one ‘works’. Test both against mean and median lifetime giving (or whatever your dependent variable is). If the three-level variable shows a nice linear relationship with your DV – with low giving for the non-responders, higher giving for the uninvited, and highest giving for the responders – then use that variable in your regression.

Statistically sound? Perhaps not. But if the alternative is tossing out a potentially valuable predictor, I don’t see the harm.

That covers missing data for the simple fact of participation / non-participation in a survey. You can go much deeper than that. A typical survey of alumni will yield many potential predictor variables. If your survey is getting at attitudes about your institution, or about giving or volunteering, or attending events, responses to individual questions can be powerfully predictive. Again, if you’re using regression, all you need to do is find a logical way to re-express the response as a number.

For example, you can recode yes/no questions as 1/0 indicator variables. If the responses to a question are categorical in nature (for a question such as, “What is your mother’s nationality?”), you may wish to test indicator variables for the various responses, and along with that, an indicator variable for “Did not answer the question.” In such a case, missing data may have its own underlying pattern (i.e. it is non-random), and may correlate with the value you’re trying to predict.

Sometimes the data is already expressed as a number. Surveys often use Likert scale questions, in which responders are asked to rate their level of agreement to a statement. Typical responses might range from 1 (“strongly disagree”) to 5 or 6 (“strongly agree”). Likert scales are, strictly speaking, ordinal in nature, not continuous: There’s no reason to believe that the “distance” between ‘1’ and ‘2’ is the same as the “distance” between ‘2’ and ‘3’. However, I accept them as a logical ranking, perfectly suitable for a regression independent variable.

(Incidentally, a survey designer will use an even-numbered scale if he wants to disallow neutral responses, a practice which gets an answer but sometimes causes frustrated survey-takers to skip questions.)

So: rich numerical data, ready to plug into our analysis, but only one problem: Missing data. Some people skip questions, some fail to take the survey, others were never invited in the first place. This time, you can’t just plug a zero into every empty space. Innocent non-responders would come off as a very negative bunch, completely throwing off your predictor. But again, if you don’t have some type of number present for that variable for all cases in your dataset, they’ll be excluded from the regression analysis. At the risk of oversimplifying, I would say you’ve got three options, with three levels of sophistication:

  1. Substitution of a neutral value.
  2. Substitution of a mean value.
  3. Imputation.

1. Neutral-value substitution. This is the method I used the first time I incorporated a lot of Likert-scale type data into a predictive model. It was very simple. Every person with a missing value for a given question received a value falling perfectly halfway between “strongly disagree” and “strongly agree.” For a scale of 1 to 6, that value is 3.5. Of course, ‘3.5’ was not a possible choice on the survey itself, which forced respondents to commit to a slightly negative or positive response, but that didn’t mean I couldn’t use the middle value to replace our unknown values.

There is one problem with this method, though … if you think about it, what’s so ‘neutral’ about 3.5? If you took all the actual responses and calculated the average, it might be significantly higher or lower than 3.5. Let’s say actual respondents had an average response of 5 for a particular question. If we code everyone else as 3.5, that’s characterizing them as negative, in relation to the respondents. We may have no basis for doing so.

2. Mean-value substitution. The problem I describe above can be addressed by mean-value substitution, which is the method I perhaps should have used. If the average response for your actual respondents is 2.67, then substitute 2.67 for all your missing values. If it’s 5, use 5. (If your response data is not Likert-scale in nature, but rather contains extreme values, use the median value for the variable rather than the average value.)

3. Imputation. This term is used to describe a variety of related methods for guessing the “most likely” missing value based on the values found in other variables. These methods include some advanced options made available in software such as SAS and SPSS.

The third option may be regarded as the best from a statistical point of view. Alas, I have not used these more advanced techniques. I can only speak from my experience with the first two. For now at least, I accept the drawbacks of substituting the population mean for missing data (one drawback being a gross underestimation of variance), in order for me to quickly and easily tap the power of survey data in my models.

What would YOU do?

31 January 2010

Using survey data in regression models

Filed under: Predictor variables, Surveying — Tags: , — kevinmacdonell @ 9:36 pm

Surveys can provide a rich load of fresh data you can incorporate into your models. The very act of agreeing to participate in a survey is a trait likely to be highly predictive, regardless of the model you’re building.

If you work at a university, be attuned to people at your institution who might be surveying large numbers of alumni, and encourage them to make their surveys non-anonymous. They’ll get much richer possibilities for analysis if they can relate responses to demographic information in the database (class year, for example). Remind them that people aren’t necessarily put off by non-anonymous surveys; if they were, restaurants, retailers and other private-sector corporations wouldn’t bother with all the customer-satisfaction surveying that they do. Non-anonymity is a basic requirement for data mining: If you don’t know who’s giving the answers, you’ve got nothing.

Your database provides the ideal key to uniquely identify respondents. It doesn’t even have to be a student ID. The unique ID of each person’s database record (if you use Banner, the PIDM) is perfect: It’s unique to the individual, but otherwise it’s meaningless outside of the database. No one outside your institution can link it to other data, so there is no privacy issue if you incorporate it in a mail-merged letter or email inviting people to participate. It can even be added to the printed label of an alumni magazine.

If you’ve got good email addresses for a sizable chunk of your alumni, you’ve got what you need to provide a unique ID you can email to each person to log into a survey online – without requiring them to provide their name or any other information you’ve already got in your database. (A cheap software plug-in for Outlook does a fine job of automating the process of mail-merges.)

I said, “surveying large numbers of alumni”. That’s important. A survey directed solely at the Class of 1990, or only at the attendees of the past Homecoming, is of limited use for modeling. A broad cross-section of your sample should have had at least the opportunity to participate. Otherwise, your variable or variables will be nothing more than proxies for “graduated in 1990” and “attended Homecoming.”

But probably you’re not inviting every living alumnus/na to participate. And even if you did, most of them wouldn’t take part. This creates a problem with your subsequent model building: missing data. If you use multiple regression, your software will toss out all the cases that have missing data for any of the predictor variables you pull in. You’ve got to put something in there, but what?

I’ll you what, in the next post!

Create a free website or blog at