CoolData blog

28 May 2010

Surveys and alumni who go the extra mile

Filed under: Alumni, Surveying — kevinmacdonell @ 12:37 pm


Alumni surveys are frequently conducted online, but sometimes there is a mail component as well, either for boosting participation among under-represented demographic groups, or for maximizing response in general. Make sure that whoever designs the survey includes the mode of response in the results data — it’s very predictive.

Our institution conducted an extensive survey in 2008, and invitations to participate were sent via both mail and email. Not surprisingly, the response rate via mail was quite poor compared with the rate via email: filling out a survey and mailing it in is just not as convenient as clicking around on the computer. As a data miner, you know that survey participation is highly correlated with giving — the relationship is even stronger among alumni who go the extra mile to participate by mail.

In our case, the percentage of the survey response group who have some giving is more than 22 points higher than the alumni population as a whole. But if you split mail from email responders, it’s even more impressive: Mail responders have a donor rate that is 29.1 points higher than the general alumni population.

As well, donors who responded to the survey have much higher average and median lifetime giving than donors who did not, and donors who took the trouble to respond by mail have even higher average and median lifetime giving.

If your by-mail numbers are small (and they probably will be), they might not move the yardsticks in a predictive model by much, especially if the trait is highly correlated with age. But if the data is there, it’s probably worth the extra step of flagging the stamp-lickers with a variable to include in the model. When is the last time you licked a stamp, for anything? People who do it for alma mater are special!
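If you do flag the stamp-lickers, the variable itself is trivial to build. Here is a minimal pandas sketch; the column names and data are hypothetical, since the original dataset isn't shown:

```python
# Sketch: flag by-mail survey responders as a 0/1 model variable.
# "response_mode" is a hypothetical column recording how (or whether)
# each alum responded; None means no response at all.
import pandas as pd

alumni = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "response_mode": ["email", "mail", None, "email", "mail"],
})

# 1 if the person went to the trouble of responding by mail, else 0
alumni["responded_by_mail"] = (alumni["response_mode"] == "mail").astype(int)
print(alumni["responded_by_mail"].tolist())  # [0, 1, 0, 0, 1]
```

The flag can then go into the model alongside your other predictors, age-related caveats and all.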

7 May 2010

More on surveys and missing data

Filed under: Statistics, Surveying — kevinmacdonell @ 8:35 am

A few months ago I blogged about survey data (specifically Likert scale data), the inevitable problem of missing data, and a few simple ideas for how to smooth over that problem prior to using the data in a regression model. (“Surveys and missing data.”) I recently had a great email from Tom Knapp, Professor Emeritus, University of Rochester and The Ohio State University (bio), who pointed out that I left out “one popular but controversial option”: Substitute for a missing response the mean of that person’s other responses.

He writes: “The main reason it’s controversial is that it tends to artificially increase the internal-consistency reliability of the measuring instrument — it makes the items look like they hang together better than they do.”

(For our purposes, which are purely predictive, I think this approach is justified. When the alternative is dropping the case from the analysis, we should not feel we’re messing with the data if we seek other methods that offer significant predictors.)

Professor Knapp is not a fan of any imputation methods that are based on “missing at random” or “missing completely at random” assumptions. “People don’t flip coins in order to decide whether or not to respond to an item,” he writes.

(I wonder, though, if there’s any harm in using these methods if the number of missing answers is low, and there is no apparent pattern to the missingness?)

And finally, Prof. Knapp offered this very helpful paper written by Tom Breur, Missing Data and What to Do About It, available as a free download here. It’s a couple of shades more technical than my previous discussion but it’s an excellent overview.

Postscript: Today (18 May 2010) as I am engaged in replacing missing values in Likert scale questions in some survey data I am analyzing, I realize one strength of substituting the mean of a person’s other responses: It ensures that people who skip a lot of questions and give negative responses to the ones they do answer won’t end up resembling the rest of the survey population, who may have given very positive responses. It might be more reasonable to assume that a person’s responses, had they bothered to offer them, would have been consistent with the responses they did offer – more reasonable, surely, than assuming they would be consistent with how the survey population in general responded. So perhaps incorporating survey data into a regression model requires two stages of missing-value replacement: 1) same-person mean response substitution (a la Prof. Knapp, above) for respondents who failed to answer all the questions, and 2) mean-value substitution for all the rest of the population who did not participate in the survey or were not invited. If that sounds like a lot of bother, yes it is. But knowing how darned predictive survey data is, I would be willing to go to the extra trouble to get it right.
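The two-stage replacement described in the postscript can be sketched in a few lines of pandas. The Likert column names (q1 to q3) and values here are invented for illustration:

```python
# Stage 1: same-person mean substitution for partial respondents
# (fill a respondent's missing answers with the mean of their own answers).
# Stage 2: population-mean substitution for everyone else
# (people who skipped the survey entirely, or were never invited).
import pandas as pd

df = pd.DataFrame({
    "q1": [5.0, None, None],
    "q2": [4.0, 2.0, None],
    "q3": [None, 2.0, None],
})
likert = ["q1", "q2", "q3"]

# Stage 1: row means ignore NaN, so each partial respondent gets
# filled with the average of the questions they did answer
row_means = df[likert].mean(axis=1)
partial = df[likert].notna().any(axis=1)
for col in likert:
    df.loc[partial, col] = df.loc[partial, col].fillna(row_means[partial])

# Stage 2: remaining rows (no answers at all) get each question's mean
df[likert] = df[likert].fillna(df[likert].mean())
```

Note the order matters: running stage 2 first would drag partial respondents toward the population, which is exactly what the same-person method is meant to avoid.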

1 April 2010

Does “no children” really mean Planned Giving potential?

Filed under: Planned Giving, Predictor variables, Surveying — kevinmacdonell @ 11:29 am

I gave a presentation to fundraising professionals and other nonprofit types recently, and I spent a little time discussing my work with predicting Planned Giving potential. One of the attendees asked if I was aware of a recent study that found that the most significant predictor for Planned Giving was the absence of children.

I was, and in my (not very coherent) response I said something to the effect that although this was interesting, I had reservations about taking an observation based on other institutions’ populations and applying it to ours. I would prefer to test it, I said. (I believe that someone else’s valid observation about their own data is only an assumption when applied blindly to mine.) And then I said that we don’t have the data to begin with.

But as I was talking, a thought occurred to me: Yes, in fact we DO have child data! I had even used that data in my PG model, but it had never occurred to me to study it very closely.

Back in the spring of 2009, our school conducted an extensive online survey of alumni as part of a national benchmarking study of alumni engagement. One of the core questions (supplied by the study firm, Engagement Analysis Inc.) asked specifically about likelihood to consider a bequest. Another question, which we added ourselves, asked respondents how many children they had under the age of 18. (We had a purpose in asking about “under 18”, and it wasn’t Planned Giving. Had I specifically been seeking a PG predictor, I would not have qualified the statement. Presumably the positive “childless effect” is explained by the lack of need to divide an estate up among children, regardless of their age.)

Our response rate was very high, and quite representative of our alumni population. Standing there in the midst of my presentation, I realized I had enough information to test the ‘childless’ theory in the environment of our own data.

The chart below shows survey responses to the PG question on the horizontal axis. The question was actually a scale statement which indicated that the responder was very likely to leave a bequest to our institution. Possible answers ranged from 1 to 6, with a one meaning “strongly disagree” and a six meaning “strongly agree”. If the respondent did not answer the question, I coded it as zero so it would show up on my chart.

In the chart, each group of respondents (i.e., each vertical bar) is segmented according to their answer on the “children” question. Notice the relative size of the blue segments, the responders who have no children under 18. For the proportion of this segment, there is a difference of approximately ten percentage points between the “strongly agree” group and the “strongly disagree” group.

In other words, childless alumni in our survey data set ARE more receptive to considering Planned Giving.

I said earlier that the survey response was representative of our alumni population. Therefore, many of the responders are far too young to be considered prospects. So I made another chart, which shows only alumni in the older half of the population: Class year 1990 and earlier. The difference between these two charts will seem subtle because they’re busy-looking, so let me point it out to you: Now the gap between the “strongly disagree” and the “strongly agree” for people with no kids has widened to 15 percentage points. This is a vote of confidence in favour of using “number of children” as a predictor of PG receptivity.
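A comparison like the ones in these charts is essentially a crosstab of the two survey questions, which takes one line in pandas. The data below is made up purely to show the shape of the calculation, not to reproduce our actual results:

```python
# Proportion of respondents with no children under 18, broken out
# by their answer to the bequest scale question (1 = strongly
# disagree, 6 = strongly agree). Toy data for illustration only.
import pandas as pd

survey = pd.DataFrame({
    "bequest_likelihood": [1, 1, 6, 6, 6, 3],
    "children_under_18":  [2, 0, 0, 0, 1, 0],
})
survey["no_children"] = survey["children_under_18"] == 0

# normalize="index" turns counts into within-row proportions,
# i.e. the share of each answer group that is childless
pct = pd.crosstab(survey["bequest_likelihood"], survey["no_children"],
                  normalize="index")
```

The gap between `pct.loc[6, True]` and `pct.loc[1, True]` is the percentage-point difference discussed above.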

But here’s a question: Can you use child data to segment your prospect pool, and thereby avoid having to engage in predictive modeling? My answer is “No.” In both of the charts above, a majority of respondents answered “no children”, regardless of their attitude to Planned Giving. Yes, there’s a difference among the groups, but although it is significant, it is not definitive.

Others may quibble, saying that the data is suspect because we only asked about children under 18. But I really think this predictor is a lot like certain other conventional predictors, the ones related to frequency and consistency of giving: Alone, they are not powerful enough to isolate your best PG prospects. Only when you combine them with the full universe of other proven predictors in your database (event attendance, marital status, etc.) will you end up with something truly useful.

1 February 2010

Surveys and missing data

Filed under: Model building, Pitfalls, Surveying — kevinmacdonell @ 9:24 pm

Survey data is valuable for predicting mass behaviours. Inevitably, though, some pieces will be missing.

In a previous post I talked about the great predictive power of survey responses. Today I’ll explain what to do about one of the roadblocks you’ll encounter – missing data.

The problem of missing data is a big issue in statistics, and a number of techniques are available for dealing with it. The ideas I offer here may or may not meet the high standards of a statistician, but they do offer some more or less reasonable solutions.

Let’s say you’re interested in creating only one variable from the survey, an indicator variable which records whether a person participated or not. The mere presence or absence of this data point will probably be predictive. You could code all responders as ‘1’, and everyone else as ‘0’. This would work well if a large portion of your sample received an invite.

Alternately, you could code responders as ‘1’, non-responders as ‘-1’, and put the “uninvited,” everyone who didn’t receive an invitation, into the neutral middle zone by giving them a zero. To avoid negative numbers, just add one to each of these values. It may seem strange to award the uninvited a ‘1’ while the invited non-responders get a ‘0’, but what you’re trying to do here is see whether the non-responders differ from the people who actually had a chance to participate and chose not to: an action taken, in a negative direction.

Use only one or the other of these two variables, whichever one ‘works’. Test both against mean and median lifetime giving (or whatever your dependent variable is). If the three-level variable shows a nice linear relationship with your DV – with low giving for the non-responders, higher giving for the uninvited, and highest giving for the responders – then use that variable in your regression.
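The three-level version, plus the quick linearity check against mean giving, can be sketched like this. The column names and dollar figures are hypothetical:

```python
# Three-level survey variable: responders = 2, uninvited = 1 (neutral
# middle), invited non-responders = 0 (the "+1 to avoid negatives"
# coding described above). Toy data for illustration.
import pandas as pd

df = pd.DataFrame({
    "invited":         [1, 1, 0, 1, 0],
    "responded":       [1, 0, 0, 1, 0],
    "lifetime_giving": [500.0, 20.0, 100.0, 800.0, 150.0],
})

df["survey_var"] = 1                            # uninvited: neutral middle
df.loc[df["invited"] == 1, "survey_var"] = 0    # invited but silent
df.loc[df["responded"] == 1, "survey_var"] = 2  # responded

# The test described above: does mean giving rise with each level?
check = df.groupby("survey_var")["lifetime_giving"].mean()
```

If `check` climbs steadily from level 0 to level 2, the three-level variable is the one to keep.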

Statistically sound? Perhaps not. But if the alternative is tossing out a potentially valuable predictor, I don’t see the harm.

That covers missing data for the simple fact of participation / non-participation in a survey. You can go much deeper than that. A typical survey of alumni will yield many potential predictor variables. If your survey is getting at attitudes about your institution, or about giving or volunteering, or attending events, responses to individual questions can be powerfully predictive. Again, if you’re using regression, all you need to do is find a logical way to re-express the response as a number.

For example, you can recode yes/no questions as 1/0 indicator variables. If the responses to a question are categorical in nature (for a question such as, “What is your mother’s nationality?”), you may wish to test indicator variables for the various responses, and along with that, an indicator variable for “Did not answer the question.” In such a case, missing data may have its own underlying pattern (i.e. it is non-random), and may correlate with the value you’re trying to predict.
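Recoding a categorical question into indicators, with “did not answer” kept as its own flag, is a one-liner with pandas dummies. The question and column names are illustrative only:

```python
# Indicator variables for a categorical survey question, treating
# non-response as its own category (missingness may be non-random
# and predictive in its own right). Hypothetical data.
import pandas as pd

df = pd.DataFrame({"mothers_nationality": ["Canadian", None, "Irish", None]})

# Fill missing answers with an explicit label before dummy-coding,
# so "No answer" gets its own 0/1 column alongside the real responses
dummies = pd.get_dummies(df["mothers_nationality"].fillna("No answer"),
                         prefix="nat")
df = df.join(dummies)
```

Each resulting column (including `nat_No answer`) can then be tested against your dependent variable like any other indicator.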

Sometimes the data is already expressed as a number. Surveys often use Likert scale questions, in which responders are asked to rate their level of agreement to a statement. Typical responses might range from 1 (“strongly disagree”) to 5 or 6 (“strongly agree”). Likert scales are, strictly speaking, ordinal in nature, not continuous: There’s no reason to believe that the “distance” between ‘1’ and ‘2’ is the same as the “distance” between ‘2’ and ‘3’. However, I accept them as a logical ranking, perfectly suitable for a regression independent variable.

(Incidentally, a survey designer will use an even-numbered scale if he wants to disallow neutral responses, a practice which gets an answer but sometimes causes frustrated survey-takers to skip questions.)

So: rich numerical data, ready to plug into our analysis, with only one problem: missing data. Some people skip questions, some fail to take the survey, others were never invited in the first place. This time, you can’t just plug a zero into every empty space: innocent non-responders would come off as a very negative bunch, completely throwing off your predictor. But again, if you don’t have some type of number present for that variable for all cases in your dataset, those cases will be excluded from the regression analysis. At the risk of oversimplifying, I would say you’ve got three options, with three levels of sophistication:

  1. Substitution of a neutral value.
  2. Substitution of a mean value.
  3. Imputation.

1. Neutral-value substitution. This is the method I used the first time I incorporated a lot of Likert-scale type data into a predictive model. It was very simple. Every person with a missing value for a given question received a value falling perfectly halfway between “strongly disagree” and “strongly agree.” For a scale of 1 to 6, that value is 3.5. Of course, ‘3.5’ was not a possible choice on the survey itself, which forced respondents to commit to a slightly negative or positive response, but that didn’t mean I couldn’t use the middle value to replace our unknown values.

There is one problem with this method, though … if you think about it, what’s so ‘neutral’ about 3.5? If you took all the actual responses and calculated the average, it might be significantly higher or lower than 3.5. Let’s say actual respondents had an average response of 5 for a particular question. If we code everyone else as 3.5, that’s characterizing them as negative, in relation to the respondents. We may have no basis for doing so.

2. Mean-value substitution. The problem I describe above can be addressed by mean-value substitution, which is the method I perhaps should have used. If the average response for your actual respondents is 2.67, then substitute 2.67 for all your missing values. If it’s 5, use 5. (If your response data is not Likert-scale in nature, but rather contains extreme values, use the median value for the variable rather than the average value.)
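Mean-value substitution is a one-liner. A quick sketch on an invented 1-to-6 Likert item:

```python
# Mean-value substitution: replace each missing answer with the
# average of the actual responses. (Use the median instead if the
# variable contains extreme values.) Toy data for illustration.
import pandas as pd

q = pd.Series([6, 5, None, 4, None])  # 1-6 Likert responses, some missing
q_filled = q.fillna(q.mean())         # actual-respondent mean is 5.0
```

Unlike the fixed 3.5, the fill value here moves with your respondents: a question everyone loved fills high, a question everyone hated fills low.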

3. Imputation. This term is used to describe a variety of related methods for guessing the “most likely” missing value based on the values found in other variables. These methods include some advanced options made available in software such as SAS and SPSS.

The third option may be regarded as the best from a statistical point of view. Alas, I have not used these more advanced techniques. I can only speak from my experience with the first two. For now at least, I accept the drawbacks of substituting the population mean for missing data (one drawback being a gross underestimation of variance), in order for me to quickly and easily tap the power of survey data in my models.

What would YOU do?

31 January 2010

Using survey data in regression models

Filed under: Predictor variables, Surveying — kevinmacdonell @ 9:36 pm

Surveys can provide a rich load of fresh data you can incorporate into your models. The very act of agreeing to participate in a survey is a trait likely to be highly predictive, regardless of the model you’re building.

If you work at a university, be attuned to people at your institution who might be surveying large numbers of alumni, and encourage them to make their surveys non-anonymous. They’ll get much richer possibilities for analysis if they can relate responses to demographic information in the database (class year, for example). Remind them that people aren’t necessarily put off by non-anonymous surveys; if they were, restaurants, retailers and other private-sector corporations wouldn’t bother with all the customer-satisfaction surveying that they do. Non-anonymity is a basic requirement for data mining: If you don’t know who’s giving the answers, you’ve got nothing.

Your database provides the ideal key to uniquely identify respondents. It doesn’t even have to be a student ID. The unique ID of each person’s database record (if you use Banner, the PIDM) is perfect: It’s unique to the individual, but otherwise it’s meaningless outside of the database. No one outside your institution can link it to other data, so there is no privacy issue if you incorporate it in a mail-merged letter or email inviting people to participate. It can even be added to the printed label of an alumni magazine.

If you’ve got good email addresses for a sizable chunk of your alumni, you’ve got what you need to provide a unique ID you can email to each person to log into a survey online – without requiring them to provide their name or any other information you’ve already got in your database. (A cheap software plug-in for Outlook does a fine job of automating the process of mail-merges.)

I said, “surveying large numbers of alumni”. That’s important. A survey directed solely at the Class of 1990, or only at the attendees of the past Homecoming, is of limited use for modeling. A broad cross-section of your sample should have had at least the opportunity to participate. Otherwise, your variable or variables will be nothing more than proxies for “graduated in 1990” and “attended Homecoming.”

But probably you’re not inviting every living alumnus or alumna to participate. And even if you did, most of them wouldn’t take part. This creates a problem with your subsequent model building: missing data. If you use multiple regression, your software will toss out all the cases that have missing data for any of the predictor variables you pull in. You’ve got to put something in there, but what?

I’ll tell you what, in the next post!

11 January 2010

The 15 top predictors for Planned Giving – Part 3

Okay, time to deliver on my promise to divulge the top 15 predictor variables for propensity to enter a Planned Giving commitment.

Recall the caveat about predictors that I gave for Annual Giving: These variables are specific to the model I created for our institution. Your most powerful predictors will differ. Try to extract these variables from your database for testing, by all means, but don’t limit yourself to what you see here.

In Part 2, I talked about a couple of variables based on patterns of giving. The field of potential variables available in giving history is rich. Keep in mind, however, that these variables will be strongly correlated with each other. If you’re using a simple-score method (adding 1 to an individual’s score for each positively-correlated predictor variable), be careful about using too many of them and exaggerating the importance of past giving. On the other hand, if you use a multiple regression analysis, these related variables will interact with each other – this is fine, but be aware that some of your hard-won variables may be reduced to complete insignificance.
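For reference, the simple-score method mentioned above amounts to nothing more than a row sum over 0/1 predictor columns, which is why correlated giving variables can pile up and dominate it. A tiny sketch with hypothetical predictors:

```python
# Simple-score method: add 1 to a person's score for each
# positively-correlated predictor they have. Column names are
# hypothetical, not the actual model variables.
import pandas as pd

df = pd.DataFrame({
    "has_lifetime_giving": [1, 0, 1],
    "attended_homecoming": [1, 0, 0],
    "married":             [1, 1, 0],
})
df["simple_score"] = df.sum(axis=1)  # one point per predictor present
```

Three giving-based indicators in that sum would count past giving three times over; regression, by contrast, lets the correlated variables share (and shrink) each other’s weight.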

Just another reason to look beyond giving history!

For this year’s Planned Giving propensity model, the predicted value (‘Y’) was a 0/1 binary value: “1” for our existing commitments, “0” for everyone else. (Actually, it was more complicated than that, but I will explain why some other time.)

The population was composed of all living alumni Class of 1990 and older.

The list

The most predictive variables (roughly in order of influence) are listed below. Variables that have a negative correlation are noted N. Note that very few of these variables can be considered continuous (e.g. Class Year) or ordinal (survey scale responses). Most are binary (0/1). But ALL are numeric, as required for regression.

  1. Total lifetime giving
  2. Number of Homecomings attended
  3. Response to alumni survey scale question, regarding event attendance
  4. Number of President’s Receptions attended
  5. Class Year (N)
  6. Recency: Gave in the past 3 years
  7. Holds another degree from another university (from survey)
  8. Marital status ‘married’
  9. Prefix is Religious (Rev., etc.) or Justice
  10. Alumni Survey Engagement score
  11. Business phone present
  12. Number of children under 18 (from survey) (N)

Like my list of Annual Giving predictors, this isn’t a full list (and it isn’t 15 either!). These are the most significant predictors which don’t require a lot of explanation.

Note how few of these variables are based on giving – ‘Years of giving’ and ‘Frequency of giving’ don’t even rate. (‘Lifetime giving’ seems to take care of most of the correlation between giving and Planned Giving commitment.) And note how many variables don’t even come from our database: They come from our participation in a national survey for benchmarking of alumni engagement (conducted in March 2009).

