CoolData blog

22 June 2010

Making “Email present” predictive again

Filed under: Alumni, Predictor variables. Posted by kevinmacdonell @ 5:27 am

Gmail, Yahoo, Hotmail - their presence as an alum's preferred email address is a negative predictor of giving.

“Email present” (0/1) seems to be breathing its last as a good predictor of giving. Even if you do find some positive correlation between having an address and giving, it probably pales in comparison to other contact-related variables such as the presence of home phone or business phone.

Is there anything we can do to save it? I think so.

The key is in how people use email. I don’t have hard data on this, but it’s my impression that most people have at least two addresses, and many have more. One might be their work email. Another might be their personal “home” address, shared with other members of the household. And frequently there will be a third address, also personal, but more “public” than the home email and reserved for messages that the recipient considers relatively unimportant.

If it seems easier than ever to collect email addresses, that’s because it is. But instead of getting a work address, you’re more likely to get a personal address, and it will probably be of the third kind: a spam account, which the recipient may or may not be checking. This is the account that a person will use to sign up for things online, to enter contests, or to subscribe to various things, any time one expects to receive follow-up or advertising messages that would be unwelcome in a workplace account. These are typically Gmail or Hotmail accounts; because they have practically unlimited quota, you’ll never get a bounce-back due to a full inbox. Your database will steadily accumulate a trove of useless contact information.

Like alumni with a business phone in your database, alumni who share their business or work email address are keen to hear from you. This is probably less true for alumni who would rather receive mail at their “home” address. And it’s least true for alumni who shunt your messages to a low-priority, free account.

But unlike phone numbers, email addresses don’t have a code in your database to indicate the context (business, home, seasonal, mobile). It may be impossible to tell. However, it’s not hard to screen out the low-value accounts. After all, the field is dominated by a handful of likely suspects, two of which I’ve already named.

Step one is to pull all the valid email addresses from your database, plus a column for Lifetime Giving. Paste this data into an Excel spreadsheet. Insert a new column to the right of the Email Address column, and label it “Domain.” In this column, you’re going to capture everything to the right of the @ symbol, i.e. the domain name. The formula will look something like this (assuming the email address sits in cell A2; adjust the reference to match your own sheet):

=RIGHT(A2, LEN(A2) - FIND("@", A2))


Copy the formula for the whole length of your spreadsheet, and inspect the results to ensure you’ve isolated just the domain names. The rest of your analysis is easiest to do in stats software. Copy your columns, including Lifetime Giving, into the stats package and create a sorted frequency table to see which domains are most common in your database.
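If you would rather skip the spreadsheet step, the domain extraction and the sorted frequency table can be roughed out in a few lines of Python. This is only a sketch: the `records` list, the `email_domain` helper and the sample addresses are all made up for illustration, and your own data would come from a database export instead.

```python
from collections import Counter


def email_domain(address):
    """Return the lowercased text after the last '@', or None if there is no usable address."""
    if not address or "@" not in address:
        return None
    domain = address.rsplit("@", 1)[1].strip().lower()
    return domain or None


# Hypothetical rows of (email address, lifetime giving), invented for illustration.
records = [
    ("pat@hotmail.com", 0.0),
    ("lee@Gmail.com", 25.0),
    ("sam@employer.example", 500.0),
    ("kim@hotmail.com", 10.0),
    (None, 50.0),
]

# Count how often each domain appears; missing addresses are counted under None.
domain_counts = Counter(email_domain(email) for email, _ in records)

# most_common() lists domains from most to least frequent: a sorted frequency table.
for domain, count in domain_counts.most_common():
    print(domain, count)
```

Lowercasing at this stage already takes care of differences in capitalization, so "Gmail.com" and "gmail.com" are counted together.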

You’ll probably discover that fewer than half a dozen companies account for three-quarters of addresses. Do a little recoding of the variable to gather together variations on the same domain (differences in capitalization and so on). Recode all remaining addresses as “Other”, and missing addresses as “None”. Then check how each domain category compares as far as giving is concerned.
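The recoding and the comparison of average giving might look roughly like the following in Python. Treat it as a sketch under assumptions: the `DOMAIN_GROUPS` lookup, the function names and the sample rows are hypothetical, and you would build the real lookup from your own frequency table.

```python
from collections import defaultdict

# Hypothetical lookup that folds variants of a domain into one label.
# These entries are examples only; extend the list from your own frequency table.
DOMAIN_GROUPS = {
    "hotmail.com": "Hotmail",
    "gmail.com": "Gmail",
    "googlemail.com": "Gmail",
    "yahoo.com": "Yahoo",
    "yahoo.ca": "Yahoo",
}


def domain_category(domain):
    """Map a raw domain (or a missing value) to a category label."""
    if domain is None or not domain.strip():
        return "None"
    return DOMAIN_GROUPS.get(domain.strip().lower(), "Other")


def average_giving_by_category(rows):
    """rows: iterable of (domain, lifetime_giving); non-donors (0.0) count toward the mean."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for domain, giving in rows:
        category = domain_category(domain)
        totals[category] += giving
        counts[category] += 1
    averages = {category: totals[category] / counts[category] for category in totals}
    # Sort in descending order by average lifetime giving.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows, invented for illustration.
rows = [
    ("Hotmail.com", 0.0),
    ("hotmail.com", 10.0),
    ("gmail.com", 25.0),
    ("employer.example", 500.0),
    (None, 50.0),
]
summary = average_giving_by_category(rows)
for category, average in summary:
    print(category, round(average, 2))
```

Keeping non-donors in the averages matters here: dropping them would hide the fact that a category is full of people who have never given.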

The following are the seven most common email domains in our database, plus ‘None’ and ‘Other’, sorted in descending order by average lifetime giving. (Averages include non-donors.)

Notice how those familiar generic email domains sit right at the bottom; alumni with no email at all have average giving that is five times that of Hotmail account holders! At the more generous end of the scale are domains which are popular for home email accounts in our city, hinting at a geographic influence, but also pointing to their superiority over the generics.

But it is the Other category that I think is most interesting. Once we’ve screened out a large portion of the generic and home email addresses, we’re left with a segment that is a much richer vein for business and employment-related addresses. These are most likely to be active accounts, probably with strict quotas, that get checked every day by people who actually read our messages.

Have a look at this. Here are Pearson’s r values for the strength of correlation between three different email-related variables and Lifetime Giving (log-transformed). ‘Top email domains’ (0/1) consists of Other and a couple of the more generous domains above. ‘Bottom email domains’ (0/1) consists of Yahoo, Gmail and Hotmail addresses. ‘Email address present’ (0/1) is just that: Is there or is there not an email (any email) present.
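For anyone who wants to reproduce this kind of comparison, here is a minimal Python sketch of building the three 0/1 variables and computing Pearson’s r against log-transformed giving. Several things here are my assumptions rather than the method used for the figures in this post: the data is invented, ‘Top email domains’ is simplified to the Other category alone, and the transform is log(1 + giving) so that non-donors with zero giving do not break the logarithm.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either variable is constant.
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical data: one domain category and one lifetime giving total per alum.
categories = ["Other", "Hotmail", "Gmail", "None", "Other", "Yahoo"]
giving = [500.0, 0.0, 25.0, 50.0, 120.0, 0.0]

# Assumed transform: log(1 + giving), so that zero giving maps to zero.
log_giving = [math.log1p(amount) for amount in giving]

# Three 0/1 indicator variables built from the domain category.
top_domains = [1 if c == "Other" else 0 for c in categories]
bottom_domains = [1 if c in {"Yahoo", "Gmail", "Hotmail"} else 0 for c in categories]
email_present = [1 if c != "None" else 0 for c in categories]

r_top = pearson_r(top_domains, log_giving)
r_bottom = pearson_r(bottom_domains, log_giving)
r_present = pearson_r(email_present, log_giving)

for name, r in [("top", r_top), ("bottom", r_bottom), ("present", r_present)]:
    print(name, round(r, 3))
```

With only six made-up rows the numbers mean nothing in themselves; the point is the shape of the calculation, which is the same on a full alumni file.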

It used to be that I would go with ‘Email address present’ as my predictor variable, but look how it pales in comparison with the other two! In place of a very weak predictor variable we now have one strong positive predictor and one strong negative predictor.

It’s always hard to say what will become of a correlation when everything is put together in an actual model, but so far I’m finding that both of the new variables are holding up very well in multiple regression, independently maintaining very low p-values. Email has been rescued from oblivion!

P.S.: Just became aware of a site called tempalias. This is a free service that provides temporary, throwaway email addresses, for people who want to sign up for an online service or community without providing a real email address. The user can set a maximum number of days or messages for which the tempalias will be valid, after which it is automatically deleted. Mail is forwarded to a person’s real address for as long as he or she figures it needs to be, and no longer. Do you have any of these addresses in your database? If so, they’re serving pretty much the same purpose as a lot of the Yahoo, Gmail, and Hotmail addresses you also have!

P.P.S.: Just this past December (2010) I watched a presentation given by a vendor, a predictive analytics software company, which broke down ‘email’ in exactly this way as an example of an analysis of a predictor variable. Yahoo, Hotmail and Gmail addresses were found to be negatively predictive in the data set they were using. After this independent example, I would not be surprised if this held true for many other data sets as well.



  1. Nice job on this one!

    Comment by Jeff Jetton — 22 June 2010 @ 11:47 am

  2. Eagle-eyed readers will notice that the email domain counts in the first table don’t add up – they’re almost 3,600 short because I deleted a line in the table (emails belonging to our own university’s domain) but the total did not update. C’est la vie.

    Comment by kevinmacdonell — 22 June 2010 @ 5:23 pm
