# CoolData blog

## 17 June 2010

### Is distance from campus correlated with giving?

Filed under: Alumni, Coolness, Predictor variables | kevinmacdonell @ 6:26 am

I’ve long been intrigued by the idea that there might be a correlation between distance from campus and giving to the university, as some people insist there is. And I’m doubly intrigued by the possibility that the correlation is not linear, but roughly defined by some kind of curved function. I’ve had no easy way to figure this out — until I enlisted the help of someone smarter than me.

The idea that there’s an association can be found in many sources. Phonathon expert and consultant Jason Fisher asserts that, possibly without exception, alumni participation in the annual fund increases with distance from campus (link). As far back as 1992, Wesley Lindahl and Christopher Winship reported that three studies showed no predictive effect of distance from the school, one indicated that living farther away was a predictor, and one that living closer was a predictor (link).

The notion that the correlation may be curvilinear is suggested by something Ray Satterthwaite of Engagement Analysis Inc. has told me. His extensive studies of alumni engagement reveal a pattern: engagement is often high in the vicinity of campus, takes a dip in the middle distances (100 to 250 km out, for instance), then increases markedly at the farthest distances. Not all schools follow the same pattern, but this trend is typical. If engagement follows a curve based on distance, then giving probably does, too.

How can you figure this out? And of what use would this information be?

To start with the first question: In order to analyze ‘distance from campus’ we need some way to append a figure for ‘distance’ to every individual in our data set, so that we can see how that figure relates to giving. By ‘distance’ I mean direct distance, “as the crow flies,” not driving distance. So we’re talking about measuring from one point on the globe to another; the first point is the location of the alum’s primary address, and the second point is your campus.

Points on the globe are expressed in degrees of latitude and longitude. Once you’ve figured out everyone’s latitude and longitude (including that of your own campus), it’s possible to calculate the distance between the two points.
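For readers who want to try the calculation themselves, one common way to get an "as the crow flies" distance from two latitude/longitude pairs is the haversine formula. The post does not say which formula Peng's script used, so treat this as a minimal sketch under the assumption of a spherical Earth; the function name and the radius constant are illustrative choices.

```python
import math

# Mean Earth radius in kilometres. Assumption: the Earth is treated as a
# sphere, which is accurate enough for this kind of analysis.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # min() guards against rounding pushing the square root slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

Calling `haversine_km(campus_lat, campus_lon, alum_lat, alum_lon)` for each geocoded address would give the distance column described in this post.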

Here’s where I had to enlist help. If you work at a larger institution, you may be lucky enough to have academic researchers or even staff members working in the field of mapping and GIS (geographical information systems). Our department happens to have a web developer and programmer named Peng Li, and I was vaguely aware that he was working with geographic information. So I asked him if he could help me with a little experiment; if I sent him a batch of North American alumni addresses, could he geocode them for me and send me back distance information? (“Geocode” is the fancy term for determining location data — i.e., latitude and longitude — from other geographic data such as addresses at the street level.)

Peng told me to send him a sample, so I randomly selected about 1,500 American and Canadian addresses. The address data consisted of an Excel file containing nothing but Banner ID and Street, City, State/Prov, and Postcode/ZIP. I included a column for Country, as I suspected Canadian and U.S. addresses might be geocoded differently.

I expected that many addresses wouldn’t geocode properly and would be rejected (post office boxes and so on), but the file I received back had 100% of the addresses geocoded. When I sorted by distance, alumni from just down the street (0.21 km away) were at the top of the list, and alumni out in Alaska (5,626 km away) were at the bottom.

Being a programmer, and just helpful in general, Peng went to the trouble of writing a PHP script to handle the data. The script accessed the Google geocoding service, sent the address data to Google one row at a time, and created an output file with fields for latitude, longitude and distance. The Google service has a limit of 2,500 addresses every 24 hours, so Peng also built in the ability to access the Yahoo service, which has a limit of 5,000 per day. Now, whenever I want, I can use Peng’s uploader to send batches of addresses to either service. Batch-geocoding every alumnus and alumna is going to take some time, though: Peng advised me that every input data file should contain fewer than 1,000 records, to avoid long waits and server errors.
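The batching constraints above (files of fewer than 1,000 records, and a per-day cap that depends on the service) can be planned ahead of time. The following is only a sketch of that bookkeeping, not Peng's PHP script; `split_into_batches` and `plan_uploads` are hypothetical helper names, and no geocoding service is actually called.

```python
def split_into_batches(rows, batch_size=999):
    """Split rows into consecutive batches of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


def plan_uploads(rows, batch_size=999, daily_limit=2500):
    """Group batches into days so that no day's total exceeds daily_limit."""
    if batch_size > daily_limit:
        raise ValueError("a single batch must fit within the daily limit")
    days, today, used = [], [], 0
    for batch in split_into_batches(rows, batch_size):
        # Start a new day when this batch would push the total over the limit.
        if today and used + len(batch) > daily_limit:
            days.append(today)
            today, used = [], 0
        today.append(batch)
        used += len(batch)
    if today:
        days.append(today)
    return days
```

With the defaults, each day holds two full batches, because a third would exceed the 2,500 limit. Passing `daily_limit=5000` models the higher Yahoo cap mentioned above.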

First, I matched up the sample distance data with lifetime giving for each person. The scatterplot below shows lifetime giving on the Y axis and distance from campus on the X axis. The values for both giving and distance are log-transformed, otherwise the plot would look like a giant clump of points and there would be no visible correlation. (Both giving and distance have distributions that are skewed to very low values, with relatively few very large values — perfect for logarithmic transformation.)
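A log transformation of this kind might look like the sketch below. The post does not say how zero values were handled, and the logarithm of zero is undefined, so this sketch assumes an offset of 1 is added before taking the base-10 log. That keeps non-donors at exactly zero on the transformed scale. The function name and the offset are assumptions, not the settings used in DataDesk.

```python
import math


def log_transform(values, offset=1.0):
    """Return log10(value + offset) for each value.

    Assumption: offset=1 is used so that a value of 0 (a non-donor)
    maps to 0 instead of raising a math domain error.
    """
    if any(v + offset <= 0 for v in values):
        raise ValueError("every value plus offset must be positive")
    return [math.log10(v + offset) for v in values]
```

The same function can be applied to the distance column, where an address very close to campus would otherwise produce a large negative logarithm.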

Scatterplot: Lifetime giving and Distance from campus.

The points appear to be arranged in vertical bands, probably due to geography (the blank areas are distance bands that contain relatively few alumni). The solid bar of points at the bottom consists entirely of non-donors. Aside from that, given how many points overlap, it would be difficult to discern any pattern without a regression line, which we’ve added here. The gentle upward slope of the line indicates a relatively weak positive linear correlation.

Pearson’s r for the two variables is 0.038. That puts ‘log(distance)’ somewhere in the middle of the pack among predictor variables for strength of correlation with ‘log(giving)’. That’s not bad, but it’s not wildly exciting. Nor is it really worth going to a lot of trouble to acquire this data for predictive purposes, especially considering that other, more easily obtained variables related to geography might work just as well.
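For anyone without a stats package at hand, Pearson's r is straightforward to compute directly. This is a plain-Python sketch of the standard formula; `pearson_r` is an illustrative name, and the inputs would be the two log-transformed columns.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)
```

A result near zero means little linear association, but it says nothing about curved relationships, which is why the smoothers discussed next are worth a look.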

But what about non-linear patterns? The stats software I use, called DataDesk, allows me to lay another type of line over my data. “Smoothers” attempt to trace trends in data by following a calculated path among the dots in a scatterplot. This line-fitting exercise is useful for spotting trends or correlations in sequential data which would otherwise remain hidden in a haze of data points. The most common example of data in sequence is time-series data, but any data that is somehow ordered can be smoothed. “Distance from campus” is a good candidate.

There are different types of smoothers. Some try to trace every drop and uptick in the data, and produce very spiky lines. Others are less sensitive to spikes, and produce wiggly lines which are still faithful to the underlying data. The least sensitive type is called a “lowess” smoother. I won’t go into how it’s calculated; it’s enough to know that the result is a wavy curve that shows the overall pattern in data, and I found that it gave the most comprehensible picture of how distance relates to giving.
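For the curious, the core idea behind lowess can be sketched in a few lines: at each x value, fit a straight line to the nearest fraction of the points, weighting closer points more heavily, and take the fitted value at that x. This is a simplified illustration, not DataDesk's implementation. It assumes a tricube weight function and omits the robustness iterations that full lowess uses to downweight outliers. The `frac` parameter name is an assumption.

```python
import math


def tricube(u):
    """Tricube kernel: weight 1 at u = 0, falling smoothly to 0 at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def lowess(xs, ys, frac=0.5):
    """Simplified lowess: one pass of locally weighted linear regression."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    if not 0 < frac <= 1:
        raise ValueError("frac must be in (0, 1]")
    k = max(2, math.ceil(frac * n))  # number of neighbours in each local fit
    fitted = []
    for x0 in xs:
        dists = [abs(x - x0) for x in xs]
        h = sorted(dists)[k - 1]  # bandwidth: distance to the k-th nearest point
        if h > 0:
            w = [tricube(d / h) for d in dists]
        else:
            # All k nearest points share the same x; use only those points.
            w = [1.0 if d == 0 else 0.0 for d in dists]
        sw = sum(w)
        mean_x = sum(wi * x for wi, x in zip(w, xs)) / sw
        mean_y = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mean_x) * (y - mean_y)
                  for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0  # fall back to a weighted mean
        fitted.append(mean_y + slope * (x0 - mean_x))
    return fitted
```

A larger `frac` uses more neighbours per fit and gives a gentler curve; a smaller one follows local bumps more closely. This version recomputes all distances for every point, so it is fine for a sample of about 1,500 records but slow for very large files.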

Scatterplot with lowess line (smoother) added.

The effect still seems rather subtle, but this looks like what I was expecting: variations in giving with distance from campus. As with measures of alumni engagement at other schools, giving seems to start at a higher level near campus, dip in the middle distances, and then rise again at farther distances (with a leveling-off after that).

What about other types of smoothers? DataDesk offers three: lowess, trewess and median smoothers. Here is the scatterplot with a trewess line added.

Scatterplot with trewess smoother.

This smoother is more highly influenced by extreme points in the data, so it’s not as gentle as the lowess line, and there are dramatic spikes wherever a single point exerts undue influence. This is partly a result of having a very small sample (leading to “thin spots” in the data), but in general it doesn’t seem that this smoother is very useful, except perhaps for identifying certain cities or regions that have elevated levels of giving.

Let’s return to the lowess smooth trace for a moment. In the scatterplot below, I have used a selection tool in DataDesk to highlight the alumni living in two bands of distances from campus who seem to be associated with higher levels of giving. I’ve highlighted those points in green.


At this point, I’m not even sure where these people live. All I know is that these 650 alumni, as a group, seem to have elevated giving. “Seem” is the operative word, because when I isolate these alumni and actually compare their average lifetime giving to that of everyone else, there is no difference whatsoever. The effect we thought we saw, subtle as it was, vanishes when we look at the numbers. This is probably because I’ve selected too many points, or the wrong points, to isolate the true givers.

We can look at it another way, though. The numbers for average lifetime giving that I mentioned above include the ‘zeroes’ (alumni who gave nothing). What if we looked just at participation, instead of dollars given? When I code everyone as either a donor (1) or a non-donor (0), and put that variable on the Y axis of the scatterplot, the smooth trace is still suggesting the same idea, that alumni in a certain distance band are better donors:

Scatterplot: 'Is donor' and Distance.

When I look at the participation numbers, I see that 36% of alumni in the “green zone” (the prominent hump in the lowess line) are donors, compared with 31% of all other alumni. Again, this is a rather subtle difference and not something I’m going to spend a great deal of time coding variables for.
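The comparison above, coding each person as donor or non-donor and comparing participation inside and outside the highlighted distance bands, can be reproduced without a point-and-click selection tool. This is a sketch only: the band boundaries are whatever you read off your own smooth trace, and the function names and record layout are hypothetical.

```python
def participation_rate(lifetime_giving):
    """Share of people with lifetime giving above zero."""
    if not lifetime_giving:
        raise ValueError("cannot compute a rate for an empty group")
    donors = sum(1 for amount in lifetime_giving if amount > 0)
    return donors / len(lifetime_giving)


def compare_bands(records, bands):
    """Compare participation inside versus outside the given distance bands.

    records: iterable of (distance_km, lifetime_giving) pairs
    bands:   list of (low_km, high_km) ranges, inclusive at both ends
    Returns (rate_inside, rate_outside).
    """
    def in_any_band(distance):
        return any(low <= distance <= high for low, high in bands)

    inside = [giving for distance, giving in records if in_any_band(distance)]
    outside = [giving for distance, giving in records if not in_any_band(distance)]
    return participation_rate(inside), participation_rate(outside)
```

Because the bands are explicit numbers, the same comparison can be rerun on a larger sample to see whether a gap like 36% versus 31% holds up.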

(Interestingly, Peng Li himself conducted a Google Earth-related project that found that our home city has the highest donor rate among all cities in North America that have a concentration of our alumni. Most other cities have donor rates that are at about the same level, regardless of distance.)

A few conclusions, observations, and ideas for further study:

• Saying that giving always goes up markedly with distance from campus is probably overstating the case. (Sorry, Jason Fisher.)
• Certain cities or regions might be characterized by elevated levels of giving, and these might show up on a geocode scatterplot — but there are simpler, more direct ways to isolate these geographic pockets of generous alumni.
• Trends are probably highly influenced by single high-value donors in areas where few alumni live. Larger sample sizes would help even things out.
• Considering differences in philanthropic culture between Canada and the U.S., it might be wise to analyze each nation separately.

I’m not aware of any fundraisers making use of GIS data. Although our experiment did not really lead to any predictive breakthroughs, I think this area has great potential. This is “cool data”. Do you have a GIS project in the works? I’d like to hear about it.

1. Nicely explained, Kevin! I love working with GIS data and regularly use ArcDesktop to create distribution maps when I do descriptive stats of prospects for different units on campus. As for modeling, I do sometimes use distance from campus as a variable. It may not be that predictive for overall giving to the university (as you show above), but it can be a good variable for giving to certain units: our museums and performing arts center, for example, since they so clearly serve the local community directly. I can’t recall for sure off the top of my head, but I think it was also somewhat predictive for our affiliated hospital as well. Have you tried parsing out distance as a variable for specific initiatives rather than just general giving?

Comment by Audrey Geoffroy — 17 June 2010 @ 10:59 am

• Hi Audrey,

Interesting that you mention ArcDesktop. Our university has a site license for the software, so I recently downloaded it in hopes of seeing what I could get out of it. It’s a full-featured package for sure, so in all the complexity I never made any headway (not that I tried very hard).

No, I have not yet tried distance as a predictor of giving for specific designations/faculties/initiatives. But it’s a good idea. I can think of some areas of interest that might be very sensitive to proximity — support for Athletics, for example. I would expect that to be very much tilted in favour of nearness to campus.

Thanks!

Comment by kevinmacdonell — 17 June 2010 @ 11:39 am

2. I would love to have every address in my database geocoded, but we have some privacy issues with exposing our data to Google, Yahoo!, etc.

That said, I’ve found a way to do a sort of poor man’s distance calculation “in-house.” Basically, you obtain a table of ZIP codes with the latitude and longitude of the center of the ZIP area. The Census website has this data for free, for example. Then you just match that up with your address file by ZIP.

This won’t get you address-level accuracy of course. But it will get you within a few miles, and that’s good enough for this type of analysis.

– Jeff

Comment by Jeff Jetton — 17 June 2010 @ 1:55 pm
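The ZIP-centroid matching described in the comment above might be sketched as follows. The centroid table here is a tiny stand-in with rough, illustrative coordinates, not values copied from any Census file, and the function names are hypothetical. The normalisation step handles two common snags: ZIP+4 codes, and leading zeros lost when a spreadsheet stores ZIPs as numbers.

```python
def normalize_zip(raw):
    """Return a 5-digit ZIP string, or None if raw does not look like a ZIP."""
    text = str(raw).strip().split("-")[0]  # drop any "+4" suffix
    if not text.isdigit() or len(text) > 5:
        return None
    return text.zfill(5)  # restore leading zeros dropped by a spreadsheet


def attach_centroids(rows, centroids):
    """Match (alum_id, zip) rows against a {zip5: (lat, lon)} lookup table.

    Returns (matched, unmatched), where matched holds
    (alum_id, zip5, lat, lon) tuples and unmatched holds alum_ids.
    """
    matched, unmatched = [], []
    for alum_id, raw_zip in rows:
        zip5 = normalize_zip(raw_zip)
        if zip5 in centroids:
            lat, lon = centroids[zip5]
            matched.append((alum_id, zip5, lat, lon))
        else:
            unmatched.append(alum_id)
    return matched, unmatched


# Illustrative stand-in for a real centroid table (approximate values only).
EXAMPLE_CENTROIDS = {
    "02138": (42.38, -71.13),
    "99501": (61.22, -149.86),
}
```

Rows that fail to match, such as non-US postal codes, come back in the `unmatched` list so they can be handled separately.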

• That’s the same thing I did to calculate distance in the beginning, Jeff (we don’t upload to Google et al either, for the same reason you don’t). Eventually, we did get address level geo-codes from a vendor, but I’m not sure who it was since I wasn’t involved in that particular screening. Like you said – not a real difference between the two for this type of analysis.

Comment by Audrey Geoffroy — 17 June 2010 @ 8:24 pm

• That’s great – address-level accuracy is not essential, so if the bulk of your addresses are in the US, this should work perfectly. Here in Canada we don’t have access to anything similar for postal codes, that I’m aware of. There is also so much demographic data available by ZIP. Statistics Canada is supposedly one of the best stats-gathering organizations in the world, but I’ve never been able to access or render usable ANY StatsCan data, because census zones don’t bear much or any relation to any conventional geographic information that we might have in our database (such as postal code).

Comment by kevinmacdonell — 18 June 2010 @ 7:07 am