CoolData blog

17 December 2010

Chart frequency of words and terms from books with Google

Filed under: Coolness, Free stuff, Fun — kevinmacdonell @ 11:46 am

Here’s another interesting bauble from the nerds at Google. The Books Ngram Viewer allows you to plot the frequency of words and phrases that appeared in books published in the past few hundred years. Google estimates they’ve scanned and OCR’d more than 10 percent of all the books ever published, and this plotter is based on a sample of that data.

This “most excellent time-wasting tool” was blogged about by Alexis Madrigal, a senior editor for TheAtlantic.com, in his post, The Decline of Man (as a Word), in which he shows how the word “man” has fared against “woman”. (Not well.) As Madrigal observes, this may not serve a legitimate research purpose, but it sure is fun.

Here’s a sample. I’ve searched for the term “database”, and set the years to search as 1950 to 2008. For each year, the y-axis shows “database” as a percentage of all the terms in Google’s sample of books written in English. As you can see, the word didn’t emerge in published sources before the early 1970s. (Click image for full size.)

The tool also allows you to plot the progress of one term against another. If you plot “database” against “data base”, you’ll see that the two-word term enjoyed a short life before the single word took over. I’ve been interested in the use of the word “gift” instead of “donation,” but the plot of those two words isn’t very informative due, I guess, to the many connotations of the word “gift.” Instead I plotted “charitable gift” and “charitable donation” to put the words in context, and came up with this chart. The concept of giving seems to have had quite a heyday up until around 1835, and “donation” was firmly in the lead. By 1880, though, it was all about the gift. (Click image for full size.)

That got me thinking about how well “philanthropy” has done through the years. Mentions before 1750 are rare, so I plotted from then to the present, and once again the first half of the 19th century seems to have been relatively more preoccupied with the idea than later on. (Although, of course, who knows what data this is really based on. As I said, it’s fun, but I wouldn’t want to base a thesis on it without knowing more about the underlying data.)

Hmm – this IS fun. What if we plot poverty vs. religion vs. education? This doesn’t tell us what people were giving to, but it does give a glimpse into what they were writing about. “Poverty” has stayed relatively constant since 1750, but look at how “religion” has declined as “education” has risen. One line crosses the other right at 1909. Also interesting is that the trend started reversing direction about 10 years ago.

And finally, this chart plots “data mining” and two variations of “fundraising”. Data mining takes off as a published term in the early 1990s, and the term “fund raising” has merged into the single word, “fundraising.”

All sorts of fun. Try some for yourself! I’d be interested in hearing about any cool combos you come up with that relate to analytics and/or fundraising.

9 July 2010

How to infer age, when all you have is a name

Filed under: Coolness, External data, Non-university settings, Predictor variables — kevinmacdonell @ 6:02 am

I rarely post on a Friday, let alone a Friday in the middle of summer, but today’s cool idea is somewhat half-baked. Its very flakiness suits the day and the weather. Actually, I think it has potential, but I’m interested to know what others think.

For those of us in higher-ed fundraising, ‘age’ or ‘class year’ is a key predictor variable. Not everyone has this information in their databases, however. What if you could sort of impute a “best guess” age, based on a piece of data that you do have: First name?

Names go in and out of fashion. You may have played around with this cool tool for visualizing baby-name trends. My own first name, Kevin, peaked in popularity in the 1970s and has been on a downward slide ever since (chart here). I was born in 1969, so that’s pretty close. My father’s name, Leo, has not been popular since the 1920s (he was born in 1930), but is having a slight comeback in recent years (chart here).

As for female names, my mother’s name, Yvonne, never ranked in the top 1,000 in any time period covered by this visualization tool, so I’ll use my niece’s name: Katelyn. She was born in 2005. This chart shows that two common spellings of her name peaked around that year. (The axis labeling is a bit wonky — you’ll have to hover your cursor over the display to get a good read on the timing of the peak.)

You can’t look up every first name one by one, obviously, so you’ll need a data set from another source that relates relative frequencies of names with age data. That sort of thing might be available in census data. But knowing somebody with access to a higher-ed database might be the easiest way.

I’ve performed a query on our database, pulling just three fields: ID (to ensure I have unique records), First Name, and Age — for more than 87,000 alumni. (Other databases will have only Class Year — we’re fortunate in that we’ve got birth dates for nearly every living alum.) Here are a few sample rows, with ID number blanked out:

From here, it’s a pretty simple matter to copy the data into stats software (or Excel) to compute counts and median ages for each first name. Amazingly, just six first names account for 10% of all living, contactable alumni! (In order: John, David, Michael, Robert, James, and Jennifer.)

On the other hand, a lot of first names are unique in the database, or nearly so. To simplify things a bit, I calculated median ages only for names represented five or more times in the database. These 1,380 first names capture the vast majority of alumni.
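
For anyone who wants to reproduce that step outside of stats software, here is a minimal sketch in Python with pandas; the file name and column headers (ID, FirstName, Age) are hypothetical placeholders for whatever your own query returns.

# Minimal sketch: count and median age for each first name, keeping only
# names that occur five or more times. File and column names are hypothetical.
import pandas as pd
alumni = pd.read_csv("alumni_first_names.csv")   # columns: ID, FirstName, Age
name_stats = (
    alumni.groupby("FirstName")["Age"]
          .agg(count="count", median_age="median")
          .reset_index()
)
common = name_stats[name_stats["count"] >= 5]
print(common.sort_values("median_age", ascending=False).head(10))  # the "oldest" names
print(common.sort_values("median_age").head(10))                   # the "youngest" names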

The ten “oldest” names in the database are listed in the chart below, in descending order by median age. Have a look at these venerable handles. Of these, only Max has staged a rebound in recent years (according to the Baby Names visualizer).

And here are the ten “youngest names,” in ascending order by median age. It’s an interesting coincidence that the very youngest name is Katelyn — my five-year-old niece. One or two (such as Jake) were popular many years ago, and at least one has flipped gender from male to female (Whitney). Most of the others are new on the scene.

The real test is, do these median ages actually provide reasonable estimates of age for people who aren’t in the database?

I’m not in the database (as an alum). There are 371 Kevins in the database, and their median age is 43. I turned 41 in May, so that’s very good.

My father is also not an alum. The 26 Leos in the database have a median age of 50, which is unfortunately 30 years too young. Let’s call that one a ‘miss’.

My mother’s ‘predicted’ age is off by half that (15 years), which isn’t too bad.

Here’s how my three siblings’ names fare: Angela (predicted 36, actual 39 — very good), Paul (predicted 48, actual 38 — fair), and Francis (predicted 60, actual 36 — poor). Clearly there’s an issue with Francis, which according to the Baby Names chart tool was popular many decades ago but not when my brother was named. In other words, results for individuals may vary!

So let’s say you’re a non-profit without access to age data for your database constituents. How does this help you? Well it doesn’t — not directly. You will need to find a data partner at a university who will prepare a file for you, just as I’ve done above. When you import the data into your model, you can match up records by first name and voila, you’ve got a variable that gives you a rough estimate of age. (Sometimes very rough — but it’s better than nothing.)
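
To make the matching step concrete, here is a hedged sketch of the merge, again in Python with pandas; the file layouts and column names are assumptions, and unmatched (rare) names simply end up with a missing value that your model will need to handle.

# Sketch: attach the university-supplied median ages to a constituent file by first name.
import pandas as pd
constituents = pd.read_csv("constituents.csv")            # must include a FirstName column
name_ages = pd.read_csv("median_age_by_first_name.csv")   # columns: FirstName, median_age
constituents = constituents.merge(name_ages, on="FirstName", how="left")
constituents = constituents.rename(columns={"median_age": "inferred_age"})
# Rows whose first name wasn't in the lookup get NaN; impute or flag as you see fit.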

This is only an idea of mine. I don’t know if anyone has actually done this, so I’d be interested to hear from others. Here are a few additional thoughts:

  • There shouldn’t be any privacy concern — all you want is a list of first names and their median ages, NOT IDs or last names — but be sure to get all necessary approvals.
  • To anticipate your question, no, I won’t provide you my own file. I think you’d be much better off getting a names file from a university in your own city or region, which will provide a more accurate reflection of the ethnic flavour of your constituency.
  • I used “First name”, but of course universities collect first, middle and last names, and the formal first name might not be the preferred moniker. If the university database has a “Preferred first name” field that is fully populated, that might be a better option for matching with your first-name field.
  • Again, there might be more accessible sources of name- or age-related data out there. This idea just sounded fun to try!

17 June 2010

Is distance from campus correlated with giving?

Filed under: Alumni, Coolness, Predictor variables — kevinmacdonell @ 6:26 am

I’ve long been intrigued by the idea that there might be a correlation between distance from campus and giving to the university, as some people insist there is. And I’m doubly intrigued by the possibility that the correlation is not linear, but roughly defined by some kind of curved function. I’ve had no easy way to figure this out — until I enlisted the help of someone smarter than me.

The idea that there’s an association can be found in many sources. Phonathon expert and consultant Jason Fisher asserts that, possibly without exception, alumni participation in the annual fund increases with distance from campus (link). As far back as 1992, Wesley Lindahl and Christopher Winship reported that three studies showed no predictive effect of distance from the school, one indicated that living farther away was a predictor, and one that living closer was a predictor (link).

The notion that the correlation may be curvilinear is suggested by something Ray Satterthwaite of Engagement Analysis Inc. has told me. His extensive studies of alumni engagement reveal a pattern: engagement is often high in the vicinity of campus, takes a dip in the middle distances (100 to 250 km out, for instance), then increases markedly at the farthest distances. Not all schools follow the same pattern, but this trend is typical. If engagement follows a curve based on distance, then giving probably does, too.

How can you figure this out? And of what use would this information be?

To start with the first question: In order to analyze ‘distance from campus’ we need some way to append a figure for ‘distance’ to every individual in our data set, in order to see how that figure relates to giving. By ‘distance’ I mean direct or “as the crow flies,” not by driving distance. So we’re talking about measuring from one point on the globe to another; the first point will be where the alum’s primary address is located, and the second point is your campus.

Points on the globe are expressed in degrees of latitude and longitude. Once you’ve figured out everyone’s latitude and longitude (including that of your own campus), it’s possible to calculate the distance between the two points.
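
The usual way to turn two latitude/longitude pairs into an “as the crow flies” distance is the haversine formula. Here is a small sketch; the campus coordinates are made-up placeholders.

# Great-circle distance between two points, in kilometres, via the haversine formula.
from math import radians, sin, cos, asin, sqrt
def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))   # 6371 km is the Earth's mean radius
CAMPUS_LAT, CAMPUS_LON = 44.64, -63.58  # hypothetical campus location
print(haversine_km(CAMPUS_LAT, CAMPUS_LON, 61.22, -149.90))  # distance to an Alaska address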

Here’s where I had to enlist help. If you work at a larger institution, you may be lucky enough to have academic researchers or even staff members working in the field of mapping and GIS (geographical information systems). Our department happens to have a web developer and programmer named Peng Li, and I was vaguely aware that he was working with geographic information. So I asked him if he could help me with a little experiment; if I sent him a batch of North American alumni addresses, could he geocode them for me and send me back distance information? (“Geocode” is the fancy term for determining location data — i.e., latitude and longitude — from other geographic data such as addresses at the street level.)

Peng told me to send him a sample, so I randomly selected about 1,500 American and Canadian addresses. The address data consisted of an Excel file containing nothing but Banner ID and Street, City, State/Prov, and Postcode/ZIP. I included a column for Country, as I suspected Canadian and U.S. addresses might be geocoded differently.

I expected that many addresses wouldn’t geocode properly and would be rejected (post office boxes and so on), but the file I received back had 100% of the addresses geocoded. When I sorted by distance, alumni from just down the street (0.21 km away) were at the top of the list, and alumni out in Alaska (5,626 km away) were at the bottom.

Being a programmer, and just helpful in general, Peng went to the trouble of writing a PHP script to handle the data. The script accessed the Google geocoding service, sent the address data file to Google one row at a time, and created an output file with fields for latitude, longitude and distance. The Google service has a limit of 2,500 addresses every 24 hours, so Peng also built in the ability to access the Yahoo service, which has a limit of 5,000 per day. Now, whenever I want, I can use Peng’s uploader to send batches of addresses to either service. Batch-geocoding every alumnus and alumna is going to take some time, though: Peng advised me that every input data file should contain fewer than 1,000 records, to avoid long waits and server errors.
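
Peng’s PHP uploader isn’t reproduced here, but for readers who want to try something similar today, here is a rough Python sketch of the same idea against Google’s current Geocoding web service. Note the assumptions: the service now requires an API key and its quotas and terms have changed since this post was written, the Yahoo service has since been retired, and the input column names are placeholders.

# Rough sketch: geocode a batch of addresses, one row at a time, and write out lat/long.
import csv, time, requests
API_KEY = "YOUR_API_KEY_HERE"   # placeholder
URL = "https://maps.googleapis.com/maps/api/geocode/json"
with open("addresses.csv", newline="") as f_in, open("geocoded.csv", "w", newline="") as f_out:
    reader = csv.DictReader(f_in)   # assumed columns: ID, Street, City, StateProv, Postcode, Country
    writer = csv.DictWriter(f_out, fieldnames=["ID", "lat", "lon"])
    writer.writeheader()
    for row in reader:
        address = ", ".join(row[k] for k in ("Street", "City", "StateProv", "Postcode", "Country"))
        result = requests.get(URL, params={"address": address, "key": API_KEY}).json()
        if result.get("results"):
            loc = result["results"][0]["geometry"]["location"]
            writer.writerow({"ID": row["ID"], "lat": loc["lat"], "lon": loc["lng"]})
        time.sleep(0.1)   # stay well within the service's rate limits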

But what about that initial random sample of 1,500 test addresses?

First, I matched up the sample distance data with lifetime giving for each person. The scatterplot below shows lifetime giving on the Y axis and distance from campus on the X axis. The values for both giving and distance are log-transformed, otherwise the plot would look like a giant clump of points and there would be no visible correlation. (Both giving and distance have distributions that are skewed to very low values, with relatively few very large values — perfect for logarithmic transformation.)

Scatterplot: Lifetime giving and Distance from campus. Click for full size.

The points appear to be arranged in vertical bands – probably due to geography (the blank areas are distance bands that contain relatively few alumni). The solid bar of points along the bottom represents the non-donors. Aside from that, it would be difficult to discern any pattern, given how many points overlap, without adding a regression line, which we’ve done here. The gentle upward slope of the line indicates a relatively weak positive linear correlation.

Pearson’s r for the two variables is 0.038. That puts ‘log(distance)’ somewhere in the middle of the pack among predictor variables for strength of correlation with ‘log(giving)’. That’s not bad, but it’s not wildly exciting. Nor is it really worth going to a lot of trouble to acquire this data for predictive purposes, especially considering that other, more easily obtained variables related to geography might work just as well.
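
For readers who don’t use DataDesk, here is a hedged sketch of the same plot and correlation in Python (pandas, matplotlib and scipy). The column names are assumptions, and it uses log(1 + x) so that zero values for giving and distance stay defined; the original analysis doesn’t say exactly how zeros were transformed.

# Sketch: log-log scatterplot of lifetime giving vs. distance, with a fitted line and Pearson's r.
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from scipy import stats
df = pd.read_csv("alumni_distance_giving.csv")   # assumed columns: lifetime_giving, distance_km
x = np.log1p(df["distance_km"])
y = np.log1p(df["lifetime_giving"])
slope, intercept, r, p_value, std_err = stats.linregress(x, y)
print("Pearson's r:", round(r, 3))
plt.scatter(x, y, s=5, alpha=0.3)
plt.plot(x, intercept + slope * x, color="red")   # simple linear regression line
plt.xlabel("log(distance from campus)")
plt.ylabel("log(lifetime giving)")
plt.show()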

But what about non-linear patterns? The stats software I use, called DataDesk, allows me to lay another type of line over my data. “Smoothers” attempt to trace trends in data by following a calculated path among the dots in a scatterplot. This line-fitting exercise is useful for spotting trends or correlations in sequential data which would otherwise remain hidden in a haze of data points. The most common example of data in sequence is time-series data, but any data that is somehow ordered can be smoothed. “Distance from campus” is a good candidate.

There are different types of smoothers. Some try to trace every drop and uptick in the data, and produce very spiky lines. Others are less sensitive to spikes, and produce wiggly lines that are still faithful to the underlying data. The least sensitive type is called a “lowess” smoother. I won’t go into how it’s calculated; it’s enough to know that the result is a wavy curve that shows the overall pattern in the data, and I found that it gave the most comprehensible picture of how distance relates to giving.
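
DataDesk draws the lowess trace for you; if you are following along in Python instead, the statsmodels library has an equivalent. This sketch continues from the previous one and reuses its x and y arrays.

# Sketch: overlay a lowess smooth trace on the log-log scatterplot.
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess
smoothed = lowess(y, x, frac=0.3)   # frac controls how wavy or spiky the curve is
plt.scatter(x, y, s=5, alpha=0.3)
plt.plot(smoothed[:, 0], smoothed[:, 1], color="orange", linewidth=2)
plt.xlabel("log(distance from campus)")
plt.ylabel("log(lifetime giving)")
plt.show()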

Scatterplot with lowess line (smoother) added. Click for full size.

The effect still seems rather subtle, but this looks like what I was expecting: Variations in giving with distance from campus. As with measures of alumni engagement at other schools, giving seems to start at a higher level near the campus, then dips in the middle-distance before rising again at farther distances (with a leveling-off after that).

What about other types of smoothers? DataDesk offers three: lowess, trewess and median smoothers. Here is the scatterplot with a trewess line added.

Scatterplot with trewess smoother. Click for full size.

This smoother is more highly influenced by extreme points in the data, so it’s not as gentle as the lowess line, and there are dramatic spikes wherever a single point exerts undue influence. This is partly a result of having a very small sample (leading to “thin spots” in the data), but in general it doesn’t seem that this smoother is very useful, except perhaps for identifying certain cities or regions that have elevated levels of giving.

Let’s return to the lowess smooth trace for a moment. In the scatterplot below, I have used a selection tool in DataDesk to highlight the alumni living in two bands of distances from campus who seem to be associated with higher levels of giving. I’ve highlighted those points in green.

Click for full size.

At this point, I’m not even sure where these people live. All I know is that these 650 alumni, as a group, seem to have elevated giving. “Seem” is the operative word, because when I isolate these alumni and actually compare their average lifetime giving to that of everyone else, there is no difference whatsoever. The effect we thought we saw, subtle as it was, vanishes when we look at the numbers. This is probably because I’ve selected too many points, or the wrong points, to isolate the true givers.

We can look at it another way, though. The numbers for average lifetime giving that I mentioned above include the ‘zeroes’ (alumni who gave nothing). What if we looked just at participation, instead of dollars given? When I code everyone as either a donor (1) or a non-donor (0), and put that variable on the Y axis of the scatterplot, the smooth trace is still suggesting the same idea, that alumni in a certain distance band are better donors:

Scatterplot: 'Is donor' and Distance. Click for full size.

When I look at the participation numbers, I see that 36% of alumni in the “green zone” (the prominent hump in the lowess line) are donors, compared with 31% of all other alumni. Again, this is a rather subtle difference and not something I’m going to spend a great deal of time coding variables for.
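
The participation comparison is just a matter of coding a 0/1 donor flag and averaging it inside and outside the band. A sketch follows, continuing with the hypothetical data frame from the earlier sketches; the band boundaries shown are illustrative stand-ins, not the actual limits of the “green zone”.

# Sketch: donor participation inside vs. outside a chosen distance band.
df["is_donor"] = (df["lifetime_giving"] > 0).astype(int)
in_band = df["distance_km"].between(300, 1200)   # hypothetical band boundaries, in km
print("Participation in band:  ", df.loc[in_band, "is_donor"].mean())
print("Participation elsewhere:", df.loc[~in_band, "is_donor"].mean())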

(Interestingly, Peng Li himself conducted a Google Earth-related project that found that our home city has the highest donor rate among all cities in North America that have a concentration of our alumni. Most other cities have donor rates that are at about the same level, regardless of distance.)

A few conclusions, observations, and ideas for further study:

  • Saying that giving always goes up markedly with distance from campus is probably overstating the case. (Sorry, Jason Fisher.)
  • Certain cities or regions might be characterized by elevated levels of giving, and these might show up on a geocode scatterplot — but there are simpler, more direct ways to isolate these geographic pockets of generous alumni.
  • Trends are probably highly influenced by single high-value donors in areas where few alumni live. Larger sample sizes would help even things out.
  • Considering differences in philanthropic culture between Canada and the U.S., it might be wise to analyze each nation separately.

I’m not aware of any fundraisers making use of GIS data. Although our experiment did not really lead to any predictive breakthroughs, I think this area has great potential. This is “cool data”. Do you have a GIS project in the works? I’d like to hear about it.

18 May 2010

Multi-word terms in text mining

Filed under: Coolness, Free stuff, Text, Text mining — kevinmacdonell @ 11:14 am

Early this year I posted a tutorial on how to do some very basic text-mining of free-text comments. The method I outlined in my post focuses on single words, rather than word combinations or phrases. Here’s a quick way to extract the most common multi-word terms from a batch of text. (Followed by a postscript which offers another free and quick way to do the same thing.)

The National Centre for Text Mining (NaCTeM) bills itself as the “first publicly-funded text mining centre in the world.” The organization provides text mining services to the academic community in the United Kingdom. (NaCTeM is operated by the University of Manchester in close collaboration with the University of Tokyo.)

On their site you’ll find a free service called TerMine, which will extract terms and phrases for you. You can submit text for analysis (up to 2 MB) by pasting it into a window, uploading a file (.txt or .pdf) from your computer, or entering a URL (.html or .pdf). Then select the type of tagging system to use. ‘Tree Tagger version 3.1‘ is most suited to the generic text you’re most likely to be analyzing; the other option, ‘GENIA Tagger version 2.1’ is more appropriate for texts from the bio-medical sciences.

Click “Analyze” to get the results. Below is a sample: the 15 most common terms from the entire CoolData blog.

I don’t know what an “alumnus donor” is, but all of the rest stand up as valid terms. (Funny, I didn’t know that I say “tough job” a lot, but apparently I do.)

All this gives you is the most common terms from your entire set of comments. It doesn’t tell you which terms are linked to which individuals in your data set, or how they are correlated with your DV. That involves a few extra steps, which you can find in my original post on text mining. Have fun!
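
If you would rather not send your text to a web service at all, a very crude stand-in for this step is to count two- and three-word sequences yourself. The sketch below is not TerMine’s method (which relies on part-of-speech tagging and a term-scoring measure); it is just a raw frequency count, with a hypothetical input file name.

# Crude multi-word term extraction: the most frequent bigrams and trigrams in a text file.
import re
from collections import Counter
with open("comments.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())
def ngrams(tokens, n):
    return zip(*(tokens[i:] for i in range(n)))
counts = Counter(" ".join(gram) for n in (2, 3) for gram in ngrams(words, n))
for term, freq in counts.most_common(15):
    print(freq, term)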

POSTSCRIPT

P.S. — Another tool you can use to do the same thing as TerMine is called Primitive Word Counter. It’s a free download, and does not require you to upload your sensitive text files to the Net. I’ve just given it a try, and it does a great job of identifying frequently used words AND whole phrases.

(Click image for large view)

13 April 2010

Exploring your database relationships with NodeXL

Filed under: Coolness, Relationship mapping — kevinmacdonell @ 7:16 am

Guest Post by Jason Boley, Associate Director of Prospect Management and Tracking, Purdue University

NodeXL is a powerful social network analysis tool that is designed as a plug-in to Excel 2007. NodeXL was designed by a team of programmers at Microsoft Research and is now maintained by the Connected Action Consulting Group.  In a nutshell, the tool was designed to produce network graphs from a list of network usage data, and includes built-in tools for getting information from Twitter, Flickr, YouTube, etc. (Link to NodeXL site.)

But can it be useful for fundraising too?

If you look at NodeXL from a fundraising perspective, you immediately see the connection. NodeXL is built for uncovering deep connections in your social networks. That in itself is pretty useful, especially if your fundraising department is heavily involved in social media (and it probably will be at some point): the tool helps you discover who is having an impact on your social reputation. But let’s put that aside. Even the most basic of fundraising offices are constantly looking for new connections to and between donors. Who knows whom? Who can provide an introduction to Mr. X? Depending on the size and complexity of your database (and your staff), these connections are often easy to miss. Sure, we may note that Bob and Jim sit on the same board or work at the same company, but going a layer deeper than that is often problematic.

With that in mind, I began to explore NodeXL.

There are a couple of things that are appealing about NodeXL. First, it’s free. It downloads as a zip file and operates essentially as a macro plug-in for Excel 2007. But this is unlike any macro you’ve ever seen. The visual capabilities are stunning. Secondly, it is relatively straightforward. While it does have some extremely powerful advanced features, it’s relatively easy to get started.

The only information that NodeXL accepts is called an edge list. An edge list is essentially an Excel list with two columns. In column A you have the first value and in column B you have a second value which is related to the value in Column A. So a sample edge list might look as follows:

Bob        Jerry
Jerry      Tom
Tom       Bob
Alice      Jerry

These two columns of data are all that NodeXL accepts and needs to do network analysis. The rest of the work, calculating connections between these relationships, is done for you.
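
(NodeXL does all of this inside Excel. Purely to illustrate what an edge list gives you, here is the same toy example loaded into Python’s networkx library, which is a different tool than the one this post is about.)

# The toy edge list above, analyzed with networkx instead of NodeXL.
import networkx as nx
edges = [("Bob", "Jerry"), ("Jerry", "Tom"), ("Tom", "Bob"), ("Alice", "Jerry")]
G = nx.Graph(edges)
print(dict(G.degree()))                      # how many connections each person has
print(nx.shortest_path(G, "Alice", "Bob"))   # one path connecting two people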

I rarely think small, so my first inclination was to see if we could visualize all of the recorded board memberships in our Advance database. The first step was to create my edge list. This was very simple and took the following format:

I_Jason_Boley   O_Jason_Boley_Foundation
I_Bob_Smith       O_Bob_Smith_Foundation
I_Joe_Klein        O_Gates_Foundation

I think you can get the picture. The I and O appended to the front facilitate sorting (if you should wish to color-code them later, you can easily sort the organizations from the individuals). As you might expect, this query returned a couple of thousand connections, but I was pleased at how efficiently NodeXL handled them. The resulting graph (Harel-Koren Fast Multiscale format) was visually stunning.

(Click for full size.)

Stunning? Yes! Useful? I wasn’t sure yet. I knew that I could isolate an individual within this huge graph. The highlighted green line above, for example, is one of our major donors.

I decided to leave this dataset behind for a bit and move to more of a discovery project. We are very fortunate to have completed a full database screening a few months ago. I turned my eyes towards the data we received from our vendor that includes a table of people in our database with connections the vendor discovered to foundations (based on GuideStar data). I created an edge list from this data that resulted in 4,872 relationships. The resulting graph (Fruchterman-Reingold format) returned a huge blob.

(Click for full size.)

You can click around on a graph like this to isolate individual connections (or edges) and try to do some discovery that way, but in a graph as dense as this it is basically impossible. The other way to do discovery work is to pose the question from a certain perspective. For instance, I highlighted in the edge list two particular, closely related foundations. I then used the subgroup option in NodeXL to select all of the vertices four layers deep, meaning that rather than looking only at individuals’ direct relationships to the organization, we dig a few layers deeper. In the graph below you can see the highlighted connections picked out from the larger mass.

(Click for full size.)

I then extracted these selected vertices into a brand new NodeXL sheet. After adding some labels, you can reduce the useful information to just a few connections.

(Click for full size.)

What can we tell from this graph? It is likely that the connection between Al Winner and Steven Cook was already known and well documented. But the real advantage here is seeing the tangential connections we may have missed. Steven Cook has a potential relationship with Charles Talor. Indeed, in our database Steven Cook is a well known alumnus, while Charles Talor is someone we would like to know. So a potential for establishing a relationship is provided. The same holds true for potential connections to James Jones and Jean Smith. While some of these connections may be known, the advantages to visualizing them are great. Potential connections jump from the page.
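
(For readers experimenting in networkx rather than NodeXL, the “four layers deep” subgroup selection and the extraction into a new sheet correspond roughly to taking an ego graph of radius four around a foundation of interest and saving it as its own edge list. A hedged sketch, with made-up file and node names:)

# Sketch: pull out everything within four hops of a foundation of interest.
import networkx as nx
G = nx.read_edgelist("foundation_edges.txt", delimiter="\t")   # hypothetical edge-list file
subgroup = nx.ego_graph(G, "O_Example_Foundation", radius=4)   # made-up node name
print(subgroup.number_of_nodes(), "people and organizations within four layers")
nx.write_edgelist(subgroup, "subgroup_edges.txt", delimiter="\t")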

What I have discussed only scratches the surface of this advanced tool. I have very little knowledge of network graphing beyond what I’ve read online, yet with NodeXL I have been able to produce some simple but powerful graphs. Relationships as they are defined in fundraising databases are very different from those that can be extracted from complicated social networking data, and yet there are similarities. As I delve deeper into this tool, it is forcing me to rethink how relationships are defined in our database, to think about existing data we have that might be useful in a visual format, and to think about a future where we will certainly want to supplement our traditional alumni data with information from social networks.

(Jason Boley is the Associate Director of Prospect Management and Tracking at Purdue where he manages prospect information and focuses on data analytics. Jason has over ten years of experience working with fundraising databases and has presented nationally on the topics of database reporting and prospect management.)


12 April 2010

New way to look at words

Filed under: Coolness, Data visualization, Free stuff, Text — kevinmacdonell @ 8:13 am

Word clouds aren’t new, but there’s a new online app for creating them that is worth checking out. Tagxedo allows you to create your clouds using some versatile tools for shaping the appearance of the cloud, which you can then easily save as a .jpg or .png.

This comes to me via a post on the LoveStats blog, where Annie Pettit has posted a couple of her own creations – one based on the text of her resume, and one on all the words in her blog.

I wrote about word clouds back in December (Quick and easy visuals of large text files), and about the well-known and very cool tool known as Wordle, the creation of Jonathan Feinberg. Tagxedo does the same thing but works a little differently. Powered by Microsoft’s Silverlight browser plug-in, Tagxedo offers a nifty interface for importing your text (or URL), finely controlling your word choice, and playing with the font, colour, theme and layout of your cloud, including being able to choose a shape. The choice of shapes is rather limited – hearts, stars, rectangles and ovals, mostly. Here’s a star-shaped word cloud based on the 150 most common words on this blog:

(Click for full size image.)

My interest in word clouds is related to visualization of data – in this context, conveying the gist of a mass of text by giving prominence to the most common significant words. For example, last year I used Wordles to visualize tens of thousands of words entered as free-text comments in a survey of alumni. It’s no substitute for real analysis, but it does make a cool presentation slide!
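
Wordle and Tagxedo are point-and-click web tools, but if you would rather generate a cloud from your survey comments in code, the open-source Python wordcloud package does roughly the same thing (minus the fancy shapes shown here). A sketch, with a hypothetical input file:

# Sketch: build a word-cloud image from a file of free-text survey comments.
from wordcloud import WordCloud
with open("survey_comments.txt", encoding="utf-8") as f:
    text = f.read()
cloud = WordCloud(width=800, height=400, max_words=150, background_color="white")
cloud.generate(text)
cloud.to_file("survey_cloud.png")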

NOTE: Check in tomorrow for Jason Boley’s amazing work with NodeXL for visualizing prospect connections in your data.
