CoolData blog

30 July 2013

Getting bitten by Python

When I was first learning to build predictive models, preparing the data was part of the adventure. In time, though, many operations on the data became standard instead of exploratory. Eventually they became simply repetitive and tedious. When any task becomes repetitive, I think of ways to automate it. Given that data prep makes up 80 percent of the work of building a model (according to some authors), the benefits of automation are obvious.

I can think of only two ways to replicate the manual operations you need to perform on a large data set to make it ready for modelling: Use software specially designed for the task, or code your own data-handling scripts. I am lazy and drawn to software solutions that make hard things easy, and I’m not a programmer. Yet I have veered away from a ready-made software solution to pursue an interest in the scripting language called Python, and in particular the Python code library called pandas, written specifically for working with data.

Maybe it’s because Python is open-source and free, or because it is powerful, or because it is flexible and widely adaptable to multiple uses on the job. I don’t know. But for the past few months I’ve been obsessed with learning to use it, and that’s what I’d like to talk about today.

I’m guessing very few CoolData readers have experience writing scripts for handling data. I know some people who do most of their stats work in the R language or manipulate data in Excel using VBA. But the majority of readers probably consider themselves severely allergic to coding of any kind. I concede that it isn’t for everyone, but look: Just as we don’t need to be professional statisticians to use statistical tools to create value for the business, we don’t need to be computer scientists to write useful scripts that can free up large chunks of time we now spend on routine tasks that bore us.

(If you work with someone in IT or Advancement Services who pulls and reshapes your data for you, they might be especially interested in the idea of learning how to automate your requests. They might also be familiar with Python already.)

I should say here that my aim is not to automate predictive modelling itself. There are Python modules for modelling, too, from the venerable classics such as regression to the latest advanced techniques. But I’m not so much interested in them, not yet at least. Building predictive models is best done hands-on, guided by a human modeler’s expertise and domain knowledge. My main interest is in eliminating a big chunk of the standard rote work so that I can apply the freshest version of myself to the more interesting and creative elements of data exploration and model creation.

So what is Python (and more specifically, pandas) good for?

  • A script or program can execute a series of database queries and join the results in exactly the way you want, allowing you to build very complex structures and incorporate custom aggregations that might be harder to do using your existing querying/reporting tools. For example, let’s say you want to build a file of donors and include columns for date of first and last gift, amount of highest gift, total cash gifts for the past five fiscal years, and percentage of total giving devoted to student financial assistance. Unless IT has built some advanced views for you from the base tables in your database, many of these variables will require applying some calculations to the raw transactional data. I could certainly build a query to get the results for this modest example, but it would involve a few sub-queries and calculated fields. Multiply that by a hundred and you’ve got an idea of how complex a query you’d have to build to deliver a modelling-ready data set. In fact it may be technically impossible, or at least difficult, to build such a single massive query. In Python, however, you can build your data file up in an orderly series of steps. Adding, removing or editing those steps is not a big deal.
  • Python also makes it simple to read data from .csv and Excel files, and merge it painlessly with the data you’ve extracted from your database. This is important to me because not all of my modelling data comes from our database. I’ve got eight years of call centre data results by alumni ID, wealth-related census data by Canadian postal code, capacity data by American ZIP code, and other standalone data sets. Adding these variables to my file used to be a tedious, manual process. In Python, left-joining 20 columns of census data to a file of 100,000 alumni records using Postal Code as the join key takes a single line of code and executes faster than a knight can say “Ni!” (Inside Python joke.) A short sketch of this kind of merge appears just after this list.
  • Many other common operations also take only one or two lines of code, including conversion of categorical variables to 0/1 dummy variables, performing transformations and mathematical operations on variables, filling in or imputing missing data with constants or calculated values, pivoting data, and creating new variables from existing ones via concatenation (for strings) or math (for numbers).
  • With a script, you can also iterate over the rows of a data file and perform different operations based on conditional statements.
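
To make the census-data merge and a few of those one-liners concrete, here is a minimal pandas sketch. The file names, column names and join key below are invented for illustration; your own fields will certainly differ.

import pandas as pd

# Alumni records extracted from the database, one row per ID (hypothetical file)
alumni = pd.read_csv('alumni_extract.csv')

# Wealth-related census data keyed by postal code (hypothetical file)
census = pd.read_csv('census_by_postal_code.csv')

# Left-join the census columns onto the alumni file using Postal_Code as the key
alumni = pd.merge(alumni, census, how='left', on='Postal_Code')

# A few of the one- or two-line operations mentioned above:
alumni = pd.get_dummies(alumni, columns=['Degree_Type'])            # categorical -> 0/1 dummies
alumni['Age'] = alumni['Age'].fillna(alumni['Age'].median())        # impute missing ages
alumni['City_Region'] = alumni['City'] + '_' + alumni['Region']     # new variable by concatenation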

I’m not going to provide a Python tutorial today (although I’m tempted to do so in the future), but here is a sample line of code from a script, with a description of what it does. This doesn’t give you enough information to do anything useful, but you’ll at least see how compact and powerful the language is.

Skipping some necessary preliminaries, let’s say you’ve just used Python to query your Oracle database to read into memory a data set containing the variables ID, Constituent Category, Sex, and Age for all living constituent persons. (An operation that itself takes little more than two or three lines of code.) Obviously it depends on your database and code structure, but let’s say “Constituent Category” includes codes for such categories as Alumnus/na (ALUM), Non-degreed alumni (ALND), Parent (PRNT), Friend (FRND), Faculty (FCTY), Staff (STAF), and so on. And let’s further assume that a constituent can belong to multiple categories. Most people will have only one code, but it’s possible that a person can simultaneously be an alum, a parent, and a faculty member.
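
Purely as an illustration of those skipped preliminaries, the query step might look something like the sketch below, assuming the cx_Oracle driver; the connection string, view name and filter are invented, and the details will depend entirely on your own database.

import cx_Oracle
import pandas as pd

# Hypothetical connection string and query -- substitute your own
conn = cx_Oracle.connect('username/password@dbhost:1521/service_name')
query = """SELECT id, con_code, sex, age
           FROM constituent_view
           WHERE deceased_ind = 'N'"""

df = pd.read_sql(query, conn)   # read the result set straight into a DataFrame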

In our script, the data is read into a structure called a DataFrame (a tool provided by the pandas code library). This should sound familiar to users of R in particular. For the rest of us, a DataFrame is very much like a database table, with named columns and numbered (“indexed”) rows. Had we pasted the data into Excel instead, it might look like this:

[Image “pivot1”: the data pasted into Excel, with one row per ID and Con_Code combination]

Right away we see that William and Janet are represented by multiple rows because they have multiple constituent codes. This won’t work for predictive modelling, which requires that we have just one row per individual – otherwise certain individuals would carry more weight in the model than they should. You could say that multiple records for Janet mean that 60-year-old females are over-represented in the data. We could delete the extra rows, but we don’t want to do that, because we’d be throwing away information that is almost certainly predictive of our modelling target, e.g. likelihood to make a donation.

In order to keep this information while avoiding duplicate IDs, we need to pivot the data so that each category of Constituent Code (ALUM, PRNT, etc.) becomes its own column. The result we want would look like this in Excel:

[Image “pivot2”: the pivoted data in Excel, with one row per ID and a 0/1 column for each constituent code]

The Con_Code column is gone, and replaced with a series of columns, each a former category of Con_Code. In each column is either a 0 or 1, a “dummy variable” indicating whether an individual belongs to that constituency or not.

Getting the data from the first state to the final state requires just three lines of code in Python/pandas:

df = pd.merge(df, pd.crosstab(df.ID, df.Con_Code), how='left', left_on='ID', right_index=True)

df = df.drop(['Con_Code'], axis=1)

df = df.drop_duplicates()

This snippet of code may look invitingly simple or simply terrifying – it depends on your background. Whatever – it doesn’t matter, because my point is only that these three lines are very short, requiring very little typing, yet they elegantly handle a common data prep task that I have spent many hours performing manually.

Here’s a brief summary of what each line does:

Line 1: There’s a lot going on here … First, “df” is just the name of the DataFrame object. I could have called it anything. On the right-hand side, you see “pd” (which is shorthand for pandas, the module of code that is doing the work), then “crosstab,” (a function that performs the actual pivot). In the parentheses after pd.crosstab, we have specified the two columns to use in the pivot: df.ID is the data we want for the rows, and df.Con_Code is the column of categories that we want to expand into as many columns as there are categories. You don’t have to know in advance how many categories exist in your data, or what they are – Python just does it.

The pd.crosstab call creates a new table containing only ID and all the new columns. That entity (or “object”) is just sitting out there, invisible, in your computer’s memory. We need to join it back to our original data set so that it is reunited with Age, Sex and whatever other stuff you’ve got. That’s what “pd.merge” does. Again, “pd” is just referencing the pandas module that is providing the “merge” function. The operation is called “merge,” but it’s much the same thing as an SQL-type join, familiar to anyone who queries a database. The merge takes two inputs, our original DataFrame (“df”), and the result from the crosstab operation that I described above. The argument called “how” specifies that we want to perform the equivalent of a left-join. A couple of other arguments tell Python how to line the two up: left_on='ID' uses the ID column of our original data, and right_index=True uses the row labels of the crosstab result (which are also IDs).

The crosstab operation is enclosed within the merge operation. I could have separated these into multiple lines, which would have been less confusing, but my point is not to teach Python but to demonstrate how much you can accomplish with a trivial amount of typing. (Or copying-and-pasting, which works too!)

We’re not quite done, though. Our merged data is still full of duplicate IDs, because the Con_Code column is still present in our original data.

Line 2 deletes (“drops”) the entire column named Con_Code, and reassigns the altered DataFrame to “df” – essentially, replacing the original df with the new df created by the drop operation.

Now that Con_Code is gone, the “extra” rows are not just duplicates by ID, they are perfect duplicates across the entire row – there is nothing left to make two rows with the same ID unique. We are ready for the final step …

Line 3 deletes (or “drops”) every row that is a duplicate of a previous row.

Having accomplished this, another couple of lines near the end of the script (not shown) will write the data row by row into a new .csv file, which you can then import into your stats package of choice. If you had two dozen different constituent codes in your data, your new file will be wider by two dozen columns … all in the blink of an eye, without any need for Excel or any manual manipulation of the data.
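
Those final lines aren’t shown in the original script, but in pandas they can be as simple as this (the output file name is invented):

# Write the finished DataFrame out as a .csv for import into your stats package
df.to_csv('constituents_prepped.csv', index=False)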

Excel is perfectly capable of pivoting data as we see in the example, but for working with very large data sets and seamlessly merging the pivoted data back into the larger data file, I can’t think of a better tool than Python/pandas. The bigger the data set gets, the more need there is to stop working with it in tools that go to the extra work of DISPLAYING it. I suppose one of the beauties of Excel is that you can see the data as you are working on it. In fact, as I slowly built up my script, I repeatedly opened the .csv file in Excel just to have that visual inspection of the data to see that I was doing the right thing. But I inevitably reached the point at which the file was just too large for Excel to function smoothly. At 120,000 rows and 185 columns in a 90MB file, it was hardly Big Data – Excel could open the file no problem – but it was large enough that I wouldn’t want to do much filtering or messing with formulas.

On a quick first read, the code in the example above may seem impenetrable to a non-programmer (like me), but you don’t need to memorize a lot of functions and methods to write scripts in Python. Combing the Web for examples of what you want to do, using a lot of cut-and-paste, perhaps referring to a good book now and again – that’s all it takes, really.

That said, it does require time and patience. It took me many hours to cobble together my first script. I re-ran it a hundred times before I tracked down all the errors I made. I think it was worth it, though – every working piece of code is a step in the direction of saving untold hours. A script that works for one task often does not require much modification to work for another. (This cartoon says it all: Geeks and repetitive tasks.)

Beyond data preparation for predictive modelling, there are a number of directions I would like to go with Python, some of which I’ve made progress on already:

  • Merging data from multiple sources into data extract files for use in Tableau … With version 8.0 of the software comes the new Tableau API for building .tde files in Python. This was actually my first experiment with Python scripting. Using the TDE module and a combination of database queries and pandas DataFrames, you can achieve a high degree of automation for refreshing the most complex data sets behind your views and dashboards.
  • Exploring other modelling techniques besides my regular mainstay (regression) … I’ve long been intrigued by stuff such as neural networks, Random Forest, and so on, but I’ve been held back by a lack of time as well as some doubt that these techniques offer a significant improvement over what I’m doing now. Python gives ready access to many of these methods, allowing me to indulge my casual interest without investing a great deal of time. I am not a fan of the idea of automated modelling – the analyst should grasp what is going on in that black box. But I don’t see any harm in some quick-and-dirty experimentation, which could lead to solutions for problems I’m not even thinking of yet.
  • Taking advantage of APIs …. I’d like to try tapping into whatever social networking sites offer in the way of interfaces, and also programmatically access web services such as geocoding via Google.
  • Working with data sets that are too large for high-level applications such as Excel … I recently tried playing with two days’ worth of downloaded geocoded Twitter data. That’s MILLIONS of rows. You aren’t going to be using Excel for that.

I hope I’ve been able to transfer to you some of my enthusiasm for the power and potential of Python. I guess now you’ll be wondering how to get started. That’s not an easy question to answer. I could tell you how to download and install Python and an IDE (an integrated development environment, a user interface in which you can write, run, and debug your scripts), but beyond that, so much depends on what you want to do. Python has been extended in a great many directions – pandas for data analysis being just one of them.

However, it wouldn’t hurt to get a feel for how “core Python” works – that is, the central code base of the language along with its data types, object types, and basic operations such as “for” loops. Even before you bother installing anything, go to Codecademy.com and try a couple of the simple tutorials there.

For specific questions Google is your friend, but if you want a reference that covers all the basics in more or less plain English, I like “Learning Python” (4th Edition, but I see there’s a 5th Edition now) by Mark Lutz, published by O’Reilly. Another O’Reilly book, “Python for Data Analysis,” by Wes McKinney, describes how to crunch data with pandas and other related code libraries. (McKinney is the main author of the pandas library.)

I think readers new to programming (like me) will feel some frustration while learning to write their first scripts using any one book or resource. The Lutz book might seem too fine-grained in its survey of the basics for some readers, and McKinney is somewhat terse when offering examples of how various methods work. The problem is not with the books themselves – they’re wonderful. Consider that Python is used in web interfaces, robotics, database programming, gaming, financial markets, GIS, scientific programming, and probably every academic discipline that uses data – so core texts are perforce very general and abstract. (Think of grammar books for spoken languages.) It’s up to coders themselves to combine the basic building blocks in creative and powerful ways.

That said, after many, many hours spent hopping back and forth between these books, plus online tutorials and Python discussion forums – and just messing around on my own – I have figured out a few useful ways to accomplish some of the more common data preparation tasks that are specific to predictive modelling. Someday I would be happy to share – and, as always, to learn from the experience of others.

2 May 2013

New twists on inferring age from first name

Filed under: Analytics, Coolness, Data Desk, Fun — kevinmacdonell @ 6:14 am

Not quite three years ago I blogged about a technique for estimating the age of your database constituents when you don’t have any relevant data such as birth date or class year. It was based on the idea that many first names are typically “young” or “old.” I expanded on the topic in a followup post: Putting an age-guessing trick to the test. Until now, I’ve never had a reason to guess someone’s age — alumni data is pretty well supplied in that department. This very month, though, I have not one but two major modeling projects to work on that involve constituents with very little age data present. I’ve worked out a few improvements to the technique which I will share today.

First, here’s the gist of the basic idea. Picture two women, named Freda and Katelyn. Do you imagine one of them as older than the other? I’m guessing you do. From your own experience, you know that a lot of young women and girls are named Katelyn, and that few if any older women are. Even if you aren’t sure about Freda, you would probably guess she’s older. If you plug these names into babynamewizard.com, you’ll see that Freda was a very popular baby name in the early 1900s, but fell out of the Top 1000 list sometime in the 1980s. On the other hand, Katelyn didn’t enter the Top 1000 until the 1970s and is still popular.

To make use of this information you need to turn it into data. You need to acquire a lot of data on the frequency of first names and how young or old they tend to be. If you work for a university or other school, you’re probably in luck: You might have a lot of birth dates for your alumni or, failing that, you have class years which in most cases will be a good proxy for age. This will be the source you’ll use for guessing the age of everyone else in your database — friends, parents and other person constituents — who don’t have ages. If you have a donor database that contains no age data, you might be able to source age-by-first name data somewhere else.

Back to Freda and Katelyn … when I query our database I find that the average age of constituents named Freda is 69, while the average age for Katelyn is 25. For the purpose of building a model, for anyone named Freda without an age, I will just assume she is 69, and for anyone named Katelyn, 25. It’s as simple as creating a table with two columns (First name and Average age), and matching this to your data file via First Name. My table has more than 13,500 unique first names. Some of these are single initials, and not every person goes by their first name, but that doesn’t necessarily invalidate the average age associated with them.
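
I do this matching with my regular query tools, but for readers following along in Python, a minimal pandas sketch of the same idea might look like this (the file and column names are invented):

import pandas as pd

# Alumni records with known ages: build the lookup table of average age by first name
alum = pd.read_csv('alumni_with_ages.csv')
age_by_name = (alum.groupby('First_Name', as_index=False)['Age']
                   .mean()
                   .rename(columns={'Age': 'Avg_Age_by_FName'}))

# Match the lookup table to the modelling file on First Name
model_file = pd.read_csv('modelling_file.csv')
model_file = pd.merge(model_file, age_by_name, how='left', on='First_Name')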

I’ve tested this method, and it’s an improvement over plugging missing values with an all-database average or median age. For a data set that has no age data at all, it should provide new information that wasn’t there before — information that is probably correlated with behaviours such as giving.

Now here’s a new wrinkle.

In my first post on this subject, I noted that some of the youngest names in our database are “gender flips.” Some of the more recent popular names used to be associated with the opposite gender decades ago. This seems to be most prevalent with young female names: Ainslie, Isadore, Sydney, Shelly, Brooke. It’s harder to find examples going in the other direction, but there are a few, some of them perhaps having to do with differences in ethnic origin: Kori, Dian, Karen, Shaune, Mina, Marian. In my data I have close to 600 first names that belong to members of both sexes. When I calculate average age by First Name separately for each sex, some names end up with the exact same age for male and female. These names have an androgynous quality to them: Lyndsay, Riley, Jayme, Jesse, Jody. At the other extreme are the names that have definitely flipped gender, which I’ve already given examples of … one of the largest differences being for Ainslie. The average male named Ainslie is 54 years older than the average female of the same name. (In my data, that is.)

These differences suggest an improvement to our age-inferring method: Matching on not just First Name, but Sex as well. Although only 600 of my names are double-gendered, they include many popular names, so that they actually represent almost one-quarter of all constituents.

Now here’s another wrinkle.

When we’re dealing with constituents who aren’t alumni, we may be missing certain personal information such as Sex. If we plan to match on Sex as well as First Name, we’ve got a problem. If Name Prefix is present, we can infer Sex from whether it’s Mr., Ms., etc., but unless the person doing the data entry was having an off day, this shouldn’t be an avenue available to us — Sex should already be filled in. (If you know it’s “Mrs.,” then why not put in F for Sex?) For those records without a Sex recorded (or with a Sex of ‘N’), we need to make a guess. To do so, we return to our First Names query and the Sex data we do have.

In my list of 600 first names that are double-gendered, not many are actually androgynous. We have females named John and Peter, and we have males named Mary and Laura, but we all know that given any one person named John, chances are we’re talking about a male person. Mary is probably female. These may be coding errors or they may be genuine, but in any case we can use majority usage to help us decide. We’ll sometimes get it wrong — there are indeed boys named Sue — but if you have 7,000 Johns in your database and only five of them are female, then let’s assume (just for the convenience of data mining*) that all Johns are male.

So: Query your database to retrieve every first name that has a Sex code, and count up the instances of each. The default sex for each first name is decided by the highest count, male or female. To get a single variable for this, I subtract the number of females from the number of males for each first name. Since the result is positive for males and negative for females, I call it a “Maleness Score” — but you can do the reverse and call it a Femaleness Score if you wish! Results of zero are considered ties, or ‘N’.
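
Again, I do this with database queries, but here is a rough pandas sketch of the same counting, with invented file and column names:

import pandas as pd

# Constituents with a Sex code recorded (hypothetical extract)
alum = pd.read_csv('constituents_with_sex.csv')   # columns: ID, First_Name, Sex

# Count males and females for each first name
counts = (alum.groupby(['First_Name', 'Sex'])
              .size()
              .unstack(fill_value=0))             # one column per Sex code found in the data

# Maleness Score: positive = mostly male, negative = mostly female, zero = tie
counts['Maleness_Score'] = counts.get('M', 0) - counts.get('F', 0)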

At this point we’ve introduced a bit of circularity. For any person missing Age and Sex, first we have to guess their sex based on the majority code assigned to that person’s first name, and then go back to the same data to grab the Age that matches up with Name and Sex. Clearly we are going to get it very wrong for a lot of records. You can’t expect these guesses to hold up as well as true age data. Overall, though, there should be some signal in all that noise … if your model believes that “Edgar” is male and 72 years of age, and that “Brittany” is female and 26, well, that’s not unreasonable and it’s probably not far from the truth.

How do we put this all together? I build my models in Data Desk, so I need to get all these elements into my data file as individual variables. You can do this any way that works for you, but I use our database querying software (Hyperion Brio). I import the data into Brio as locally-saved tab-delimited files and join them up as you see below. The left table is my modeling data (or at least the part of it that holds First Name), and the two tables on the right hold the name-specific ages and sexes from all the database records that have this information available. I left-join each of these tables on the First Name field.

[Image “age_tables”: the query layout, with the modelling data left-joined to the age and sex lookup tables on First Name]

When I process the query, I get one row per ID with the fields from the left-hand table, plus the fields I need from the two tables on the right: the so-called Maleness Score, Female Avg Age by FName, Male Avg Age by Fname, and N Avg Age by Fname. I can now paste these as new variables into Data Desk. I still have work to do, though: I do have a small amount of “real” age data that I don’t want to overwrite, and not every First Name has a match in the alumni database. I have to figure out what I have, what I don’t have, and what I’m going to do to get a real or estimated age plugged in for every single record. I write an expression called Age Estimated to choose an age based on a hierarchical set of IF statements. The text of my expression is below — I will explain it in plain English following the expression.

if len('AGE')>0 then 'AGE'

else if textof('SEX')="M" and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="M" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="M" and len('F avg age by Fname')>0 then 'F avg age by Fname'

else if textof('SEX')="F" and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="F" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="F" and len('M avg age by Fname')>0 then 'M avg age by Fname'

else if textof('SEX')="N" and 'Maleness score'>0 and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'<0 and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'=0 and len('N avg age by Fname')>0 then 'N avg age by Fname'

else if len('N avg age by Fname')>0 then 'N avg age by Fname'
else if len('F avg age by Fname')>0 then 'F avg age by Fname'
else if len('M avg age by Fname')>0 then 'M avg age by Fname'

else 49

Okay … here’s what the expression actually does, going block by block through the statements:

  1. If Age is already present, then use that — done.
  2. Otherwise, if Sex is male, and the average male age is available, then use that. If there’s no average male age, then use the ‘N’ age, and if that’s not available, use the female average age … we can hope it’s better than no age at all.
  3. Otherwise if Sex is female, and the average female age is available, then use that. Again, go with any other age that’s available.
  4. Otherwise if Sex is ‘N’, and the Fname is likely male (according to the so-called Maleness Score), then use the male average age, if it’s available. Or if the first name is probably female, use the female average age. Or if the name is tied male-female, use the ‘N’ average age.
  5. Otherwise, as it appears we don’t have anything much to go on, just use any available average age associated with that first name: ‘N’, female, or male.
  6. And finally, if all else fails (which it does for about 6% of my file, or 7,000 records), just plug in the average age of every constituent in the database who has an age, which in our case is 49. This number will vary depending on the composition of your actual data file — if it’s all Parents, for example, then calculate the average of Parents’ known ages, excluding other constituent types.
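
The expression above is Data Desk’s own syntax. For anyone working in Python instead, the same hierarchy could be sketched roughly as follows; the column names (AGE, SEX, Maleness, and the three average-age columns) are assumptions chosen to mirror the variables described above, not the names in my actual file.

import numpy as np
import pandas as pd

def estimate_age(df, overall_avg=49):
    """Pick an estimated age for each row, following the hierarchy described above.
    Assumes columns: AGE, SEX, Maleness, M_avg, F_avg, N_avg (names are illustrative)."""
    m, f, n = df['M_avg'], df['F_avg'], df['N_avg']
    conditions = [
        df['AGE'].notna(),                                        # 1. real age wins
        (df['SEX'] == 'M') & m.notna(),                           # 2. male blocks
        (df['SEX'] == 'M') & n.notna(),
        (df['SEX'] == 'M') & f.notna(),
        (df['SEX'] == 'F') & f.notna(),                           # 3. female blocks
        (df['SEX'] == 'F') & n.notna(),
        (df['SEX'] == 'F') & m.notna(),
        (df['SEX'] == 'N') & (df['Maleness'] > 0) & m.notna(),    # 4. unknown sex: use the score
        (df['SEX'] == 'N') & (df['Maleness'] < 0) & f.notna(),
        (df['SEX'] == 'N') & (df['Maleness'] == 0) & n.notna(),
        n.notna(), f.notna(), m.notna(),                          # 5. anything available
    ]
    choices = [df['AGE'], m, n, f, f, n, m, m, f, n, n, f, m]
    # 6. fall back to the all-database average if nothing else applies
    return pd.Series(np.select(conditions, choices, default=overall_avg), index=df.index)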

When I bin the cases into 20 roughly equal groups by Estimated Age, I see that the percentage of cases that have some giving history starts very low (about 3 percent for the youngest group), rises rapidly to more than 10 percent, and then gradually rises to almost 18 percent for the oldest group. That’s heading in the right direction at least. As well, being in the oldest 5% is also very highly correlated with Lifetime Giving, which is what we would expect from a donor data set containing true ages.

[Chart “est_age_vingt”: percentage of donors in each of the 20 Estimated Age groups]
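
For readers who prefer to check this sort of pattern in code rather than in a stats package, here is a pandas sketch of the binning just described; the file and column names are invented.

import pandas as pd

df = pd.read_csv('scored_constituents.csv')   # assumed columns: Estimated_Age, Has_Giving (0/1)

# 20 roughly equal groups by Estimated Age
df['Age_Vingtile'] = pd.qcut(df['Estimated_Age'], 20, labels=False, duplicates='drop')

# Percentage of donors in each group, youngest to oldest
print(df.groupby('Age_Vingtile')['Has_Giving'].mean().mul(100).round(1))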

This is a bit of work, and probably the gain will be marginal a lot of the time. Data on real interactions that showed evidence of engagement would be superior to age-guessing, but when data is scarce a bit of added lift can’t hurt. If you’re concerned about introducing too much noise, then build models with and without Estimated Age, and evaluate them against each other. If your software offers multiple imputation for missing data as a feature, try checking that out … what I’m doing here is just a very manual form of multiple imputation — calculating plausible values for missing data based on the values of other variables. Be careful, though: A good predictor of Age happens to be Lifetime Giving, and if your aim is to predict Giving, I should think there’s a risk your model will suffer from feedback.

* One final note …

Earlier on I mentioned assuming someone is male or female “just for the convenience of data mining.”  In our databases (and in a conventional, everyday sense too), we group people in various ways — sex, race, creed. But these categories are truly imperfect summaries of reality. (Some more imperfect than others!) A lot of human diversity is not captured in data, including things we formerly thought of as clear-cut. Sex seems conveniently binary, but in reality it is multi-category, or maybe it’s a continuous variable. (Or maybe it’s too complex for a single variable.) In real life I don’t assume that when someone in the Registrar’s Office enters ‘N’ for Sex that the student’s data is merely missing. Because the N category is still such a small slice of the population I might treat it as missing, or reapportion it to either Male or Female as I do here. But that’s strictly for predictive modeling. It’s not a statement about transgendered or differently gendered people nor an opinion about where they “belong.”

13 December 2011

Finding connections to your major gift prospects in your data

Guest post by Erich Preisendorfer, Associate Director, Business Intelligence, Advancement Services, University of New Hampshire

(Thanks to Erich for this guest post, which touches on something a lot of prospect researchers are interested in: mapping relationships to prospects in their database. In fact, this work is more exciting than that, because it actually helps people find connections they may not have known about, via database queries and a simple scoring system. Is your Advancement Services department working on something like this? Why not ask them? — Kevin.)

Data miners often explore sets of data to find meaningful patterns that can then be modeled for prediction, in the hope of helping meet their organization’s end goal(s). However, there may be a time when the end behavior you want to predict is simply not recorded in your database. Such a situation recently came up for my Advancement organization.

Our prospecting team recently started a program wrapped around peer recommendations: A prospect recommends new suspects to us based on the prospect’s interactions with the suspects. The question then became, what can we provide to the prospect to help get them thinking about potential suspects?

We currently do not have any type of data which would allow us to say, “Yes, this is what a relationship looks like,” outside of family relationships. We had to find a different way to identify potential acquaintances. I looked back at my own relationships to determine how I know the people I know. My friends and acquaintances largely come from some basic areas: school, work, places I’ve gone, etc.

Transforming my experience with relationships into what we have for useable data, I saw three key areas where relationships may exist: work history, education history, and extracurricular activities including one-time events. Fortunately, I was able to pinpoint our constituents’ time in each of the above areas to help isolate meaningful, shared experiences amongst constituents. Our work and extracurricular history includes to/from dates, and we have loads of educational history data that includes specific dates. Using this data, I am able to come up with potential relationships from a single prospect.

Prospect Profile (generated by entering a single prospect’s ID):

  • John Adams
  • Widget Factory, Employee 01/05/1971 – 06/16/1996
  • Student Activities: Football, Student Senate 09/1965-05/1966
  • Bachelor of Arts, Botany 1966

Potential Relationships (each item below is a separate query, using the Prospect Profile results):

  • Those employed by the Widget Factory who started before John ended, and ended after John began.
  • Those students who participated in Football and had a class year within +/-3 years of John.
  • Those students in Student Senate at the same time as John, similar to the Widget Factory example.
  • Those students who were in the same class year as John.
  • Those students who share John’s major.

Currently, since I have no way of proving the value of one point of contact over another, each row returned in the potential relationships earns the constituent one point. Since my database stores historical records, I may get more than one row per constituent if they match more than one of John’s associated records – say they participated in Student Senate and played Football. This is great, because I want to give those particular constituents two points since they have more than one touch point in common with John.
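
Erich does this with separate database queries; purely as an illustration, here is how the point-scoring step might be sketched in Python once each query’s results have been exported (the file and column names are invented):

import pandas as pd

# Each query result: one row per constituent who shares a touch point with the prospect
shared_employer = pd.read_csv('shared_employer.csv')      # columns include Constituent_ID
shared_football = pd.read_csv('shared_football.csv')
shared_senate   = pd.read_csv('shared_senate.csv')

# Stack all the rows together; one point per row, so constituents with several
# shared touch points (e.g. Football AND Student Senate) earn several points
all_matches = pd.concat([shared_employer, shared_football, shared_senate])

scores = (all_matches['Constituent_ID']
          .value_counts()
          .rename('Relationship_Points'))

print(scores.head(25))   # ranked list of constituents most likely to know the prospect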

I end up with a ranked list of constituents who share potential relationship contacts with my main prospect. The relationship lists provide our prospect researchers with a starting point for putting together a solid list of high-capacity constituents a single prospect may have some sort of relationship with – and thus greater insight into potential giving.

As of now, the report is in its infancy but looks to have high potential. As we grow the concept, there are multiple data points where further exploration could result in a higher level of functioning. As prospects use the lists to identify people they know, we can then deconstruct those choices to determine what is more likely a relationship. Should shared employment be ranked higher than shared class year? Should Football rank higher than Student Senate? I would guess yes, but I currently do not have supporting data to make that decision.

Another interesting concept, raised at the recent DRIVE 2011 conference, would be: “How are these two prospects potentially related by a third constituent?” The result could mean the difference between two separate, forced conversations and a single conversation with three prospects, shared over nostalgia, drinks and, hopefully, money in the door!

Erich Preisendorfer is Associate Director, Business Intelligence, working in Advancement Services at the University of New Hampshire.

17 December 2010

Chart frequency of words and terms from books with Google

Filed under: Coolness, Free stuff, Fun — kevinmacdonell @ 11:46 am

Here’s another interesting bauble from the nerds at Google. The Books Ngram Viewer allows you to plot the frequency of words and phrases that appeared in books published in the past few hundred years. Google estimates they’ve scanned and OCR’d more than 10 percent of all the books ever published, and this plotter is based on a sample of that data.

This “most excellent time-wasting tool” was blogged about by Alexis Madrigal, a senior editor for TheAtlantic.com, in his post, The Decline of Man (as a Word), in which he shows how the word “man” has fared against “woman”. (Not well.) As Madrigal observes, this may not serve a legitimate research purpose, but it sure is fun.

Here’s a sample. I’ve searched for the term “database”, and set the years to search as 1950 to 2008. The y-axis shows the percentage of all the terms contained in Google’s sample of books written in English that are “database” for those years. As you can see, the word didn’t emerge in published sources before the early 1970s.

The tool also allows you to plot the progress of one term against another. If you plot “database” against “data base”, you’ll see that the two-word term enjoyed a short life before the single word took over. I’ve been interested in the use of the word “gift” instead of “donation,” but the plot of those two words isn’t very informative due, I guess, to the many connotations of the word “gift.” Instead I plotted “charitable gift” and “charitable donation” to put the words in context, and came up with this chart. The concept of giving seems to have had quite a heyday up until around 1835, and “donation” was firmly in the lead. By 1880, though, it was all about the gift.

That got me thinking about how well “philanthropy” has done through the years. Mentions before 1750 are rare, so I plotted from then to the present, and once again the first half of the 19th century seems to have been relatively more preoccupied with the idea than later on. (Although, of course, who knows what data this is really based on. As I said, it’s fun, but I wouldn’t want to base a thesis on it without knowing more about the underlying data.)

Hmm – this IS fun. What if we plot poverty vs. religion vs. education? This doesn’t tell us what people were giving to, but it does give a glimpse into what they were writing about. “Poverty” has stayed relatively constant since 1750, but look at how “religion” has declined as “education” has risen. One line crosses the other right at 1909. Also interesting is that the trend started reversing direction about 10 years ago.

And finally, this chart plots “data mining” and two variations of “fundraising”. Data mining takes off as a published term in the early 1990s, and the term “fund raising” has merged into the single word, “fundraising.”

All sorts of fun. Try some for yourself! I’d be interested in hearing about any cool combos you come up with that relate to analytics and/or fundraising.

9 July 2010

How to infer age, when all you have is a name

Filed under: Coolness, External data, Non-university settings, Predictor variables — kevinmacdonell @ 6:02 am

I rarely post on a Friday, let alone a Friday in the middle of summer, but today’s cool idea is somewhat half-baked. Its very flakiness suits the day and the weather. Actually, I think it has potential, but I’m interested to know what others think.

For those of us in higher-ed fundraising, ‘age’ or ‘class year’ is a key predictor variable. Not everyone has this information in their databases, however. What if you could sort of impute a “best guess” age, based on a piece of data that you do have: First name?

Names go in and out of fashion. You may have played around with this cool tool for visualizing baby-name trends. My own first name, Kevin, peaked in popularity in the 1970s and has been on a downward slide ever since (chart here). I was born in 1969, so that’s pretty close. My father’s name, Leo, has not been popular since the 1920s (he was born in 1930), but is having a slight comeback in recent years (chart here).

As for female names, my mother’s name, Yvonne, never ranked in the top 1,000 in any time period covered by this visualization tool, so I’ll use my niece’s name: Katelyn. She was born in 2005. This chart shows that two common spellings of her name peaked around that year. (The axis labeling is a bit wonky — you’ll have to hover your cursor over the display to get a good read on the timing of the peak.)

You can’t look up every first name one by one, obviously, so you’ll need a data set from another source that relates relative frequencies of names with age data. That sort of thing might be available in census data. But knowing somebody with access to a higher-ed database might be the easiest way.

I’ve performed a query on our database, pulling on just three fields: ID (to ensure I have unique records), First Name, and Age — for more than 87,000 alumni. (Other databases will have only Class Year — we’re fortunate in that we’ve got birth dates for nearly every living alum.) Here are a few sample rows, with ID number blanked out:

From here, it’s a pretty simple matter to copy the data into stats software (or Excel) to compute counts and median ages for each first name. Amazingly, just six first names account for 10% of all living, contactable alumni! (In order: John, David, Michael, Robert, James, and Jennifer.)

On the other hand, a lot of first names are unique in the database, or nearly so. To simplify things a bit, I calculated median ages only for names represented five or more times in the database. These 1,380 first names capture the vast majority of alumni.
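
I did this in my stats software, but the same computation is easy to sketch in pandas (the file and column names are invented):

import pandas as pd

alum = pd.read_csv('alumni_id_name_age.csv')    # columns: ID, First_Name, Age

stats = alum.groupby('First_Name')['Age'].agg(['count', 'median'])
stats = stats[stats['count'] >= 5]              # keep names appearing five or more times

print(stats.sort_values('median', ascending=False).head(10))   # the ten "oldest" names
print(stats.sort_values('median').head(10))                    # the ten "youngest" names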

The ten “oldest” names in the database are listed in the chart below, in descending order by median age. Have a look at these venerable handles. Of these, only Max has staged a rebound in recent years (according to the Baby Names visualizer).

And here are the ten “youngest names,” in ascending order by median age. It’s an interesting coincidence that the very youngest name is Katelyn — my five-year-old niece. One or two (such as Jake) were popular many years ago, and at least one has flipped gender from male to female (Whitney). Most of the others are new on the scene.

The real test is, do these median ages actually provide reasonable estimates of age for people who aren’t in the database?

I’m not in the database (as an alum). There are 371 Kevins in the database, and their median age is 43. I turned 41 in May, so that’s very good.

My father is also not an alum. The 26 Leos in the database have a median age of 50, which is unfortunately 30 years too young. Let’s call that one a ‘miss’.

My mother’s ‘predicted’ age is off by half that — 15 years — that’s not too bad.

Here’s how my three siblings’ names fare: Angela (predicted 36, actual 39 — very good), Paul (predicted 48, actual 38 — fair), and Francis (predicted 60, actual 36 — poor). Clearly there’s an issue with Francis, which according to the Baby Names chart tool was popular many decades ago but not when my brother was named. In other words, results for individuals may vary!

So let’s say you’re a non-profit without access to age data for your database constituents. How does this help you? Well it doesn’t — not directly. You will need to find a data partner at a university who will prepare a file for you, just as I’ve done above. When you import the data into your model, you can match up records by first name and voila, you’ve got a variable that gives you a rough estimate of age. (Sometimes very rough — but it’s better than nothing.)

This is only an idea of mine. I don’t know if anyone has actually done this, so I’d be interested to hear from others. Here are a few additional thoughts:

  • There shouldn’t be any privacy concern — all you want is a list of first names and their median ages, NOT IDs or last names — but be sure to get all necessary approvals.
  • To anticipate your question, no, I won’t provide you my own file. I think you’d be much better off getting a names file from a university in your own city or region, which will provide a more accurate reflection of the ethnic flavour of your constituency.
  • I used “First name”, but of course universities collect first, middle and last names, and the formal first name might not be the preferred moniker. If the university database has a “Preferred first name” field that is fully populated, that might be a better option for matching with your first-name field.
  • Again, there might be more accessible sources of name- or age-related data out there. This idea just sounded fun to try!

17 June 2010

Is distance from campus correlated with giving?

Filed under: Alumni, Coolness, Predictor variables — kevinmacdonell @ 6:26 am

I’ve long been intrigued by the idea that there might be a correlation between distance from campus and giving to the university, as some people insist there is. And I’m doubly intrigued by the possibility that the correlation is not linear, but roughly defined by some kind of curved function. I’ve had no easy way to figure this out — until I enlisted the help of someone smarter than me.

The idea that there’s an association can be found in many sources. Phonathon expert and consultant Jason Fisher asserts that, possibly without exception, alumni participation in the annual fund increases with distance from campus (link). As far back as 1992, Wesley Lindahl and Christopher Winship reported that three studies showed no predictive effect of distance from the school, one indicated that living farther away was a predictor, and one that living closer was a predictor (link).

The notion that the correlation may be curvilinear is suggested by something Ray Satterthwaite of Engagement Analysis Inc. has told me. His extensive studies of alumni engagement reveal a pattern: engagement is often high in the vicinity of campus, takes a dip in the middle distances (100 to 250 km out, for instance), then increases markedly at the farthest distances. Not all schools follow the same pattern, but this trend is typical. If engagement follows a curve based on distance, then giving probably does, too.

How can you figure this out? And of what use would this information be?

To start with the first question: In order to analyze ‘distance from campus’ we need some way to append a figure for ‘distance’ to every individual in our data set, in order to see how that figure relates to giving. By ‘distance’ I mean direct or “as the crow flies,” not by driving distance. So we’re talking about measuring from one point on the globe to another; the first point will be where the alum’s primary address is located, and the second point is your campus.

Points on the globe are expressed in degrees of latitude and longitude. Once you’ve figured out everyone’s latitude and longitude (including that of your own campus), it’s possible to calculate the distance between the two points.
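
For the curious, that calculation is the standard great-circle (“haversine”) formula, sketched here in Python; the campus and alum coordinates shown are made up for illustration.

from math import radians, sin, cos, asin, sqrt

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle ('as the crow flies') distance between two points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))   # 6371 km = mean radius of the Earth

# Hypothetical campus coordinates and one alum's geocoded address
CAMPUS_LAT, CAMPUS_LON = 44.64, -63.59
print(round(distance_km(61.22, -149.90, CAMPUS_LAT, CAMPUS_LON)))   # an alum far away in Alaska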

Here’s where I had to enlist help. If you work at a larger institution, you may be lucky enough to have academic researchers or even staff members working in the field of mapping and GIS (geographical information systems). Our department happens to have a web developer and programmer named Peng Li, and I was vaguely aware that he was working with geographic information. So I asked him if he could help me with a little experiment; if I sent him a batch of North American alumni addresses, could he geocode them for me and send me back distance information? (“Geocode” is the fancy term for determining location data — i.e., latitude and longitude — from other geographic data such as addresses at the street level.)

Peng told me to send him a sample, so I randomly selected about 1,500 American and Canadian addresses. The address data consisted of an Excel file containing nothing but Banner ID and Street, City, State/Prov, and Postcode/ZIP. I included a column for Country, as I suspected Canadian and U.S. addresses might be geocoded differently.

I expected that many addresses wouldn’t geocode properly and would be rejected (post office boxes and so on), but the file I received back had 100% of the addresses geocoded. When I sorted by distance, alumni from just down the street (0.21 km away) were at the top of the list, and alumni out in Alaska (5,626 km away) were at the bottom.

Being a programmer, and just helpful in general, Peng went to the trouble of writing a PHP script to handle the data. The script accessed the Google geocoding service, sent the address data file to Google one row at a time, and created an output file with fields for latitude, longitude and distance. The Google service has a limit of 2,500 addresses every 24 hours, so Peng also built in the ability to access the Yahoo service, which has a limit of 5,000 per day. Now, whenever I want, I can use Peng’s uploader to send batches of addresses to either service. Batch-geocoding every alumnus and alumna is going to take some time, though: Peng advised me that every input data file should be less than 1,000 records in size, to avoid long waits and server errors.

But what about that initial random sample of 1,500 test addresses?

First, I matched up the sample distance data with lifetime giving for each person. The scatterplot below shows lifetime giving on the Y axis and distance from campus on the X axis. The values for both giving and distance are log-transformed, otherwise the plot would look like a giant clump of points and there would be no visible correlation. (Both giving and distance have distributions that are skewed to very low values, with relatively few very large values — perfect for logarithmic transformation.)

Scatterplot: Lifetime giving and Distance from campus.

The points appear to be arranged in vertical bands – probably due to geography (the blank areas are distance bands that contain relatively few alumni). The solid bar of points at the bottom are all non-donors. Aside from that, it would be difficult to discern any pattern, given how many points overlap, without adding a regression line, which we’ve done here. The gentle upward slope of the line indicates a relatively weak positive linear correlation.

Pearson’s r for the two variables is 0.038. That puts ‘log(distance)’ somewhere in the middle of the pack among predictor variables for strength of correlation with ‘log(giving)’. That’s not bad, but it’s not wildly exciting. Nor is it really worth going to a lot of trouble to acquire this data for predictive purposes, especially considering that other, more easily-obtained variables related to geography might work just as well.

But what about non-linear patterns? The stats software I use, called DataDesk, allows me to lay another type of line over my data. “Smoothers” attempt to trace trends in data by following a calculated path among the dots in a scatterplot. This line-fitting exercise is useful for spotting trends or correlations in sequential data which would otherwise remain hidden in a haze of data points. The most common example of data in sequence is time-series data, but any data that is somehow ordered can be smoothed. “Distance from campus” is a good candidate.

There are different types of smoothers. Some try to trace every drop and uptick in the data, and produce very spikey lines. Others are less sensitive to spikes, and produce wiggly lines which are still faithful to the underlying data. The least sensitive type is called a “lowess” smoother. I won’t go into how it’s calculated; it’s enough to know that the result is a wavy curve that shows the overall pattern in data, and I found that it gave the most comprehensible picture of how distance relates to giving.
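
Data Desk draws these smoothers interactively. For anyone wanting to reproduce the log-log scatterplot and a lowess trace in Python, a rough sketch using statsmodels and matplotlib might look like this; the statsmodels lowess is a stand-in that may differ in detail from Data Desk’s smoother, and the file and column names are invented.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

df = pd.read_csv('alumni_distance_giving.csv')      # assumed columns: Distance_km, Lifetime_Giving

# Log-transform both skewed variables (add 1 so zeroes don't blow up)
x = np.log10(df['Distance_km'] + 1)
y = np.log10(df['Lifetime_Giving'] + 1)

smoothed = lowess(y, x, frac=0.3)                    # returns sorted (x, fitted y) pairs

plt.scatter(x, y, s=5, alpha=0.3)
plt.plot(smoothed[:, 0], smoothed[:, 1], linewidth=2)
plt.xlabel('log10(distance from campus + 1)')
plt.ylabel('log10(lifetime giving + 1)')
plt.show()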

Scatterplot with lowess line (smoother) added.

The effect still seems rather subtle, but this looks like what I was expecting: Variations in giving with distance from campus. As with measures of alumni engagement at other schools, giving seems to start at a higher level near the campus, then dips in the middle-distance before rising again at farther distances (with a leveling-off after that).

What about other types of smoothers? DataDesk offers three: lowess, trewess and median smoothers. Here is the scatterplot with a trewess line added.

Scatterplot with trewess smoother.

This smoother is more highly influenced by extreme points in the data, so it’s not as gentle as the lowess line, and there are dramatic spikes wherever a single point exerts undue influence. This is partly a result of having a very small sample (leading to “thin spots” in the data), but in general it doesn’t seem that this smoother is very useful, except perhaps for identifying certain cities or regions that have elevated levels of giving.

Let’s return to the lowess smooth trace for a moment. In the scatterplot below, I have used a selection tool in DataDesk to highlight the alumni living in two bands of distances from campus who seem to be associated with higher levels of giving. I’ve highlighted those points in green.

Scatterplot with the alumni in the two selected distance bands highlighted in green.

At this point, I’m not even sure where these people live. All I know is that these 650 alumni, as a group, seem to have elevated giving. “Seem” is the operative word, because when I isolate these alumni and actually compare their average lifetime giving to that of everyone else, there is no difference whatsoever. The effect we thought we saw, subtle as it was, vanishes when we look at the numbers. This is probably because I’ve selected too many points, or the wrong points, to isolate the true givers.

We can look at it another way, though. The numbers for average lifetime giving that I mentioned above include the ‘zeroes’ (alumni who gave nothing). What if we looked just at participation, instead of dollars given? When I code everyone as either a donor (1) or a non-donor (0), and put that variable on the Y axis of the scatterplot, the smooth trace is still suggesting the same idea, that alumni in a certain distance band are better donors:

Scatterplot: 'Is donor' and Distance.

When I look at the participation numbers, I see that 36% of alumni in the “green zone” (the prominent hump in the lowess line) are donors, compared with 31% of all other alumni. Again, this is a rather subtle difference and not something I’m going to spend a great deal of time coding variables for.

(Interestingly, Peng Li himself conducted a Google Earth-related project that found that our home city has the highest donor rate among all cities in North America that have a concentration of our alumni. Most other cities have donor rates that are at about the same level, regardless of distance.)

A few conclusions, observations, and ideas for further study:

  • Saying that giving always goes up markedly with distance from campus is probably overstating the case. (Sorry, Jason Fisher.)
  • Certain cities or regions might be characterized by elevated levels of giving, and these might show up on a geocode scatterplot — but there are simpler, more direct ways to isolate these geographic pockets of generous alumni.
  • Trends are probably highly influenced by single high-value donors in areas where few alumni live. Larger sample sizes would help even things out.
  • Considering differences in philanthropic culture between Canada and the U.S., it might be wise to analyze each nation separately.

I’m not aware of any fundraisers making use of GIS data. Although our experiment did not really lead to any predictive breakthroughs, I think this area has great potential. This is “cool data”. Do you have a GIS project in the works? I’d like to hear about it.


