CoolData blog

7 July 2014

Mine your donor data with this baseball-inspired analysis

I’ve got baseball analytics on my mind. I don’t know if it’s because of the onset of July or because of a recent mention of CoolData on Nate Silver’s FiveThirtyEight blog, but I have been deeply absorbed in an analysis of donor giving behaviours inspired by Silver’s book, “The Signal and the Noise.” It might give you some ideas for things to try with your own database.

Back in 2003, Silver designed a system to predict the performance of Major League Baseball players. The system, called PECOTA, attempts to understand how a player’s performance evolves as he ages. As Silver writes in his book, its forecasts were probabilistic, offering a range of possible outcomes for each player. From the previous work of others, Silver was aware that hitters reach their peak performance at age 27, on average. Page 81 of his book shows the “aging curve” for major league hitters, a parabola starting at age 18, arcing smoothly upwards through the 20s, peaking at 27, and then just as smoothly arcing downwards to age 39.

My immediate thought on reading about this was, what about donors? Can we visualize the trajectory of various types of donors (major donors, bequest donors, leadership annual fund donors) from their first ten bucks right after graduating, then on into their peak earning years? What would we learn by doing that?

In baseball, the aging curve presents a problem for teams acquiring players with proven track records. By the time they become free agents, their peak years will have passed. However, if the early exploits of a young prospect cause him to resemble one of the greats from the past, perhaps he is worth investing in. The curve, as Silver notes, is only an average. Some players will peak far earlier, and some far later, than the average age of 27. There are different types of players, and difference types of curves, and every individual career is different. But lurking in all that noise, there is a signal.

Silver’s PECOTA system takes things further than that — I will return to that later — but now I want to turn to how we can visualize a sort of aging curve for our donors.

What’s the payoff? Well, to cut to the chase: It appears that some donors who go on to give in six figures (lifetime total) can be distinguished from the majority of lower-level donors at a very early age. Above-average giving (\$200 or \$250, say) in any one year during one’s late 20s or early 30s is a predictor of very high lifetime value. I have found that when big donors have started their giving young, they have started big. That is, “big” in relation to their similarly-aged peers – not at thousands of dollars, but at \$100 or \$500, depending on how old they were at the time.

Call it “precocious giving”.

Granted, it sounds a bit like plain common sense. But being analytical about it opens up the possibility of identifying a donor with high lifetime value when they’re still in their late 30s or early 40s. You can judge for yourself, but the idea appealed to me.

I’m getting ahead of myself. At the start of this, I was only interested in getting the data prepared and plotting the curve to see what it would look like. To the extent that it resembled a curve at all, I figured that “peak age” — the counterpart to baseball player performance — would be the precise age at which a donor gave the most in any given year while they were alive.

~~~~~~~

I wrote a query to pull all donors from the database (persons only), with a row for each year of giving, summing on total giving in that year. Along with Year of Gift, I pulled down Year of Birth for each donor — excluding anyone for whom we had no birthdate. I included only amounts given while the donor was living; bequests were excluded.

The next step was to calculate the age the donor was at the time of the gift. I added a column to the data, defined as the Year of Gift minus Year of Birth. That gave me a close-enough figure for age at time of giving.

As I worked on the analysis, I kept going back to the query to add things I needed, such as certain donor attributes that I wanted to examine. Here are most of the variables I ended up pulling from the database for each unique combination of Donor ID and Age at Time of Gift:

• ID
• Age at Time of Gift (Year of Gift minus Year of Birth)
• Sum of Giving (total giving for that donor, at that age)
• Donor Category Code (Alum, Friend, etc.)
• Total Lifetime Giving (for each donor, without regard to age)
• Deceased Indicator (are they living or dead as of today)
• Current Age (if living, how old are they right now)
• Year of Birth

The result was a data set with more than 200,000 rows. Notice, of course, that a donor ID can appear on multiple rows — one for each value of Age at Gift. The key thing to remember is that I didn’t care what year giving occurred, I only wanted to know how old someone was when they gave. So in my results, a donor who gave in 1963 when she was 42 is much the same as a donor who gave in 2013 he was the same age.

Now it was time to visualize this data, and for that I used Tableau. I connected directly to the database and loaded the data into Tableau using custom SQL. ‘Age at Gift’ is numerical, so Tableau automatically listed that variable in the Measures panel. For this analysis, I wanted to treat it as a category instead, so I dragged it into the Dimensions panel. (If you’re not familiar with Tableau, don’t worry about these application-specific steps — once you get the general idea, you can replicate this using your tool of choice.)

The first (and easiest) thing to visualize was simply the number of donors at each age. Click on the image below to see a full-size version. Every part of the shape of this curve says something interesting, I think, but the one thing I have annotated here is the age at which the largest number of people chose to make a gift.

This chart lumps in everyone — alumni and non-alumni, living donors and deceased donors — so I wanted to go a little deeper. I would expect to see a difference between alumni and non-alumni, for example, so I put all degree and non-degree alumni into one category (Alumni), and all other donor constituents into another (Non-alumni). The curve does not change dramatically, but we can see that the number of non-alumni donors peaks later than the number of alumni donors.

There are a number of reasons for analyzing alumni and non-alumni separately, so from this point on, I decided to exclude non-alumni.

The fact that 46 seems to be an important age is interesting, but this probably says as much about the age composition of our alumni and our fundraising effort over the years as it does about donor behaviour. To get a sense of how this might be true, I divided all alumni donors into quartiles (four bins containing roughly equal numbers of alumni), by Birth Year. Alumni donors broke down this way:

1. Born 1873 to 1944: 8,101 donors
2. Born 1945 to 1955: 8,036 donors
3. Born 1956 to 1966: 8,614 donors
4. Born 1967 to 1991: 8,172 donors

Clearly these are very different cohorts! The donors in the middle two quartiles were born in a span of only a decade each, while the span of the youngest quartile is 24 years, and the span of the oldest quartile is 71 years! When I charted each age group separately, they split into distinct phases. (Reminder: click on the image for a full-size version.)

This chart highlights a significant problem with visualizing the life cycle of donors: Many of the donors in the data aren’t finished their giving careers yet. When Nate Silver talks about the aging curves of baseball players, he means players whose career is behind them. How else to see their rise, peak, and eventual decline? According to the chart above, the youngest quartile peaks (in terms of number of donors) at age 26. However, most of these donors are still alive and have many years of giving ahead of them. We will turn to them to identify up-and-coming donors, but as long as we are trying to map out what a lifetime of giving looks like, we need to focus on the oldest donors.

An additional problem is that our donor database doesn’t go back as far as baseball stats do. Sure, we’ve got people in the database who were born more than 140 years ago, but our giving records are very sparse for years before the early 1970s. If a donor was very mature at that time, his apparent lack of giving history might cause us to make erroneous observations.

I decided to limit the data set to donors born between 1920 and 1944. This excludes the following donors who are likely to have incomplete giving histories:

• Anyone who was older than 50 in 1970, when giving records really started to get tracked, and
• Anyone who is currently younger than 70, and may have many years of giving left.

This is a bit arbitrary, but reasonable. It trims off the donors who could never have had a chance to have a lifetime of giving recorded in the data, without unduly reducing the size of my data set. I was left with only 20% of my original data, but still, that’s more than 6,000 individuals. I could have gotten fussier with this, removing anyone who died at a relatively young age, but I figured the data was good enough to provide some insights.

The dramatic difference made by this trimming is evident in the following two charts. Both charts show a line for the number of donors by age at time of gift, for each of three lifetime giving levels: Under \$1,000 in blue, \$1,000 to \$10,000 in orange, and over \$10,000 in purple. What this means is that all the donors represented by the purple line (for example) gave at least \$10,000 cumulatively over the course of their lifetime.

The first chart is based on ALL the data, before trimming according to birth year. The second chart is based on the 6,000 or so records I was left with after trimming. The first chart seems to offer an interesting insight: The higher the lifetime value of donors, the later in life they tend to show up in great numbers. But of course this just isn’t true. Although the number of donors with lower lifetime giving peaks at earlier ages, that’s only because that whole group of donors is younger: They’re not done giving yet. (I have added ‘Median Current Age’ to the high point of each curve to illustrate this.) Remember, this chart includes everyone — it’s the “untrimmed” data:

Contrast that three-phase chart with this next one, based on “trimmed” data. The curves are more aligned, presumably because we are now looking at a better-defined cohort of donors (those born 1920 to 1944). The oldest donor is 24 years older than the youngest donor, but that’s okay: The most important concern is having good data for that range of ages. Because the tops of these curves are flatter, I have annotated more points, for the sake of interest.

These curves are pretty, but they aren’t analogous to “performance curves” for baseball players — we haven’t yet looked at how MUCH donors give, on average, at each age. However, what general observations can we make from the last chart? Some that come to my mind:

• Regardless of what a donor finally ends up giving lifetime, there are always a few (a very few) who start giving while they are in their 20s, and a few who are still around to give when they are in their late 80s and early 90s.
• The number of donors starts to really take off at around age 40, and there is steady growth until about age 50, when the growth in number of donors begins to slow or plateau.
• Donors start to drop out rapidly at around age 70. This is due to mortality of course, but probably the steepness of the drop is exaggerated by my trimming of the data at the older end.

~~~~~~~

Here is where things really get interesting. The whole point of this exercise was to see if we can spot the telltale signs of a future major donor while they are still relatively young, just as a baseball scout looks for young prospects who haven’t peaked yet. Do donors signal unusual generosity even when they are still in their 20s and 30s? Let’s have a look.

I zoomed in on a very small part of the chart, to show giving activity up until age 35. Are there differences between the various levels of donors? You bet there are.

As soon as a high-lifetime-value donor starts to give, the gifts are higher, relative to same-age peers who will end up giving less. The number of donors at these early ages is miniscule, so take this with a grain of salt, but a trend seems unmistakable: Up to the age of 30, donors who will end up giving in five figures and higher give about 2.5 to 3.5 times as much per year as other donors their age who end up giving \$1,000 to \$10,000 lifetime. AND, they give FIVE TIMES as much per year as other donors their age who end up giving less than \$1,000 lifetime.

Later on, at ages 35 and 40, donors who will finish their giving careers at the high end are giving two to three times as much per year as donors in the middle range, and 5.6 to 7 times per year (respectively) as donors who will finish on the lowest end.

It might be less confusing to chart each group of donors by average giving per year, rather than by number of donors. This chart shows average giving per year up until age 65. Naturally, the averages get very spiky, as donors start making large gifts.

To temper the effect of extreme values, I log-transformed the giving amounts. This made it easier to visualize how these three tiers of donors differ from each other over a lifetime of giving:

What do I see from this? These are generalizations based on averages, but potentially useful generalizations:

• Upper-end donors start strong relative to other donors, accelerate giving after age 40, and continue to increase giving throughout their lifetimes.
• Middle- and low-range donors start lower. They also increase their yearly giving until their late 40s, but after that, they plateau and stay at the same level for the rest of their lives.

~~~~~~~

What’s the bottom line here? I think it’s this: Hundreds of donors were well on their way to being exceptional by the tender age of 40, and a few were signaling long before that.

Information like this would be interesting to Annual Fund as they work to identify prospects for leadership-level giving. But \$10,000 in a lifetime is a little too low to make the Major Gifts folks take notice. Can we carve out the really big donors from the \$10K-plus crowd? And can we also identify them before they hit 40? Have a look at this chart. For this one, I removed all the donors who gave less than \$10,000 lifetime, and then I divided the high-end donors into those who gave less than \$100,000 lifetime (green line) and those who gave more than \$100,000 (red line).

The lines get a bit jagged, but it looks to me like the six-figure lifetime donors pull away from the five-figure donors while still in their 40s. And notice as well that they increase their giving after age 65, which is very unusual behaviour: By 65, the vast majority of donors have either long plateaued or are starting to wind down. (You can’t see this in the chart, but that post-65 group of very generous donors numbers about 50 individuals, with yearly average giving ranging around \$25,000 to \$50,000.)

When I drill down, I can see about a hundred donors sitting along the red line between the ages of 30 and 45, whom we might have identified as exceptional, had we known what to look for.

With the benefit of hindsight, we are now able to look at current donors who were born more recently (after 1969, say), and identify who’s sending out early signals. I have those charts, but I think you’ve seen enough, and as I have said many times in the past: My data is not your data. So while I can propose the following “rules” for identifying an up-and-comer, I don’t recommend you try applying them to your own situation without running your own analysis:

• Gave more than \$200 in one year, starting around age 28.
• Gave more than \$250 in one year, starting around age 29.
• Gave more than \$500 in one year, starting around age 32.

Does this mean I think we can ask a 32-year-old for \$10,000 this year? No. It means that this 32-year-old is someone to watch out for and to keep engaged as an alum. It’s the donors over 50 or so who have exhibited these telltale patterns in their early giving that might belong in a major gift prospect portfolio.

Precocious giving certainly isn’t the only indicator of a good prospect, but along with a few other unusual traits, it is a good start. (See: Odd but true findings? Upgrading annual donors are “erratic” and “volatile”.)

~~~~~~~

Where do you go from here? That is completely up to you. I am still in the process of figuring out how to best use these insights.

Coming up with some rules of thumb, as above, is one way to proceed. Another is rolling up all of a donor’s early giving into a single score — a Precocity Score — that takes into account both how much a donor gave, and how young she was when she gave it. I experimented with a formula that gave progressively higher weights to the number of dollars given for younger ages. For example, \$100 given at age 26 might be worth several times more than \$200 given at age 44.

Using my data set of donors with a full life cycle of giving, I tested whether this score was predictive of lifetime value. It certainly was. However, I also found that a simple cumulative sum of a donor’s giving up to age 35 or 40 was equally as predictive. There didn’t seem to be any additional benefit to giving extra weight to very early giving.

I am shying away from using giving history as an input in a predictive model. I see people do this all the time, but I have always avoided the practice. My preference is to use some version of the rules above as just one more tool to use in prospect identification, distinct from other efforts such as predictive modelling.

~~~~~~~

That’s as far as I have gotten. If this discussion has given you some ideas to explore, then wonderful. I doubt I’m breaking new ground here, so if you’ve already analyzed giving-by-age, I’d be interested in hearing how you’ve applied what you’ve learned.

Incidentally, Nate Silver went on to produce “similarity scores” for pairs of hitters. Using baseball’s rich trove of data, he compared players using a nearest-neighbour analysis, which took into account a wide range of data points, from player height and weight to all the game stats that baseball is famous for. A young prospect in the minor leagues with a score that indicates a high degree of similarity with a known star might be expected to “age” in a similar way. That was the theory, anyway.

One can imagine how this might translate to the fundraising arena. If you identified groups of your best donors, with a high degree of similarity among the members of each group, you could then identify younger donors with characteristics that are similar to the members of each group. After all, major gift donors are not all alike, so why not try to fit multiple “types”?

I would guess that the relatively small size of each group would cause any signal to get drowned out in the noise. I am a little skeptical that we can parse things that finely. It would, however, be an interesting project.

A final note. The PECOTA system had some successes and for a time was an improvement on existing predictive tools. Over time, however, pure statistics were not a match for the combination of quantitative methods and the experience and knowledge of talent scouts. In the same way, identifying the best prospects for fundraising relies on the combined wisdom of data analysts, researchers and fundraisers themselves.

16 May 2014

Find out “who knows who” in your database with SQL and Python

Filed under: Fun, Prospect identification, Python, SQL — Tags: , , , , — kevinmacdonell @ 12:30 pm

Figuring out connections is a big part of major gift fundraising. A development office might have a great list of prospective donors in hand, but those prospects will languish untouched if the right people who can make introductions are never found. While cultivating relationships with well-connected people is clearly important, research also plays a role in suggesting possible go-betweens. And to the extent that research involves sifting through data to find connections, it’s possible to automate some of the work.

Wouldn’t it be great to have software that asks you to name a prospective donor, and then spits out a list of people in your database who are most likely to be associates of that person? Sure it would!

Social media applications do a version of this, suggesting “people you may know,” based on shared connections and attributes such as schools attended, professional group memberships, employment information, and stated interests. I first encountered a higher-ed advancement application of this concept via Erich Preisendorfer of the University of New Hampshire. A while back, Erich explained his approach in a guest post for CoolData. (Read it here: Finding connections to your major gift prospects in your data.) I was recently in touch with Erich and he reports that the tool is still in use by prospect researchers at UNH and is still turning up valuable connections.

I tried to build something similar using our reporting tools, but it was a slow-lumbering beast, and not very user-friendly for the people most likely to be interesting in using it — prospect researchers, mainly. I never gave up on the idea of building something better, and finally on a quiet weekend a few months ago, I found two or three hours to mess around with a Python script. Given the ID of a person, this script makes multiple queries of the database and joins the results together to produce a ranked list of other people who might be associated with the input ID.

When the script is run, it prompts the user for an ID, then an Oracle database password. I’ll get into more detail later, but here is what the output looks like in the Python shell of IDLE, the development tool that I used to write the script, when I use my own ID as the one to find connections for. (I have blanked out the ID and password.)

The script goes through these steps:

1. It takes the ID provided by the user and queries the database to display the name associated with that ID (that would be me, in this example).
2. It checks to see what events I attended, and produces a list (in memory) of all the IDs of constituents who attended those same events.
3. It checks to see what activity codes I have attached to my record — varsity sports teams, student clubs and so on. I have none, but if I did, the script would return a list of all the people who shared the same activity codes. Again, this is saved in memory for use later on.
4. The same routine is followed for employers (past and present), all class years, degrees, referring high school, cities extracted from mailing addresses (again, all past and present addresses), age, and database cross-references.
5. All of these lists are combined into a single “object” in memory. The number of instances of each distinct ID is counted up. Every ID retrieved is a “hit”. The idea is, the more hits for any one retrieved ID, the more likely that ID belongs to someone who knows me (or knows the person who we are searching on).
6. Because the resulting list is huge (more than 31,000 records just for me — and I have very little data attached to my record), the script asks the user to limit by a minimum number of hits. I’ve entered 3 here.
7. The result is a file with 47 IDs. The script then does a final query on the database to retrieve the names and other details for these people, and saves the results as a .csv file that can be opened in Excel.

Here are the top few rows of the output. I have removed some columns for simplicity, and blanked out some personal information. The count_sum column is the “hit count”, and the list is sorted in descending order by count_sum. The person being searched on is always at the top of the list. This is not an especially impressive example, because I am not an alumnus. My list of likely connections is made up of co-workers and people from my hometown who happen to be the same age as me. The woman in the third row of the data was someone I had a crush on in junior high school but haven’t seen in 27 years … If I were a prospective donor, this search would be a bust. Clearly this method requires quality alumni data in order to be useful.

That’s how it functions. We will dig into the actual code shortly. But first, let me say that my primary aim is to share this with you as an idea. I can’t help you with installing Python or debugging or any of that. All I can do is show you my code and hope you can modify it to make it work for you. If you think this is cool and useful but don’t want to learn any code, I suggest you team up with someone from IT or your Advancement Services team, or a student or prof in the Computing Science department.

As I said, I’m using Python and a code library for handling data called pandas. (For some background, see my earlier posts: Getting bitten by Python and Automate RFM scoring of your donors with this Python script.) The great thing about pandas is that you can load data into in-memory structures called DataFrames, which are like data tables in that you can perform joins and sort, rank, de-dupe, add calculated columns, and so on.

A few closing comments before I show you the code:

• Erich at UNH goes the extra step of matching on date range along with employer. I think a shared corporate connection is worth a point, but being at the same place of employment at the same time is obviously worth a lot more. If you have other ideas for things to match on, I’d love to hear them.
• Those of you who aren’t into Python but know SQL can probably make an all-SQL version of this. I just like Python.
• In the code, I have inserted comments to help guide you through. Everything to the right of a “#” is a comment.
• The script includes multiple individual queries. You may wish to lump them together using UNION ALL, appending all the results sets together right in the body of the SQL. Up to you.
• The queries that pull records based on shared events, degrees and so on all have the same basic structure: First, a WITH clause gathers the codes for the events (or degrees or cities) that the target ID has attached to his or her record, then a SELECT statement finds all the other IDs who have those same codes attached to their records.
• In order to break the SQL up into multiple lines, I have enclosed the string between groups of three single quotes. You’ll see that whenever I have to insert a variable into the SQL (such as ‘target_pidm’), I have to use what looks like a confusing jumble of single and double quotes — but the ordering is very precise.
• I was going to simplify the SQL for you, but didn’t see the point. One way or the other, you will still have to modify it. So it’s much the same as what I actually use. You’ll also notice that I tend not to use aliases for tables and views, which makes the queries text-heavy. Sorry.
• One limitation of this tool is that it will run only on a machine that has a Python installation. For that reason I haven’t been able to simply give it to someone (in Prospect Research, say) and have them run it themselves. The next step might be bundling it as a stand-alone executable file (an .exe) that could be shared with others, but I haven’t gotten that to work yet.

AND, one final, final comment: This is probably the last piece on Python and database querying I’m going to write for a while. In future posts I will tone down the technical stuff and get back to talking about data and predictive modelling.

```# Import required code libraries

import pandas as pd

import numpy as np

import pandas.io.sql as psql

import cx_Oracle, os

# User prompts, and connect to Oracle database

ID_input = raw_input('Enter ID to find connectors for: ')

user = raw_input('Enter username for database: ')

connectstr = user + '/' + password + '@database:0000'

connection = cx_Oracle.connect(connectstr)

# Query to get the database PIDM and name for the entered ID.

SQL = '''

SELECT

BANINST1.AA_ENTITY.PIDM,

BANINST1.AA_ENTITY.ID,

BANINST1.AA_ENTITY.FIRST_NAME,

BANINST1.AA_ENTITY.MI,

BANINST1.AA_ENTITY.LAST_NAME

FROM BANINST1.AA_ENTITY

WHERE

(BANINST1.AA_ENTITY.CHANGE_IND IS NULL

AND BANINST1.AA_ENTITY.ID=''' + "'" + target + "'" + ''')'''

df = psql.frame_query(SQL, con=connection)

name = str(df['FIRST_NAME']+' '+df['LAST_NAME'])

print('Searching for connections to: '+ name)

target_pidm = max(df['PIDM'])    # PIDM, not ID, is the primary key in my database.

# Find events

SQL = '''

WITH events AS (

SELECT

BANINST1.AA_EVENTS.EVNT_CRN

FROM BANINST1.AA_EVENTS

WHERE BANINST1.AA_EVENTS.PIDM = ''' +str(target_pidm)+ ''')

SELECT

events.EVNT_CRN,

BANINST1.AA_EVENTS.PIDM

FROM

events,

BANINST1.AA_EVENTS

WHERE

events.EVNT_CRN = AA_EVENTS.EVNT_CRN (+)'''

events = psql.frame_query(SQL, con=connection)

print("Found " + str(len(events)) + " people who attended the same events.")

# Find activities

SQL = '''WITH activities AS (

SELECT

FROM

WHERE

SELECT

activities.ACTP_CODE,

FROM activities,

WHERE

'''

activities = psql.frame_query(SQL, con=connection)

print("Found " + str(len(activities)) + " people with the same Activity codes.")

# Find employers

SQL = '''WITH employers AS (

SELECT

BANINST1.AA_ALL_EMPLOYMENT.EMPR_NAME

FROM

BANINST1.AA_ALL_EMPLOYMENT

WHERE

BANINST1.AA_ALL_EMPLOYMENT.PIDM = '''+str(target_pidm)+''')

SELECT

employers.EMPR_NAME,

BANINST1.AA_ALL_EMPLOYMENT.PIDM

FROM

employers,

BANINST1.AA_ALL_EMPLOYMENT

WHERE

employers.EMPR_NAME = BANINST1.AA_ALL_EMPLOYMENT.EMPR_NAME (+)'''

employers = psql.frame_query(SQL, con=connection)

print("Found " + str(len(employers)) + " people with the same Employers.")

# Find class years

SQL = '''WITH classes AS (

SELECT

BANINST1.AA_DEGREE.ACYR_CODE

FROM

BANINST1.AA_DEGREE

WHERE

BANINST1.AA_DEGREE.PIDM = '''+str(target_pidm)+''' )

SELECT

classes.ACYR_CODE,

BANINST1.AA_DEGREE.PIDM

FROM

classes,

BANINST1.AA_DEGREE

WHERE

classes.ACYR_CODE = BANINST1.AA_DEGREE.ACYR_CODE (+)'''

classes = psql.frame_query(SQL, con=connection)

print("Found " + str(len(classes)) + " people with the same class year(s).")

# Find degrees

SQL = '''

WITH degrees AS (

SELECT

BANINST1.AA_DEGREE.DEGC_CODE

FROM

BANINST1.AA_DEGREE

WHERE

BANINST1.AA_DEGREE.PIDM = '''+str(target_pidm)+'''

)

SELECT

degrees.DEGC_CODE,

BANINST1.AA_DEGREE.PIDM

FROM

degrees,

BANINST1.AA_DEGREE

WHERE

degrees.DEGC_CODE = BANINST1.AA_DEGREE.DEGC_CODE (+)'''

degrees = psql.frame_query(SQL, con=connection)

print("Found " + str(len(degrees)) + " people with the same degree(s).")

# Find high school

SQL = '''

WITH high_school AS (

SELECT

BANINST1.AA_HIGH_SCHOOL.HIGH_SCHOOL_CODE

FROM

BANINST1.AA_HIGH_SCHOOL

WHERE

BANINST1.AA_HIGH_SCHOOL.PIDM = '''+str(target_pidm)+'''

)

SELECT

high_school.HIGH_SCHOOL_CODE,

BANINST1.AA_HIGH_SCHOOL.PIDM

FROM

high_school,

BANINST1.AA_HIGH_SCHOOL

WHERE

high_school.HIGH_SCHOOL_CODE = BANINST1.AA_HIGH_SCHOOL.HIGH_SCHOOL_CODE (+)'''

hs = psql.frame_query(SQL, con=connection)

print("Found " + str(len(hs)) + " people from the same high school.")

# Find cities

SQL = '''

WITH cities AS (

SELECT

FROM

WHERE

)

SELECT DISTINCT

FROM

cities,

WHERE

cities = psql.frame_query(SQL, con=connection)

print("Found " + str(len(cities)) + " people with addresses in the same city/cities.")

# Find age

SQL = '''

WITH age AS (

SELECT

BANINST1.AA_ENTITY.AGE

FROM

BANINST1.AA_ENTITY

WHERE

BANINST1.AA_ENTITY.PIDM = '''+str(target_pidm)+'''

)

SELECT

age.AGE,

BANINST1.AA_ENTITY.PIDM

FROM

age,

BANINST1.AA_ENTITY

WHERE

age.AGE = BANINST1.AA_ENTITY.AGE (+)'''

age = psql.frame_query(SQL, con=connection)

print("Found " + str(len(age)) + " people of the same age.")

# Find cross-references

SQL = '''

WITH xref AS (

SELECT

ALUMNI.APRXREF.APRXREF_XREF_PIDM

FROM

ALUMNI.APRXREF

WHERE

ALUMNI.APRXREF.APRXREF_PIDM = '''+str(target_pidm)+'''

)

SELECT

ALUMNI.APRXREF.APRXREF_XREF_PIDM,

ALUMNI.APRXREF.APRXREF_PIDM AS PIDM

FROM

xref,

ALUMNI.APRXREF

WHERE

xref.APRXREF_XREF_PIDM = ALUMNI.APRXREF.APRXREF_XREF_PIDM (+)'''

xref = psql.frame_query(SQL, con=connection)

print("Found " + str(len(xref)) + " people who share the same cross-references.")

# Concatenate all results

print("Joining results.")

results = pd.concat([events, activities, employers, classes, degrees, hs, cities, age, xref])

# Remove unneeded columns

results = results.drop(['ACTP_CODE', 'ACYR_CODE', 'AGE', 'DEGC_CODE', 'EMPR_NAME', 'EVNT_CRN', 'HIGH_SCHOOL_CODE', 'SPRADDR_CITY', 'APRXREF_XREF_PIDM'], axis=1)

print("Found " + str(len(results)) + " total results.")

## Count up the number of times each PIDM appears in the results. (A few steps here.)

# Add a column called ‘count’ and populate every row with a ‘1’

results['count'] = 1

# Group by PIDM and sum on ‘count’ to get ‘hits per PIDM’.

results = results.join(results.groupby(results['PIDM'])['count'].sum(), on=['PIDM'], rsuffix='_sum')

# De-dupe rows in order to get one result per PIDM.

results = results.drop_duplicates()

# Sort results in descending order by the number of hits.

results.sort(columns='count_sum', axis=0, ascending=False, inplace=True)

# The file is potentially huge, so prompt the user to limit it by hit count.

cutoff = raw_input('How many hits would you like to limit by? ')

# Keep rows with minimum number of hits.

results = results[results['count_sum'] >= int(cutoff)]

# Get rid of the ‘count’ column - not needed.

results = results.drop(['count'], axis=1)

print("Found " + str(len(results)) + " unique results, with at least "+cutoff+" hits.")

## Join names to results

print('Preparing results and saving.')

SQL = '''

SELECT

BANINST1.AA_ENTITY.PIDM,

BANINST1.AA_ENTITY.ID,

BANINST1.AA_ENTITY.NAME_PREFIX,

BANINST1.AA_ENTITY.FIRST_NAME,

BANINST1.AA_ENTITY.MI,

BANINST1.AA_ENTITY.LAST_NAME,

BANINST1.AA_ENTITY.NAME_SUFFIX,

BANINST1.AA_ENTITY.AGE,

BANINST1.AA_DONOR_TYPE_HIGHEST.DONR_CODE

FROM

BANINST1.AA_ENTITY,

BANINST1.AA_DONOR_TYPE_HIGHEST

WHERE

BANINST1.AA_ENTITY.PIDM = BANINST1.AA_DONOR_TYPE_HIGHEST.PIDM (+)

AND BANINST1.AA_ENTITY.CHANGE_IND IS NULL

'''

results = pd.merge(results, psql.frame_query(SQL, con=connection), how='left', on=['PIDM'], sort=False)

results.to_csv(target + ' - ' + name + '.csv')

print('All done!')

connection.close()

```

2 May 2013

New twists on inferring age from first name

Filed under: Analytics, Coolness, Data Desk, Fun — Tags: , , , — kevinmacdonell @ 6:14 am

Not quite three years ago I blogged about a technique for estimating the age of your database constituents when you don’t have any relevant data such as birth date or class year. It was based on the idea that many first names are typically “young” or “old.” I expanded on the topic in a followup post: Putting an age-guessing trick to the test. Until now, I’ve never had a reason to guess someone’s age — alumni data is pretty well supplied in that department. This very month, though, I have not one but two major modeling projects to work on that involve constituents with very little age data present. I’ve worked out a few improvements to the technique which I will share today.

First, here’s the gist of the basic idea. Picture two women, named Freda and Katelyn. Do you imagine one of them as older than the other? I’m guessing you do. From your own experience, you know that a lot of young women and girls are named Katelyn, and that few if any older women are. Even if you aren’t sure about Freda, you would probably guess she’s older. If you plug these names into babynamewizard.com, you’ll see that Freda was a very popular baby name in the early 1900s, but fell out of the Top 1000 list sometime in the 1980s. On the other hand, Katelyn didn’t enter the Top 1000 until the 1970s and is still popular.

To make use of this information you need to turn it into data. You need to acquire a lot of data on the frequency of first names and how young or old they tend to be. If you work for a university or other school, you’re probably in luck: You might have a lot of birth dates for your alumni or, failing that, you have class years which in most cases will be a good proxy for age. This will be the source you’ll use for guessing the age of everyone else in your database — friends, parents and other person constituents — who don’t have ages. If you have a donor database that contains no age data, you might be able to source age-by-first name data somewhere else.

Back to Freda and Katelyn … when I query our database I find that the average age of constituents named Freda is 69, while the average age for Katelyn is 25. For the purpose of building a model, for anyone named Freda without an age, I will just assume she is 69, and for anyone named Katelyn, 25. It’s as simple as creating a table with two columns (First name and Average age), and matching this to your data file via First Name. My table has more than 13,500 unique first names. Some of these are single initials, and not every person goes by their first name, but that doesn’t necessarily invalidate the average age associated with them.

I’ve tested this method, and it’s an improvement over plugging missing values with an all-database average or median age. For a data set that has no age data at all, it should provide new information that wasn’t there before — information that is probably correlated with behaviours such as giving.

Now here’s a new wrinkle.

In my first post on this subject, I noted that some of the youngest names in our database are “gender flips.” Some of the more recent popular names used to be associated with the opposite gender decades ago. This seems to be most prevalent with young female names: Ainslie, Isadore, Sydney, Shelly, Brooke. It’s harder to find examples going in the other direction, but there are a few, some of them perhaps having to do with differences in ethnic origin: Kori, Dian, Karen, Shaune, Mina, Marian. In my data I have close to 600 first names that belong to members of both sexes. When I calculate average age by First Name separately for each sex, some names end up with the exact same age for male and female. These names have an androgynous quality to them: Lyndsay, Riley, Jayme, Jesse, Jody. At the other extreme are the names that have definitely flipped gender, which I’ve already given examples of … one of the largest differences being for Ainslie. The average male named Ainslie is 54 years older than the average female of the same name. (In my data, that is.)

These differences suggest an improvement to our age-inferring method: Matching on not just First Name, but Sex as well. Although only 600 of my names are double-gendered, they include many popular names, so that they actually represent almost one-quarter of all constituents.

Now here’s another wrinkle.

When we’re dealing with constituents who aren’t alumni, we may be missing certain personal information such as Sex. If we plan to match on Sex as well as First Name, we’ve got a problem. If Name Prefix is present, we can infer from whether it’s Mr., Ms., etc., but unless the person doing the data entry was having an off day, this shouldn’t be an avenue available to us — it should already be filled in. (If you know it’s “Mrs.,” then why not put in F for Sex?) For those records without a Sex recorded (or have a Sex of ‘N’), we need to make a guess. To do so, we return to our First Names query and the Sex data we do have.

In my list of 600 first names that are double-gendered, not many are actually androgynous. We have females named John and Peter, and we have males named Mary and Laura, but we all know that given any one person named John, chances are we’re talking about a male person. Mary is probably female. These may be coding errors or they may be genuine, but in any case we can use majority usage to help us decide. We’ll sometimes get it wrong — there are indeed boys named Sue — but if you have 7,000 Johns in your database and only five of them are female, then let’s assume (just for the convenience of data mining*) that all Johns are male.

So: Query your database to retrieve every first name that has a Sex code, and count up the instance of each. The default sex for each first name is decided by the highest count, male or female. To get a single variable for this, I subtract the number of females from the number of males for each first name. Since the result is positive for males and negative for females, I call it a “Maleness Score” — but you can do the reverse and call it a Femaleness Score if you wish! Results of zero are considered ties, or ‘N’.

At this point we’ve introduced a bit of circularity. For any person missing Age and Sex, first we have to guess their sex based on the majority code assigned to that person’s first name, and then go back to the same data to grab the Age that matches up with Name and Sex. Clearly we are going to get it very wrong for a lot of records. You can’t expect these guesses to hold up as well as true age data. Overall, though, there should be some signal in all that noise … if your model believes that “Edgar” is male and 72 years of age, and that “Brittany” is female and 26, well, that’s not unreasonable and it’s probably not far from the truth.

How do we put this all together? I build my models in Data Desk, so I need to get all these elements into my data file as individual variables. You can do this any way that works for you, but I use our database querying software (Hyperion Brio). I import the data into Brio as locally-saved tab-delimited files and join them up as you see below. The left table is my modeling data (or at least the part of it that holds First Name), and the two tables on the right hold the name-specific ages and sexes from all the database records that have this information available. I left-join each of these tables on the First Name field.

When I process the query, I get one row per ID with the fields from the left-hand table, plus the fields I need from the two tables on the right: the so-called Maleness Score, Female Avg Age by FName, Male Avg Age by Fname, and N Avg Age by Fname. I can now paste these as new variables into Data Desk. I still have work to do, though: I do have a small amount of “real” age data that I don’t want to overwrite, and not every First Name has a match in the alumni database. I have to figure out what I have, what I don’t have, and what I’m going to do to get a real or estimated age plugged in for every single record. I write an expression called Age Estimated to choose an age based on a hierarchical set of IF statements. The text of my expression is below — I will explain it in plain English following the expression.

```if len('AGE')>0 then 'AGE'

else if textof('SEX')="M" and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="M" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="M" and len('F avg age by Fname')>0 then 'F avg age by Fname'

else if textof('SEX')="F" and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="F" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="F" and len('M avg age by Fname')>0 then 'M avg age by Fname'

else if textof('SEX')="N" and 'Maleness score'>0 and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'<0 and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'=0 and len('N avg age by Fname')>0 then 'N avg age by Fname'

else if len('N avg age by Fname')>0 then 'N avg age by Fname'
else if len('F avg age by Fname')>0 then 'F avg age by Fname'
else if len('M avg age by Fname')>0 then 'M avg age by Fname'

else 49```

Okay … here’s what the expression actually does, going block by block through the statements:

1. If Age is already present, then use that — done.
2. Otherwise, if Sex is male, and the average male age is available, then use that. If there’s no average male age, then use the ‘N’ age, and if that’s not available, use the female average age … we can hope it’s better than no age at all.
3. Otherwise if Sex is female, and the average female age is available, then use that. Again, go with any other age that’s available.
4. Otherwise if Sex is ‘N’, and the Fname is likely male (according to the so-called Maleness Score), then use the male average age, if it’s available. Or if the first name is probably female, use the female average age. Or if the name is tied male-female, use the ‘N’ average age.
5. Otherwise, as it appears we don’t have anything much to go on, just use any available average age associated with that first name: ‘N’, female, or male.
6. And finally, if all else fails (which it does for about 6% of my file, or 7,000 records), just plug in the average age of every constituent in the database who has an age, which in our case is 49. This number will vary depending on the composition of your actual data file — if it’s all Parents, for example, then calculate the average of Parents’ known ages, excluding other constituent types.

When I bin the cases into 20 roughly equal groups by Estimated Age, I see that the percentage of cases that have some giving history starts very low (about 3 percent for the youngest group), rises rapidly to more than 10 percent, and then gradually rises to almost 18 percent for the oldest group. That’s heading in the right direction at least. As well, being in the oldest 5% is also very highly correlated with Lifetime Giving, which is what we would expect from a donor data set containing true ages.

This is a bit of work, and probably the gain will be marginal a lot of the time. Data on real interactions that showed evidence of engagement would be superior to age-guessing, but when data is scarce a bit of added lift can’t hurt. If you’re concerned about introducing too much noise, then build models with and without Estimated Age, and evaluate them against each other. If your software offers multiple imputation for missing data as a feature, try checking that out … what I’m doing here is just a very manual form of multiple imputation — calculating plausible values for missing data based on the values of other variables. Be careful, though: A good predictor of Age happens to be Lifetime Giving, and if your aim is to predict Giving, I should think there’s a risk your model will suffer from feedback.

* One final note …

Earlier on I mentioned assuming someone is male or female “just for the convenience of data mining.”  In our databases (and in a conventional, everyday sense too), we group people in various ways — sex, race, creed. But these categories are truly imperfect summaries of reality. (Some more imperfect than others!) A lot of human diversity is not captured in data, including things we formerly thought of as clear-cut. Sex seems conveniently binary, but in reality it is multi-category, or maybe it’s a continuous variable. (Or maybe it’s too complex for a single variable.) In real life I don’t assume that when someone in the Registrar’s Office enters ‘N’ for Sex that the student’s data is merely missing. Because the N category is still such a small slice of the population I might treat it as missing, or reapportion it to either Male or Female as I do here. But that’s strictly for predictive modeling. It’s not a statement about transgendered or differently gendered people nor an opinion about where they “belong.”

24 April 2012

Data I want to play with

Filed under: Data, Fun — Tags: — kevinmacdonell @ 5:23 am

Guest post by Marianne M. Pelletier, Director of Advancement Research and Data Support, Cornell University

In my present job, I deal with a whole lot of data – over 2,000 fields of data on gifts, names, addresses, relationships, segmenting codes, dates, attributes, interests, contacts, you name it. Yet getting to play in this playground as a donor modeler only leaves me lusting for other kinds of data to play with, so much that my hobbies often lead me to places where data lives so I can fool with it. This short article is my wish list, whether or not I’ll ever get to mine any of it.

Horse Races are tracked to the umpteenth degree by handicappers. Buy a copy of the Daily Racing Form and you’ll see more statistics presented than you can read in a week. DRF also has a web page where you can download even more statistics – tracking the horses’ pedigree generations back in time and the jockey’s entire career, ride by ride. So what do I do? I spend some Sundays diligently typing key statistics into a homemade database, along with the race results, to see if I can find the regression formula that would make me more money than just following the program picks. The answer? So far, on maiden sprints on dirt, the horse that had the fastest workout is most likely to win. For every other kind of race, I’m still wishing to buy the data in a format I can manipulate instead of having to type it.

Speaking of gambling, I’d give my remaining eye tooth to play in Harrah’s data. Harrah is an incredibly good marketing firm, from offering me a free weekend to their new casino in some remote place to being the only game in town that offers \$10 craps all weekend long. Imagine if you will getting to download affinity player card data and tracking where a person wanders in the casino – how many mix slots with table play? How many are single game players? What if the casino moved the buffet closer to Keno? What’s the best game to put right inside the valet parking entrance? Do the longer, red craps tables make one bet more or lose more? Or play longer? What is the average time for a player at a blackjack table? What if she’s drinking alcohol? What if she’s an awards card member? What if the player is male? What if the dealer is the same gender as the player? I’d be a kid in a candy store to get a contract to work data like that.

On the other side of the coin, what is the effect of parking availability on local business? Wouldn’t it be fun to figure out the dependent variable on that? Ithaca recently changed its parking rates from the first hour free to charging for every hour. Was it that or the longstanding recession that caused local businesses to disappear? Or is the turnover normal? Would I have to study when the students are in town vs. when they are gone? Would local businesses share their profit numbers with me?

And then there’s the whole thing about the best time of year to go to Disney World. I’d want to offer Disney a study of some kind (like, which ride should go next to the Small World ride?) in order to get data on when I’m most likely to enjoy good weather, a maximum number of rides open, and the fewest number of screaming children and strollers under my feet.

And speaking of flying somewhere, I’d love for Delta to hire me to study when people want to fly somewhere. All that Expedia/Travelocity search data – does anyone use it? After all, what if airlines could arrange that people in Boston can fly midmorning but people in New York can fly at night? What if there were one extra flight at 11:00 am from somewhere that would double an airline’s traffic because of the ripple effect? I’d love to be the one who discovers that.

Lastly, who can resist wishing to forecast forex? The currency exchange market is very likely very well tested by experts, but not by me. What if I could predict the day of week and time of day that the Euro drifts off against the dollar? I’d place my bet once a week and then go off to the casino. Or Disney. Or shopping. Oh, bother! It all looks like there’s data teeming everywhere, everywhere, and I’m only going to live so long.

1 February 2012

Filed under: Fun, predictive analytics, Why predictive modeling? — Tags: , , — kevinmacdonell @ 2:21 pm

I’m laying on the couch with a bad head cold, and there’s a mix of snow and rain in the forecast. Time to curl up with my laptop and a cup of tea. I’ve got a question for you!

Not long ago I asked you to give me examples of institutions you’re aware of that are shining examples of institution-wide data-driven decision making. I was grateful for the responses, but no single institution was named twice. A few people offered an opinion about how their own organizations size up, which I found interesting.

So let’s explore that a bit more with a quick and anonymous poll: Where do you think your non-profit organization or institution fits on the Culture of Analytics Ladder? (That’s CoAL for short … but no need to be so formal. I totally made this up under the influence of cold medication.) Don’t over think it. Just pick whatever stage you feel your org or institution occupies.

The categories may seem a bit vague. If it’s any help, by “analysis” or “analytics” I am referring to the process of sifting through large quantities of data in search of patterns that lead to insights, primarily about your constituents. I am NOT referring to reporting. In fact I want you to ignore a lot of the day-to-day processes that involve data but are not really “analysis,” including: data entry, gift accounting, appeal segmentation, reporting on historical results, preparation of financials, and so on.

I am thinking more along the lines of modelling for the prediction of behaviours (which group of constituents are most likely to engage in such-and-so a behaviour?), prediction of future results (i.e., forecasting), open-ended exploration of constituent data in search of “clusters”, and and any other variety of data work that would be called on to make a decision about what to do in the future, as opposed to documenting what happened in the past. I am uncertain whether A/B split testing fits my definition of analysis, but let’s be generous and say that it does.

A couple of other pointers:

• If you work for, say, a large university advancement department and aren’t sure whether analytics is used in other departments such as student admissions or recruitment, then answer just for your department. Same thing if you work for a regional office of a large non-profit and aren’t sure about the big picture.
• If you have little or no in-house expertise, but occasionally hire a vendor to produce predictive modelling scores, then you might answer “6” — but only if those scores are actually being well used.

Here we go.

17 November 2011

Add life to numbers with Excel colour scales

Filed under: Excel, Fun, Visualization — kevinmacdonell @ 7:21 am

For telling a story with data nothing beats a good visualization, which for most of us means a chart, usually created in Excel. Sometimes the situation seems to call for including the data in table format as well. You may include it after the chart, or throw it into an appendix; either way, you’re creating redundancy: The table and the chart are telling the same story.

The table offers accurate detail, but the person who’s trying to find the story in the data needs to study the table for a while to find it, even if there are only very few rows and columns. The chart is readily accessible (or should be), but probably trades away accurate detail to gain simplicity. Today I’ll show you that sometimes you can get the best of both in a single graphic.

Let’s start with a table that looks a bit complicated. This is copied from a study published on this blog by Peter Wylie and John Sammis (Is the online behaviour of your alums worth exploring?) For this example, it’s not essential that you understand what the chart is about. The important thing is to attempt to recognize patterns in the table. You can probably spot them, but only after studying the numbers and comparing values across cells, both up and down and from side to side. It’s not exactly “at a glance” comprehension.

The paper didn’t offer a chart for this table, but were we to express the data visually, we might have used something like the chart below. It’s a bit complex because of all the lines, but we can readily distinguish some key things about the Class Year deciles (that’s what “CY Decile” stands for). First of all, they’re arranged in order, from top to bottom, by age. The oldest decile starts high and stays up there, while the two youngest deciles languish at the bottom. All the rest are stacked in between. I haven’t told you what they are high IN, but let’s not worry about that for now.

Whether the chart is helpful or not, it’s still takes up extra room: One table, one chart, same data. What if we could COMBINE the two types of information — numerical and visual — into a single entity? Fortunately, newer versions of Excel offer the ability to include visual cues within the data table itself using conditional formatting. Here is the original table with colour added. Now what do you see?

That’s right, when cell are shaded according to the values they contain, the patterns are obvious. The lower the CY Decile value and the higher the Visits value, the greater is the number in the cell. Our eye is immediately drawn to the intersection of row ‘CY Decile 1′ and the column ‘8 or more visits’. And as it turns out, this is a very special cell.

How so? Well, this table was created using data from a university that tracks how many times its alumni have visited its website via links sent out in university emails — the columns break down alumni by number of web visits. As well, for this analysis, all alumni were divided up into ten equal groups (deciles) by Class Year, the oldest being CY Decile 1, and the youngest CY Decile 10.

The values in the cells are actually percentages — the percentage of alumni in each group who have Lifetime Giving of \$10,000 or more. No one should be surprised that the oldest alumni have the highest percentages, but the truly awesome finding is that frequent web visits are also associated with high levels of lifetime giving. (Anyone who has this data at their institution and fails to make use of it for prospect identification is obviously not interested in raising scads of money. Consider firing them.)

As I said, our eye is immediately drawn to the darkest cell, the 42.9% value. It’s almost like that cell is the “hottest.” Which leads me to the next version of the chart: a “heat map”. Sometimes we want to highlight both extremes, the highest AND the lowest values. It’s very intuitive to think of high values as “hot” and low values as “cool,” and colour them appropriately, like so:

Adding shading is quick and easy. Start by clicking and dragging to select the cells of your table that hold the values you want to jazz up. On the Home tab, click on the Conditional Formatting icon, and select Color Scales. Here you’ve got all sorts of options, one of which is the cold-to-hot scheme. Play with it to your heart’s content: Some shadings might be more appropriate for your data than others.

I can see this being useful for at-a-glance exploration of large tables in which patterns may be difficult to detect. For presenting your data to others, though, adding colour scales makes sense only when the physical layout of the cell values contains a visual pattern that you want to exploit to tell a story. Otherwise it’s just visual junk. The object is not just to make our charts pretty, but to make the data sing.

Older Posts »