Figuring out connections is a big part of major gift fundraising. A development office might have a great list of prospective donors in hand, but those prospects will languish untouched if the right people who can make introductions are never found. While cultivating relationships with well-connected people is clearly important, research also plays a role in suggesting possible go-betweens. And to the extent that research involves sifting through data to find connections, it’s possible to automate some of the work.
Wouldn’t it be great to have software that asks you to name a prospective donor, and then spits out a list of people in your database who are most likely to be associates of that person? Sure it would!
Social media applications do a version of this, suggesting “people you may know,” based on shared connections and attributes such as schools attended, professional group memberships, employment information, and stated interests. I first encountered a higher-ed advancement application of this concept via Erich Preisendorfer of the University of New Hampshire. A while back, Erich explained his approach in a guest post for CoolData. (Read it here: Finding connections to your major gift prospects in your data.) I was recently in touch with Erich and he reports that the tool is still in use by prospect researchers at UNH and is still turning up valuable connections.
I tried to build something similar using our reporting tools, but it was a slow, lumbering beast, and not very user-friendly for the people most likely to be interested in using it — prospect researchers, mainly. I never gave up on the idea of building something better, and finally on a quiet weekend a few months ago, I found two or three hours to mess around with a Python script. Given the ID of a person, this script makes multiple queries of the database and joins the results together to produce a ranked list of other people who might be associated with the input ID.
When the script is run, it prompts the user for an ID, then an Oracle database password. I’ll get into more detail later, but here is what the output looks like in the Python shell of IDLE, the development tool that I used to write the script, when I use my own ID as the one to find connections for. (I have blanked out the ID and password.)
The script goes through these steps:
- It takes the ID provided by the user and queries the database to display the name associated with that ID (that would be me, in this example).
- It checks to see what events I attended, and produces a list (in memory) of all the IDs of constituents who attended those same events.
- It checks to see what activity codes I have attached to my record — varsity sports teams, student clubs and so on. I have none, but if I did, the script would return a list of all the people who shared the same activity codes. Again, this is saved in memory for use later on.
- The same routine is followed for employers (past and present), all class years, degrees, referring high school, cities extracted from mailing addresses (again, all past and present addresses), age, and database cross-references.
- All of these lists are combined into a single “object” in memory, and the number of instances of each distinct ID is counted up. Every ID retrieved counts as a “hit”. The idea is, the more hits for any one retrieved ID, the more likely that ID belongs to someone who knows me (or knows the person we are searching on).
- Because the resulting list is huge (more than 31,000 records just for me — and I have very little data attached to my record), the script asks the user to limit by a minimum number of hits. I’ve entered 3 here.
- The result is a file with 47 IDs. The script then does a final query on the database to retrieve the names and other details for these people, and saves the results as a .csv file that can be opened in Excel.
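The combine-count-filter core of the steps above can be sketched in a few lines of pandas. This is a toy example with made-up PIDMs standing in for the real query results, not the actual database code:

```python
import pandas as pd

# Toy result sets: each DataFrame holds the PIDMs returned by one query
events = pd.DataFrame({'PIDM': [101, 102, 103]})
employers = pd.DataFrame({'PIDM': [102, 103, 104]})
cities = pd.DataFrame({'PIDM': [103, 105]})

# Stack all result sets, then count how often each PIDM appears ("hits")
results = pd.concat([events, employers, cities])
hits = results.groupby('PIDM').size().reset_index(name='count_sum')

# Keep only PIDMs with at least 2 hits, sorted by hit count, descending
hits = hits[hits['count_sum'] >= 2].sort_values('count_sum', ascending=False)
print(hits)
```

Here PIDM 103 shows up in all three result sets (three hits) and 102 in two, so only those two survive the cutoff.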
Here are the top few rows of the output. I have removed some columns for simplicity, and blanked out some personal information. The count_sum column is the “hit count”, and the list is sorted in descending order by count_sum. The person being searched on is always at the top of the list. This is not an especially impressive example, because I am not an alumnus. My list of likely connections is made up of co-workers and people from my hometown who happen to be the same age as me. The woman in the third row of the data was someone I had a crush on in junior high school but haven’t seen in 27 years … If I were a prospective donor, this search would be a bust. Clearly this method requires quality alumni data in order to be useful.
That’s how it functions. We will dig into the actual code shortly. But first, let me say that my primary aim is to share this with you as an idea. I can’t help you with installing Python or debugging or any of that. All I can do is show you my code and hope you can modify it to make it work for you. If you think this is cool and useful but don’t want to learn any code, I suggest you team up with someone from IT or your Advancement Services team, or a student or prof in the Computing Science department.
As I said, I’m using Python and a code library for handling data called pandas. (For some background, see my earlier posts: Getting bitten by Python and Automate RFM scoring of your donors with this Python script.) The great thing about pandas is that you can load data into in-memory structures called DataFrames, which behave like data tables: you can perform joins, sort, rank, de-dupe, add calculated columns, and so on.
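If you haven’t used pandas before, here is a quick taste of those DataFrame operations on some invented data — a de-dupe, a join, a calculated column, and a sort:

```python
import pandas as pd

people = pd.DataFrame({'PIDM': [1, 2, 2, 3],
                       'NAME': ['Ann', 'Bob', 'Bob', 'Cal']})
scores = pd.DataFrame({'PIDM': [1, 2, 3],
                       'SCORE': [10, 30, 20]})

df = people.drop_duplicates()                     # de-dupe rows
df = pd.merge(df, scores, how='left', on='PIDM')  # join on a key column
df['DOUBLE'] = df['SCORE'] * 2                    # add a calculated column
df = df.sort_values('SCORE', ascending=False)     # sort descending
print(df)
```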
A few closing comments before I show you the code:
- Erich at UNH goes the extra step of matching on date range along with employer. I think a shared corporate connection is worth a point, but being at the same place of employment at the same time is obviously worth a lot more. If you have other ideas for things to match on, I’d love to hear them.
- Those of you who aren’t into Python but know SQL can probably make an all-SQL version of this. I just like Python.
- In the code, I have inserted comments to help guide you through. Everything to the right of a “#” is a comment.
- The script includes multiple individual queries. You may wish to lump them together using UNION ALL, appending all the results sets together right in the body of the SQL. Up to you.
- The queries that pull records based on shared events, degrees and so on all have the same basic structure: First, a WITH clause gathers the codes for the events (or degrees or cities) that the target ID has attached to his or her record, then a SELECT statement finds all the other IDs who have those same codes attached to their records.
- In order to break the SQL up into multiple lines, I have enclosed the string between groups of three single quotes. You’ll see that whenever I have to insert a variable into the SQL (such as ‘target_pidm’), I have to use what looks like a confusing jumble of single and double quotes — but the ordering is very precise.
- I was going to simplify the SQL for you, but didn’t see the point. One way or the other, you will still have to modify it. So it’s much the same as what I actually use. You’ll also notice that I tend not to use aliases for tables and views, which makes the queries text-heavy. Sorry.
- One limitation of this tool is that it will run only on a machine that has a Python installation. For that reason I haven’t been able to simply give it to someone (in Prospect Research, say) and have them run it themselves. The next step might be bundling it as a stand-alone executable file (an .exe) that could be shared with others, but I haven’t gotten that to work yet.
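The quote-juggling I mentioned is easier to see in isolation. Here is the pattern for splicing a variable into a triple-quoted SQL string, using made-up table and column names rather than anything from my actual queries:

```python
target_pidm = 12345
target_id = 'A00123'

# Numeric variable: convert with str() and concatenate onto the string
SQL = '''SELECT NAME
FROM SOME_TABLE
WHERE PIDM = ''' + str(target_pidm)

# String variable: it must also be wrapped in single quotes inside the
# SQL itself, which is where the jumble of quote marks comes from
SQL2 = '''SELECT NAME
FROM SOME_TABLE
WHERE ID = ''' + "'" + target_id + "'"

print(SQL2)
```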
AND, one final, final comment: This is probably the last piece on Python and database querying I’m going to write for a while. In future posts I will tone down the technical stuff and get back to talking about data and predictive modelling.
# Import required code libraries
import pandas as pd
import numpy as np
import pandas.io.sql as psql
import cx_Oracle, os
# User prompts, and connect to Oracle database
target = raw_input('Enter ID to find connectors for: ')
user = raw_input('Enter username for database: ')
password = raw_input('Enter password for database: ')
connectstr = user + '/' + password + '@database:0000'
connection = cx_Oracle.connect(connectstr)
# Query to get the database PIDM and name for the entered ID.
SQL = '''
(BANINST1.AA_ENTITY.CHANGE_IND IS NULL
AND BANINST1.AA_ENTITY.ID=''' + "'" + target + "'" + ''')'''
df = psql.frame_query(SQL, con=connection)
name = str(df['FIRST_NAME'].iloc[0]) + ' ' + str(df['LAST_NAME'].iloc[0])
print('Searching for connections to: '+ name)
target_pidm = max(df['PIDM']) # PIDM, not ID, is the primary key in my database.
# Find events
SQL = '''
WITH events AS (
WHERE BANINST1.AA_EVENTS.PIDM = ''' +str(target_pidm)+ ''')
events.EVNT_CRN = AA_EVENTS.EVNT_CRN (+)'''
events = psql.frame_query(SQL, con=connection)
print("Found " + str(len(events)) + " people who attended the same events.")
# Find activities
SQL = '''WITH activities AS (
BANINST1.AA_ACTIVITY_AND_LEADERS.PIDM = '''+str(target_pidm)+''')
activities.ACTP_CODE = BANINST1.AA_ACTIVITY_AND_LEADERS.ACTP_CODE (+)'''
activities = psql.frame_query(SQL, con=connection)
print("Found " + str(len(activities)) + " people with the same Activity codes.")
# Find employers
SQL = '''WITH employers AS (
BANINST1.AA_ALL_EMPLOYMENT.PIDM = '''+str(target_pidm)+''')
employers.EMPR_NAME = BANINST1.AA_ALL_EMPLOYMENT.EMPR_NAME (+)'''
employers = psql.frame_query(SQL, con=connection)
print("Found " + str(len(employers)) + " people with the same Employers.")
# Find class years
SQL = '''WITH classes AS (
BANINST1.AA_DEGREE.PIDM = '''+str(target_pidm)+''' )
classes.ACYR_CODE = BANINST1.AA_DEGREE.ACYR_CODE (+)'''
classes = psql.frame_query(SQL, con=connection)
print("Found " + str(len(classes)) + " people with the same class year(s).")
# Find degrees
SQL = '''
WITH degrees AS (
BANINST1.AA_DEGREE.PIDM = '''+str(target_pidm)+'''
degrees.DEGC_CODE = BANINST1.AA_DEGREE.DEGC_CODE (+)'''
degrees = psql.frame_query(SQL, con=connection)
print("Found " + str(len(degrees)) + " people with the same degree(s).")
# Find high school
SQL = '''
WITH high_school AS (
BANINST1.AA_HIGH_SCHOOL.PIDM = '''+str(target_pidm)+'''
high_school.HIGH_SCHOOL_CODE = BANINST1.AA_HIGH_SCHOOL.HIGH_SCHOOL_CODE (+)'''
hs = psql.frame_query(SQL, con=connection)
print("Found " + str(len(hs)) + " people from the same high school.")
# Find cities
SQL = '''
WITH cities AS (
SATURN.SPRADDR.SPRADDR_PIDM = '''+str(target_pidm)+'''
SATURN.SPRADDR.SPRADDR_PIDM AS PIDM
cities.SPRADDR_CITY = SATURN.SPRADDR.SPRADDR_CITY (+)'''
cities = psql.frame_query(SQL, con=connection)
print("Found " + str(len(cities)) + " people with addresses in the same city/cities.")
# Find age
SQL = '''
WITH age AS (
BANINST1.AA_ENTITY.PIDM = '''+str(target_pidm)+'''
age.AGE = BANINST1.AA_ENTITY.AGE (+)'''
age = psql.frame_query(SQL, con=connection)
print("Found " + str(len(age)) + " people of the same age.")
# Find cross-references
SQL = '''
WITH xref AS (
ALUMNI.APRXREF.APRXREF_PIDM = '''+str(target_pidm)+'''
ALUMNI.APRXREF.APRXREF_PIDM AS PIDM
xref.APRXREF_XREF_PIDM = ALUMNI.APRXREF.APRXREF_XREF_PIDM (+)'''
xref = psql.frame_query(SQL, con=connection)
print("Found " + str(len(xref)) + " people who share the same cross-references.")
# Concatenate all results
results = pd.concat([events, activities, employers, classes, degrees, hs, cities, age, xref])
# Remove unneeded columns
results = results.drop(['ACTP_CODE', 'ACYR_CODE', 'AGE', 'DEGC_CODE', 'EMPR_NAME', 'EVNT_CRN', 'HIGH_SCHOOL_CODE', 'SPRADDR_CITY', 'APRXREF_XREF_PIDM'], axis=1)
print("Found " + str(len(results)) + " total results.")
## Count up the number of times each PIDM appears in the results. (A few steps here.)
# Add a column called ‘count’ and populate every row with a ‘1’
results['count'] = 1
# Group by PIDM and sum on ‘count’ to get ‘hits per PIDM’.
results = results.join(results.groupby(results['PIDM'])['count'].sum(), on=['PIDM'], rsuffix='_sum')
# De-dupe rows in order to get one result per PIDM.
results = results.drop_duplicates()
# Sort results in descending order by the number of hits.
results.sort(columns='count_sum', axis=0, ascending=False, inplace=True)
# The file is potentially huge, so prompt the user to limit it by hit count.
cutoff = raw_input('How many hits would you like to limit by? ')
# Keep rows with minimum number of hits.
results = results[results['count_sum'] >= int(cutoff)]
# Get rid of the ‘count’ column - not needed.
results = results.drop(['count'], axis=1)
print("Found " + str(len(results)) + " unique results, with at least "+cutoff+" hits.")
## Join names to results
print('Preparing results and saving.')
SQL = '''
BANINST1.AA_ENTITY.PIDM = BANINST1.AA_DONOR_TYPE_HIGHEST.PIDM (+)
AND BANINST1.AA_ENTITY.CHANGE_IND IS NULL'''
results = pd.merge(results, psql.frame_query(SQL, con=connection), how='left', on=['PIDM'], sort=False)
results.to_csv(target + ' - ' + name + '.csv')