CoolData blog

16 May 2014

Find out “who knows who” in your database with SQL and Python

Filed under: Fun, Prospect identification, Python, SQL — kevinmacdonell @ 12:30 pm

Figuring out connections is a big part of major gift fundraising. A development office might have a great list of prospective donors in hand, but those prospects will languish untouched if the right people who can make introductions are never found. While cultivating relationships with well-connected people is clearly important, research also plays a role in suggesting possible go-betweens. And to the extent that research involves sifting through data to find connections, it’s possible to automate some of the work.

Wouldn’t it be great to have software that asks you to name a prospective donor, and then spits out a list of people in your database who are most likely to be associates of that person? Sure it would!

Social media applications do a version of this, suggesting “people you may know,” based on shared connections and attributes such as schools attended, professional group memberships, employment information, and stated interests. I first encountered a higher-ed advancement application of this concept via Erich Preisendorfer of the University of New Hampshire. A while back, Erich explained his approach in a guest post for CoolData. (Read it here: Finding connections to your major gift prospects in your data.) I was recently in touch with Erich and he reports that the tool is still in use by prospect researchers at UNH and is still turning up valuable connections.

I tried to build something similar using our reporting tools, but it was a slow, lumbering beast, and not very user-friendly for the people most likely to be interested in using it — prospect researchers, mainly. I never gave up on the idea of building something better, and finally on a quiet weekend a few months ago, I found two or three hours to mess around with a Python script. Given the ID of a person, this script makes multiple queries of the database and joins the results together to produce a ranked list of other people who might be associated with the input ID.

When the script is run, it prompts the user for an ID, then an Oracle database password. I’ll get into more detail later, but here is what the output looks like in the Python shell of IDLE, the development tool that I used to write the script, when I use my own ID as the one to find connections for. (I have blanked out the ID and password.)

[Screenshot: the script’s prompts and output in the IDLE Python shell]

The script goes through these steps:

  1. It takes the ID provided by the user and queries the database to display the name associated with that ID (that would be me, in this example).
  2. It checks to see what events I attended, and produces a list (in memory) of all the IDs of constituents who attended those same events.
  3. It checks to see what activity codes I have attached to my record — varsity sports teams, student clubs and so on. I have none, but if I did, the script would return a list of all the people who shared the same activity codes. Again, this is saved in memory for use later on.
  4. The same routine is followed for employers (past and present), all class years, degrees, referring high school, cities extracted from mailing addresses (again, all past and present addresses), age, and database cross-references.
  5. All of these lists are combined into a single “object” in memory. The number of instances of each distinct ID is counted up. Every ID retrieved is a “hit”. The idea is, the more hits for any one retrieved ID, the more likely that ID belongs to someone who knows me (or knows the person we are searching on). (A minimal sketch of this counting step follows the list.)
  6. Because the resulting list is huge (more than 31,000 records just for me — and I have very little data attached to my record), the script asks the user to limit by a minimum number of hits. I’ve entered 3 here.
  7. The result is a file with 47 IDs. The script then does a final query on the database to retrieve the names and other details for these people, and saves the results as a .csv file that can be opened in Excel.
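To make the counting in steps 5 and 6 concrete, here is a minimal sketch of the idea in pandas, using made-up IDs and a made-up cutoff. The real script builds its lists from the database queries shown further down.

import pandas as pd

# Hypothetical lists of IDs returned by the separate queries
# (same events, same activities, same employers, and so on).
event_ids = [101, 102, 103, 104]
activity_ids = [102, 103, 105]
employer_ids = [103, 106]

# Pool all the IDs and count how many times each one appears: its "hits".
all_ids = pd.Series(event_ids + activity_ids + employer_ids)
hit_counts = all_ids.value_counts()

# Keep only the IDs that meet a minimum number of hits.
cutoff = 2
likely_connections = hit_counts[hit_counts >= cutoff]
print(likely_connections)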

Here are the top few rows of the output. I have removed some columns for simplicity, and blanked out some personal information. The count_sum column is the “hit count”, and the list is sorted in descending order by count_sum. The person being searched on is always at the top of the list. This is not an especially impressive example, because I am not an alumnus. My list of likely connections is made up of co-workers and people from my hometown who happen to be the same age as me. The woman in the third row of the data was someone I had a crush on in junior high school but haven’t seen in 27 years … If I were a prospective donor, this search would be a bust. Clearly this method requires quality alumni data in order to be useful.

 

[Screenshot: top rows of the output file, sorted in descending order by count_sum]

That’s how it functions. We will dig into the actual code shortly. But first, let me say that my primary aim is to share this with you as an idea. I can’t help you with installing Python or debugging or any of that. All I can do is show you my code and hope you can modify it to make it work for you. If you think this is cool and useful but don’t want to learn any code, I suggest you team up with someone from IT or your Advancement Services team, or a student or prof in the Computing Science department.

As I said, I’m using Python and a code library for handling data called pandas. (For some background, see my earlier posts: Getting bitten by Python and Automate RFM scoring of your donors with this Python script.) The great thing about pandas is that you can load data into in-memory structures called DataFrames, which are like data tables in that you can perform joins and sort, rank, de-dupe, add calculated columns, and so on.
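Just to give a flavour of those operations, here is a tiny, self-contained example with invented toy data (the values and column names are made up for illustration):

import pandas as pd

people = pd.DataFrame({'PIDM': [1, 2, 2, 3], 'CITY': ['Halifax', 'Truro', 'Truro', 'Sydney']})
ages = pd.DataFrame({'PIDM': [1, 2, 3], 'AGE': [44, 60, 38]})

merged = pd.merge(people, ages, how='left', on='PIDM')    # join two DataFrames on a key
merged = merged.drop_duplicates()                         # de-dupe identical rows
merged['IS_LOCAL'] = merged['CITY'] == 'Halifax'          # add a calculated column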

A few closing comments before I show you the code:

  • Erich at UNH goes the extra step of matching on date range along with employer. I think a shared corporate connection is worth a point, but being at the same place of employment at the same time is obviously worth a lot more. If you have other ideas for things to match on, I’d love to hear them.
  • Those of you who aren’t into Python but know SQL can probably make an all-SQL version of this. I just like Python.
  • In the code, I have inserted comments to help guide you through. Everything to the right of a “#” is a comment.
  • The script includes multiple individual queries. You may wish to lump them together using UNION ALL, appending all the result sets together right in the body of the SQL. Up to you. (A rough sketch of that approach follows this list.)
  • The queries that pull records based on shared events, degrees and so on all have the same basic structure: First, a WITH clause gathers the codes for the events (or degrees or cities) that the target ID has attached to his or her record, then a SELECT statement finds all the other IDs who have those same codes attached to their records.
  • In order to break the SQL up into multiple lines, I have enclosed the string between groups of three single quotes. You’ll see that whenever I have to insert a variable into the SQL (such as ‘target_pidm’), I have to use what looks like a confusing jumble of single and double quotes — but the ordering is very precise.
  • I was going to simplify the SQL for you, but didn’t see the point. One way or the other, you will still have to modify it. So it’s much the same as what I actually use. You’ll also notice that I tend not to use aliases for tables and views, which makes the queries text-heavy. Sorry.
  • One limitation of this tool is that it will run only on a machine that has a Python installation. For that reason I haven’t been able to simply give it to someone (in Prospect Research, say) and have them run it themselves. The next step might be bundling it as a stand-alone executable file (an .exe) that could be shared with others, but I haven’t gotten that to work yet.
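To illustrate the UNION ALL suggestion from the list above, here is a rough sketch only, reusing two of the subqueries from the script below with a placeholder PIDM; the hit counting would then run against the combined result set rather than on separate DataFrames.

target_pidm = 12345    # placeholder PIDM, for illustration only

SQL = '''
WITH events AS (
SELECT BANINST1.AA_EVENTS.EVNT_CRN
FROM BANINST1.AA_EVENTS
WHERE BANINST1.AA_EVENTS.PIDM = ''' + str(target_pidm) + '''),
activities AS (
SELECT BANINST1.AA_ACTIVITY_AND_LEADERS.ACTP_CODE
FROM BANINST1.AA_ACTIVITY_AND_LEADERS
WHERE BANINST1.AA_ACTIVITY_AND_LEADERS.PIDM = ''' + str(target_pidm) + ''')

SELECT BANINST1.AA_EVENTS.PIDM
FROM events, BANINST1.AA_EVENTS
WHERE events.EVNT_CRN = BANINST1.AA_EVENTS.EVNT_CRN
UNION ALL
SELECT BANINST1.AA_ACTIVITY_AND_LEADERS.PIDM
FROM activities, BANINST1.AA_ACTIVITY_AND_LEADERS
WHERE activities.ACTP_CODE = BANINST1.AA_ACTIVITY_AND_LEADERS.ACTP_CODE'''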

AND, one final, final comment: This is probably the last piece on Python and database querying I’m going to write for a while. In future posts I will tone down the technical stuff and get back to talking about data and predictive modelling.

 

# Import required code libraries

import pandas as pd
import numpy as np
import pandas.io.sql as psql
import cx_Oracle, os


# User prompts, and connect to Oracle database

target = raw_input('Enter ID to find connectors for: ')    # this ID string is spliced into the first query below

user = raw_input('Enter username for database: ')

password = raw_input('Enter password for database: ')

connectstr = user + '/' + password + '@database:0000'

connection = cx_Oracle.connect(connectstr)


# Query to get the database PIDM and name for the entered ID.

SQL = '''
SELECT
BANINST1.AA_ENTITY.PIDM,
BANINST1.AA_ENTITY.ID,
BANINST1.AA_ENTITY.FIRST_NAME,
BANINST1.AA_ENTITY.MI,
BANINST1.AA_ENTITY.LAST_NAME
FROM BANINST1.AA_ENTITY
WHERE
(BANINST1.AA_ENTITY.CHANGE_IND IS NULL
AND BANINST1.AA_ENTITY.ID=''' + "'" + target + "'" + ''')'''

df = psql.frame_query(SQL, con=connection)

# Pull the name fields out of the single-row result.
name = str(df['FIRST_NAME'].iloc[0]) + ' ' + str(df['LAST_NAME'].iloc[0])

print('Searching for connections to: ' + name)

target_pidm = max(df['PIDM'])    # PIDM, not ID, is the primary key in my database.

 

 

# Find events

SQL = '''
WITH events AS (
SELECT BANINST1.AA_EVENTS.EVNT_CRN
FROM BANINST1.AA_EVENTS
WHERE BANINST1.AA_EVENTS.PIDM = ''' + str(target_pidm) + ''')

SELECT
events.EVNT_CRN,
BANINST1.AA_EVENTS.PIDM
FROM events, BANINST1.AA_EVENTS
WHERE events.EVNT_CRN = AA_EVENTS.EVNT_CRN (+)'''

events = psql.frame_query(SQL, con=connection)
print("Found " + str(len(events)) + " people who attended the same events.")


# Find activities

SQL = '''
WITH activities AS (
SELECT BANINST1.AA_ACTIVITY_AND_LEADERS.ACTP_CODE
FROM BANINST1.AA_ACTIVITY_AND_LEADERS
WHERE BANINST1.AA_ACTIVITY_AND_LEADERS.PIDM = ''' + str(target_pidm) + ''')

SELECT
activities.ACTP_CODE,
BANINST1.AA_ACTIVITY_AND_LEADERS.PIDM
FROM activities, BANINST1.AA_ACTIVITY_AND_LEADERS
WHERE activities.ACTP_CODE = BANINST1.AA_ACTIVITY_AND_LEADERS.ACTP_CODE (+)'''

activities = psql.frame_query(SQL, con=connection)
print("Found " + str(len(activities)) + " people with the same Activity codes.")


# Find employers

SQL = '''
WITH employers AS (
SELECT BANINST1.AA_ALL_EMPLOYMENT.EMPR_NAME
FROM BANINST1.AA_ALL_EMPLOYMENT
WHERE BANINST1.AA_ALL_EMPLOYMENT.PIDM = ''' + str(target_pidm) + ''')

SELECT
employers.EMPR_NAME,
BANINST1.AA_ALL_EMPLOYMENT.PIDM
FROM employers, BANINST1.AA_ALL_EMPLOYMENT
WHERE employers.EMPR_NAME = BANINST1.AA_ALL_EMPLOYMENT.EMPR_NAME (+)'''

employers = psql.frame_query(SQL, con=connection)
print("Found " + str(len(employers)) + " people with the same Employers.")


# Find class years

SQL = '''
WITH classes AS (
SELECT BANINST1.AA_DEGREE.ACYR_CODE
FROM BANINST1.AA_DEGREE
WHERE BANINST1.AA_DEGREE.PIDM = ''' + str(target_pidm) + ''')

SELECT
classes.ACYR_CODE,
BANINST1.AA_DEGREE.PIDM
FROM classes, BANINST1.AA_DEGREE
WHERE classes.ACYR_CODE = BANINST1.AA_DEGREE.ACYR_CODE (+)'''

classes = psql.frame_query(SQL, con=connection)
print("Found " + str(len(classes)) + " people with the same class year(s).")


# Find degrees

SQL = '''
WITH degrees AS (
SELECT BANINST1.AA_DEGREE.DEGC_CODE
FROM BANINST1.AA_DEGREE
WHERE BANINST1.AA_DEGREE.PIDM = ''' + str(target_pidm) + ''')

SELECT
degrees.DEGC_CODE,
BANINST1.AA_DEGREE.PIDM
FROM degrees, BANINST1.AA_DEGREE
WHERE degrees.DEGC_CODE = BANINST1.AA_DEGREE.DEGC_CODE (+)'''

degrees = psql.frame_query(SQL, con=connection)
print("Found " + str(len(degrees)) + " people with the same degree(s).")


# Find high school

SQL = '''
WITH high_school AS (
SELECT BANINST1.AA_HIGH_SCHOOL.HIGH_SCHOOL_CODE
FROM BANINST1.AA_HIGH_SCHOOL
WHERE BANINST1.AA_HIGH_SCHOOL.PIDM = ''' + str(target_pidm) + ''')

SELECT
high_school.HIGH_SCHOOL_CODE,
BANINST1.AA_HIGH_SCHOOL.PIDM
FROM high_school, BANINST1.AA_HIGH_SCHOOL
WHERE high_school.HIGH_SCHOOL_CODE = BANINST1.AA_HIGH_SCHOOL.HIGH_SCHOOL_CODE (+)'''

hs = psql.frame_query(SQL, con=connection)
print("Found " + str(len(hs)) + " people from the same high school.")


# Find cities

SQL = '''
WITH cities AS (
SELECT SATURN.SPRADDR.SPRADDR_CITY
FROM SATURN.SPRADDR
WHERE SATURN.SPRADDR.SPRADDR_PIDM = ''' + str(target_pidm) + ''')

SELECT DISTINCT
cities.SPRADDR_CITY,
SATURN.SPRADDR.SPRADDR_PIDM AS PIDM
FROM cities, SATURN.SPRADDR
WHERE cities.SPRADDR_CITY = SATURN.SPRADDR.SPRADDR_CITY (+)'''

cities = psql.frame_query(SQL, con=connection)
print("Found " + str(len(cities)) + " people with addresses in the same city/cities.")


# Find age

SQL = '''
WITH age AS (
SELECT BANINST1.AA_ENTITY.AGE
FROM BANINST1.AA_ENTITY
WHERE BANINST1.AA_ENTITY.PIDM = ''' + str(target_pidm) + ''')

SELECT
age.AGE,
BANINST1.AA_ENTITY.PIDM
FROM age, BANINST1.AA_ENTITY
WHERE age.AGE = BANINST1.AA_ENTITY.AGE (+)'''

age = psql.frame_query(SQL, con=connection)
print("Found " + str(len(age)) + " people of the same age.")


# Find cross-references

SQL = '''
WITH xref AS (
SELECT ALUMNI.APRXREF.APRXREF_XREF_PIDM
FROM ALUMNI.APRXREF
WHERE ALUMNI.APRXREF.APRXREF_PIDM = ''' + str(target_pidm) + ''')

SELECT
ALUMNI.APRXREF.APRXREF_XREF_PIDM,
ALUMNI.APRXREF.APRXREF_PIDM AS PIDM
FROM xref, ALUMNI.APRXREF
WHERE xref.APRXREF_XREF_PIDM = ALUMNI.APRXREF.APRXREF_XREF_PIDM (+)'''

xref = psql.frame_query(SQL, con=connection)
print("Found " + str(len(xref)) + " people who share the same cross-references.")

 

 

 

# Concatenate all results

print("Joining results.")

results = pd.concat([events, activities, employers, classes, degrees, hs, cities, age, xref])

# Remove unneeded columns
results = results.drop(['ACTP_CODE', 'ACYR_CODE', 'AGE', 'DEGC_CODE', 'EMPR_NAME', 'EVNT_CRN', 'HIGH_SCHOOL_CODE', 'SPRADDR_CITY', 'APRXREF_XREF_PIDM'], axis=1)

print("Found " + str(len(results)) + " total results.")


## Count up the number of times each PIDM appears in the results. (A few steps here.)

# Add a column called 'count' and populate every row with a '1'
results['count'] = 1

# Group by PIDM and sum on 'count' to get 'hits per PIDM'.
results = results.join(results.groupby(results['PIDM'])['count'].sum(), on=['PIDM'], rsuffix='_sum')

# De-dupe rows in order to get one result per PIDM.
results = results.drop_duplicates()

# Sort results in descending order by the number of hits.
results.sort(columns='count_sum', axis=0, ascending=False, inplace=True)

# The file is potentially huge, so prompt the user to limit it by hit count.
cutoff = raw_input('How many hits would you like to limit by? ')

# Keep rows with minimum number of hits.
results = results[results['count_sum'] >= int(cutoff)]

# Get rid of the 'count' column - not needed.
results = results.drop(['count'], axis=1)

print("Found " + str(len(results)) + " unique results, with at least " + cutoff + " hits.")


## Join names to results

print('Preparing results and saving.')

SQL = '''
SELECT
BANINST1.AA_ENTITY.PIDM,
BANINST1.AA_ENTITY.ID,
BANINST1.AA_ENTITY.NAME_PREFIX,
BANINST1.AA_ENTITY.FIRST_NAME,
BANINST1.AA_ENTITY.MI,
BANINST1.AA_ENTITY.LAST_NAME,
BANINST1.AA_ENTITY.NAME_SUFFIX,
BANINST1.AA_ENTITY.AGE,
BANINST1.AA_ENTITY.DEAD_IND,
BANINST1.AA_DONOR_TYPE_HIGHEST.DONR_CODE
FROM
BANINST1.AA_ENTITY,
BANINST1.AA_DONOR_TYPE_HIGHEST
WHERE
BANINST1.AA_ENTITY.PIDM = BANINST1.AA_DONOR_TYPE_HIGHEST.PIDM (+)
AND BANINST1.AA_ENTITY.CHANGE_IND IS NULL
'''

results = pd.merge(results, psql.frame_query(SQL, con=connection), how='left', on=['PIDM'], sort=False)

results.to_csv(target + ' - ' + name + '.csv')

print('All done!')

connection.close()

 

26 February 2014

Automate RFM scoring of your donors with this Python script

Filed under: Python, RFM — kevinmacdonell @ 9:20 pm

The next time you’re asked to refresh RFM scores for your donor base, tell them it’ll take you the better part of the day. Then fire up this script, output the scores in 30 seconds, and use the rest of your morning to work on something more interesting.

Unlike predictive modeling, which ideally requires an analyst’s good judgement in order to do well, RFM scoring lends itself to being completely automated. Based on a donor’s most recent five years of giving history, an RFM score is composed of three simple elements, which can each be scored 1 to 5. The highest score combination is 5-5-5.

  1. Recency: “5” if the donor’s most recent gift was last year, “4” if it was the year before that, and so on.
  2. Frequency: “5” if the donor gave in five out of five past years, “4” if the donor gave in four out of five, and so on.
  3. Monetary value: “5” if the donor’s total giving in the five years is in the top 20% of the donor file, “4” if total giving is in the next 20%, and so on.

This post follows naturally on my previous post, in which I showed how the PIVOT operator in an SQL statement can be used on donor data to arrange yearly giving totals into columns instead of rows, right at the level of the database query. (See Really swell SQL: Why you must know PIVOT and WITH.)

This Python script includes some sample SQL to pull giving data and pivot it — you will need to modify this SQL to match your schema and table names. If your database does not support PIVOT, then just pull the raw data and research how to pivot the data in Python. In fact, pivots and other data manipulations are much easier to do in Python than in SQL if you use a Python code library called pandas, designed specifically for working with data for analysis. Pandas has structures called DataFrames that are like database tables and are very intuitive to work with.

If you can’t connect directly to your database with this script, then you can just as easily read a .csv file into a DataFrame and work with that instead. The source file has to have one column for IDs, and five columns of yearly giving totals, with nulls for zero totals.
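If you do need to pivot in Python, here is a minimal sketch, assuming a raw file called gifts.csv with one row per donor ID per fiscal year; the file name and column names are placeholders for whatever your extract looks like.

import pandas as pd

# Read raw giving data: one row per donor ID per fiscal year,
# with columns ID, FY and GIVING.
raw = pd.read_csv('gifts.csv')

# Pivot so each fiscal year becomes its own column of giving totals,
# leaving nulls wherever a donor had no giving that year.
df = raw.pivot_table(values='GIVING', index='ID', columns='FY', aggfunc='sum').reset_index()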

I am not going into a lot of detail about how each pandas function works. You can find as much information as you want via internet searches. (See another previous CoolData post, Getting bitten by Python.) I don’t promise that my code is elegant. Play with it, improve on it, and extend it to fit your needs.

In addition to comments in the body of the script below (anything after a “#”), here’s a limited explanation of what each section of the RFM calculation does:

RECENCY: The script creates five new columns, one per year. For each row of data, a column evaluates to True if the corresponding giving total is not null (i.e., if the donor has giving in the year being referenced). In Python, ‘True’ is the same as 1, and ‘False’ is the same as zero. If the donor had giving in the most recent year, the first new column evaluates to True, and multiplying True by 5 gives 5. If there was no giving, the result is False (zero), and multiplying by 5 gives zero. The same goes for the next column: if the donor had giving in the year before, the result is True, and multiplying by 4 gives 4. And so on, down to five years ago (multiplied by 1). Then I create a sixth new column, populated with the maximum value found in the previous five columns — this ends up being the Recency portion of the score. Finally, I delete (or “drop”) the five intermediate columns, as they are no longer needed.

FREQUENCY: This part is way simpler … just a count of the number of non-null values in the five columns of yearly giving totals.

MONETARY VALUE: First, we sum giving for all five years. Then these totals are chopped up into quintiles, ignoring any null totals. (Quintiles are five equal-sized groups, each making up 20% of the total number of records.) Each quintile is given a label (1 to 5), with the top quintile being a “5”.

That’s it. The results are saved to a .csv file, which looks like this:

[Screenshot: sample rows from the RFM_results.csv output file]

 

 

# import required code modules.
import pandas as pd
import pandas.io.sql as psql
import cx_Oracle   # for querying an Oracle database
import os  # for reading and writing files to your hard drive

# Prompt user for database username and password.
username = raw_input('Enter username: ')
password = raw_input('Enter password: ')

# Pass text of SQL statement to variable called SQL, enclosed in triple quotes.
# (Note the pivot ... five years of giving data.)

SQL = '''
SELECT *

FROM (
 SELECT TABLE_A.ID,
 SUM ( TABLE_B.GIFT_AMT ) AS GIVING,
 TABLE_B.GIFT_FISC_CODE AS FY

 FROM
 TABLE_A,
 TABLE_B

 WHERE
 (TABLE_A.ID = TABLE_B.ID
  AND TABLE_B.GIFT_FISC_CODE IN (2010, 2011, 2012, 2013, 2014))

 GROUP BY
 TABLE_A.ID,
 TABLE_B.GIFT_FISC_CODE
 )

PIVOT (
  SUM ( GIVING ) AS total FOR ( FY )
  IN (2010 as FY2010, 2011 as FY2011, 2012 as FY2012, 2013 as FY2013, 2014 as FY2014)
  )

'''

# Connect, execute the SQL, and put data into a pandas DataFrame (df)
# "@dsn" (data source name) is the TNS entry (from the Oracle names server or tnsnames.ora file)
# "0000" is the port number

connectstr = username + '/' + password + '@dsn:0000'
connection = cx_Oracle.connect(connectstr)
df = psql.frame_query(SQL, con=connection) 

## RFM score

# RECENCY
# Create five new columns.
df['Recency5'] = df.FY2014_TOTAL.notnull() * 5
df['Recency4'] = df.FY2013_TOTAL.notnull() * 4
df['Recency3'] = df.FY2012_TOTAL.notnull() * 3
df['Recency2'] = df.FY2011_TOTAL.notnull() * 2
df['Recency1'] = df.FY2010_TOTAL.notnull() * 1

# Create a sixth column, setting it to the max value of the previous five columns.
df['Recency'] = df[['Recency5', 'Recency4', 'Recency3', 'Recency2', 'Recency1']].max(axis=1)

# Delete the five original columns - no longer needed.
df = df.drop(['Recency5', 'Recency4', 'Recency3', 'Recency2', 'Recency1'], axis=1)

# FREQUENCY

# Create new column, and set it to the number of non-null values in the giving totals columns.
df['Frequency'] = df[['FY2010_TOTAL', 'FY2011_TOTAL', 'FY2012_TOTAL', 'FY2013_TOTAL', 'FY2014_TOTAL']].count(axis=1)

# MONETARY VALUE

# Create new column, the sum of all giving years.
df['Giving_Total'] = df[['FY2010_TOTAL', 'FY2011_TOTAL', 'FY2012_TOTAL', 'FY2013_TOTAL', 'FY2014_TOTAL']].sum(axis=1, skipna=True)

# Break all giving totals into five quintiles (ignoring any null values), and store the label (1 to 5) in another new column.
quintile_bins = [1, 2, 3, 4, 5]
df['Monetary'] = pd.qcut(df['Giving_Total'], 5, labels = quintile_bins)

# Write results to a .csv file, and display number of saved records to user.
df.to_csv('RFM_results.csv')
print(str(len(df)) + " records output")

30 July 2013

Getting bitten by Python

When I was first learning to build predictive models, preparing the data was part of the adventure. In time, though, many operations on the data became standard instead of exploratory. Eventually they became simply repetitive and tedious. When any task becomes repetitive, I think of ways to automate it. Given that data prep makes up 80 percent of the work of building a model (according to some authors), the benefits of automation are obvious.

I can think of only two ways to replicate the manual operations you need to perform on a large data set to make it ready for modelling: Use software specially designed for the task, or code your own data-handling scripts. I am lazy and drawn to software solutions that make hard things easy, and I’m not a programmer. Yet I have veered away from a ready-made software solution to pursue an interest in the scripting language called Python, and in particular the Python code library called pandas, written specifically for working with data.

Maybe it’s because Python is open-source and free, or because it is powerful, or because it is flexible and widely adaptable to multiple uses on the job. I don’t know. But for the past few months I’ve been obsessed with learning to use it, and that’s what I’d like to talk about today.

I’m guessing very few CoolData readers have experience writing scripts for handling data. I know some people who do most of their stats work in the R language or manipulate data in Excel using VBA. But the majority of readers probably consider themselves severely allergic to coding of any kind. I concede that it isn’t for everyone, but look: Just as we don’t need to be professional statisticians to use statistical tools to create value for the business, we don’t need to be computer scientists to write useful scripts that can free up large chunks of time we now spend on routine tasks that bore us.

(If you work with someone in IT or Advancement Services who pulls and reshapes your data for you, they might be especially interested in the idea of learning how to automate your requests. They might also be familiar with Python already.)

I should say here that my aim is not to automate predictive modelling itself. There are Python modules for modelling, too, from the venerable classics such as regression to the latest advanced techniques. But I’m not so much interested in them, not yet at least. Building predictive models is best done hands-on, guided by a human modeler’s expertise and domain knowledge. My main interest is in eliminating a big chunk of the standard rote work so that I can apply the freshest version of myself to the more interesting and creative elements of data exploration and model creation.

So what is Python (and more specifically, pandas) good for?

  • A script or program can execute a series of database queries and join the results in exactly the way you want, allowing you to build very complex structures and incorporate custom aggregations that might be harder to do using your existing querying/reporting tools. For example, let’s say you want to build a file of donors and include columns for date of first and last gift, amount of highest gift, total cash gifts for the past five fiscal years, and percentage of total giving devoted to student financial assistance. Unless IT has built some advanced views for you from the base tables in your database, many of these variables will require applying some calculations to the raw transactional data. I could certainly build a query to get the results for this modest example, but it would involve a few sub-queries and calculated fields. Multiply that by a hundred and you’ve got an idea of how complex a query you’d have to build to deliver a modelling-ready data set. In fact it may be technically impossible, or at least difficult, to build such a single massive query. In Python, however, you can build your data file up in an orderly series of steps. Adding, removing or editing those steps is not a big deal.
  • Python also makes it simple to read data from .csv and Excel files, and merge it painlessly with the data you’ve extracted from your database. This is important to me because not all of my modelling data comes from our database. I’ve got eight years of call centre data results by alumni ID, wealth-related census data by Canadian postal code, capacity data by American ZIP code, and other standalone data sets. Adding these variables to my file used to be a tedious, manual process. In Python, left-joining 20 columns of census data to a file of 100,000 alumni records using Postal Code as the join key takes a single line of code and executes faster than a knight can say “Ni!” (Inside Python joke.) A short sketch of this kind of join, and of the dummy-variable conversion mentioned in the next bullet, follows this list.
  • Many other common operations also take only one or two lines of code, including conversion of categorical variables to 0/1 dummy variables, performing transformations and mathematical operations on variables, filling in or imputing missing data with constants or calculated values, pivoting data, and creating new variables from existing ones via concatenation (for strings) or math (for numbers).
  • With a script, you can also iterate over the rows of a data file and perform different operations based on conditional statements.
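As a quick illustration of the join and the dummy-variable conversion mentioned above, here is a minimal sketch; the file names and column names are invented for the example.

import pandas as pd

alumni = pd.read_csv('alumni.csv')    # includes 'PostalCode' and 'Sex' columns
census = pd.read_csv('census.csv')    # wealth-related data by postal code

# Left-join the census columns onto the alumni file by postal code.
alumni = pd.merge(alumni, census, how='left', on='PostalCode')

# Convert a categorical variable to 0/1 dummy columns and append them.
dummies = pd.get_dummies(alumni['Sex'], prefix='Sex')
alumni = pd.concat([alumni, dummies], axis=1)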

I’m not going to provide a Python tutorial today (although I’m tempted to do so in the future), but here is a sample line of code from a script, with a description of what it does. This doesn’t give you enough information to do anything useful, but you’ll at least see how compact and powerful the language is.

Skipping some necessary preliminaries, let’s say you’ve just used Python to query your Oracle database to read into memory a data set containing the variables ID, Constituent Category, Sex, and Age for all living constituent persons. (An operation that itself takes little more than two or three lines of code.) Obviously it depends on your database and code structure, but let’s say “Constituent Category” includes codes for such categories as Alumnus/na (ALUM), Non-degreed alumni (ALND), Parent (PRNT), Friend (FRND), Faculty (FCTY), Staff (STAF), and so on. And let’s further assume that a constituent can belong to multiple categories. Most people will have only one code, but it’s possible that a person can simultaneously be an alum, a parent, and a faculty member.

In our script, the data is read into a structure called a DataFrame (a tool provided by the pandas code library). This should sound familiar to users of R in particular. For the rest of us, a DataFrame is very much like a database table, with named columns and numbered (“indexed”) rows. Had we pasted the data into Excel instead, it might look like this:

[Table: sample data in Excel, with one row per ID for each Con_Code]

Right away we see that William and Janet are represented by multiple rows because they have multiple constituent codes. This won’t work for predictive modelling, which requires that we have just one row per individual – otherwise certain individuals would carry more weight in the model than they should. You could say that multiple records for Janet mean that 60-year-old females are over-represented in the data. We could delete the extra rows, but we don’t want to do that because we’d be throwing away important information that is almost certainly informative of our modelling target, e.g. likelihood to make a donation.

In order to keep this information while avoiding duplicate IDs, we need to pivot the data so that each category of Constituent Code (ALUM, PRNT, etc.) becomes its own column. The result we want would look like this in Excel:

[Table: the same data pivoted, with one row per ID and a 0/1 column for each Con_Code category]

The Con_Code column is gone, and replaced with a series of columns, each a former category of Con_Code. In each column is either a 0 or 1, a “dummy variable” indicating whether an individual belongs to that constituency or not.

Getting the data from the first state to the final state requires just three lines of code in Python/pandas:

df = pd.merge(df, pd.crosstab(df.ID, df.Con_Code), how='left', left_on='ID', right_index=True)
df = df.drop(['Con_Code'], axis=1)
df = df.drop_duplicates()

This snippet of code may look invitingly simple or simply terrifying – it depends on your background. Whatever – it doesn’t matter, because my point is only that these three lines are very short, requiring very little typing, yet they elegantly handle a common data prep task that I have spent many hours performing manually.

Here’s a brief summary of what each line does:

Line 1: There’s a lot going on here … First, “df” is just the name of the DataFrame object. I could have called it anything. On the right-hand side, you see “pd” (which is shorthand for pandas, the module of code that is doing the work), then “crosstab,” (a function that performs the actual pivot). In the parentheses after pd.crosstab, we have specified the two columns to use in the pivot: df.ID is the data we want for the rows, and df.Con_Code is the column of categories that we want to expand into as many columns as there are categories. You don’t have to know in advance how many categories exist in your data, or what they are – Python just does it.

Pd.crosstab creates a new table containing only ID and all the new columns. That entity (or “object”) is just sitting out there, invisible, in your computer’s memory. We need to join it back to our original data set so that it is reunited with Age, Sex and whatever other stuff you’ve got. That’s what “pd.merge” does. Again, “pd” is just referencing the pandas module that is providing the “merge” function. The operation is called “merge,” but it’s much the same thing as an SQL-type join, familiar to anyone who queries a database. The merge takes two inputs, our original DataFrame (“df”), and the result from the crosstab operation that I described above. The argument called “how” specifies that we want to perform the equivalent of a left-join. A couple of other optional arguments explicitly tell Python which column to use as a join key (‘ID’).

The crosstab operation is enclosed within the merge operation. I could have separated these into multiple lines, which would have been less confusing, but my point is not to teach Python but to demonstrate how much you can accomplish with a trivial amount of typing. (Or copying-and-pasting, which works too!)

We’re not quite done, though. Our merged data is still full of duplicate IDs, because the Con_Code column is still present in our original data.

Line 2 deletes (“drops”) the entire column named Con_Code, and reassigns the altered DataFrame to “df” – essentially, replacing the original df with the new df created by the drop operation.

Now that Con_Code is gone, the “extra” rows are not just duplicates by ID, they are perfect duplicates across the entire row – there is nothing left to make two rows with the same ID unique. We are ready for the final step …

Line 3 deletes (or “drops”) every row that is a duplicate of a previous row.

Having accomplished this, another couple of lines near the end of the script (not shown) will write the data row by row into a new .csv file, which you can then import into your stats package of choice. If you had two dozen different constituent codes in your data, your new file will be wider by two dozen columns … all in the blink of an eye, without any need for Excel or any manual manipulation of the data.
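Those closing lines aren’t shown here, but with pandas they would presumably amount to little more than this (the output file name is just a placeholder):

# Continuing from the three lines above: write the finished DataFrame to a new .csv file.
df.to_csv('constituents_pivoted.csv', index=False)
print(str(len(df)) + " records written")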

Excel is perfectly capable of pivoting data like we see in the example, but for working with very large data sets and seamlessly merging the pivoted data back into the larger data file, I can’t think of a better tool than Python/pandas. As the data set gets bigger and bigger, the more need there is to stop working with it in tools that go to the extra work of DISPLAYING it. I suppose one of the beauties of Excel is that you can see the data as you are working on it. In fact, as I slowly built up my script, I repeatedly opened the .csv file in Excel just to have that visual inspection of the data to see that I was doing the right thing. But I inevitably reached the point at which the file was just too large for Excel to function smoothly. At 120,000 rows and 185 columns in a 90MB file, it was hardly Big Data – Excel could open the file no problem – but it was large enough that I wouldn’t want to do much filtering or messing with formulas.

On a quick first read, the code in the example above may seem impenetrable to a non-programmer (like me), but you don’t need to memorize a lot of functions and methods to write scripts in Python. Combing the Web for examples of what you want to do, using a lot of cut-and-paste, perhaps referring to a good book now and again – that’s all it takes, really.

That said, it does require time and patience. It took me many hours to cobble together my first script. I re-ran it a hundred times before I tracked down all the errors I made. I think it was worth it, though – every working piece of code is a step in the direction of saving untold hours. A script that works for one task often does not require much modification to work for another. (This cartoon says it all: Geeks and repetitive tasks.)

Beyond data preparation for predictive modelling, there are a number of directions I would like to go with Python, some of which I’ve made progress on already:

  • Merging data from multiple sources into data extract files for use in Tableau … With version 8.0 of the software comes the new Tableau API for building .tde files in Python. This was actually my first experiment with Python scripting. Using the TDE module and a combination of database queries and pandas DataFrames, you can achieve a high degree of automation for refreshing the most complex data sets behind your views and dashboards.
  • Exploring other modelling techniques besides my regular mainstay (regression) … I’ve long been intrigued by stuff such as neural networks, Random Forest, and so on, but I’ve been held back by a lack of time as well as some doubt that these techniques offer a significant improvement over what I’m doing now. Python gives ready access to many of these methods, allowing me to indulge my casual interest without investing a great deal of time. I am not a fan of the idea of automated modelling – the analyst should grasp what is going on in that black box. But I don’t see any harm in some quick-and-dirty experimentation, which could lead to solutions for problems I’m not even thinking of yet.
  • Taking advantage of APIs …. I’d like to try tapping into whatever social networking sites offer in the way of interfaces, and also programmatically access web services such as geocoding via Google.
  • Working with data sets that are too large for high-level applications such as Excel … I recently tried playing with two days’ worth of downloaded geocoded Twitter data. That’s MILLIONS of rows. You aren’t going to be using Excel for that.

I hope I’ve been able to transfer to you some of my enthusiasm for the power and potential of Python. I guess now you’ll be wondering how to get started. That’s not an easy question to answer. I could tell you how to download and install Python and an IDE (an integrated development environment, a user interface in which you can write, run, and debug your scripts), but beyond that, so much depends on what you want to do. Python has been extended in a great many directions – pandas for data analysis being just one of them.

However, it wouldn’t hurt to get a feel for how “core Python” works – that is, the central code base of the language along with its data types, object types, and basic operations such as “for” loops. Even before you bother installing anything, go to Codecademy.com and try a couple of the simple tutorials there.

For specific questions Google is your friend, but if you want a reference that covers all the basics in more or less plain English, I like “Learning Python” (4th Edition, but I see there’s a 5th Edition now) by Mark Lutz, published by O’Reilly. Another O’Reilly book, “Python for Data Analysis,” by Wes McKinney, describes how to crunch data with pandas and other related code libraries. (McKinney is the main author of the pandas library.)

I think readers new to programming (like me) will feel some frustration while learning to write their first scripts using any one book or resource. The Lutz book might seem too fine-grained in its survey of the basics for some readers, and McKinney is somewhat terse when offering examples of how various methods work. The problem is not with the books themselves – they’re wonderful. Consider that Python is used in web interfaces, robotics, database programming, gaming, financial markets, GIS, scientific programming, and probably every academic discipline that uses data – you must understand that core texts are perforce very general and abstract. (Think of grammar books for spoken languages.) It’s up to coders themselves to combine the basic building blocks in creative and powerful ways.

That said, after many, many hours spent hopping back and forth between these books, plus online tutorials and Python discussion forums – and just messing around on my own – I have figured out a few useful ways to accomplish some of the more common data preparation tasks that are specific to predictive modelling. Someday I would be happy to share – and, as always, to learn from the experience of others.
