CoolData blog

26 February 2014

Automate RFM scoring of your donors with this Python script

Filed under: Python, RFM — Tags: , , , — kevinmacdonell @ 9:20 pm

The next time you’re asked to refresh RFM scores for your donor base, tell them it’ll take you the better part of the day. Then fire up this script, output the scores in 30 seconds, and use the rest of your morning to work on something more interesting.

Unlike predictive modeling, which ideally requires an analyst’s good judgement in order to do well, RFM scoring lends itself to being completely automated. Based on a donor’s most recent five years of giving history, an RFM score is composed of three simple elements, which can each be scored 1 to 5. The highest score combination is 5-5-5.

  1. Recency: “5” if the donor’s most recent gift was last year, “4” if it was the year before that, and so on.
  2. Frequency: “5” if the donor gave in five out of five past years, “4” if the donor gave in four out of five, and so on.
  3. Monetary value: “5” if the donor’s total giving in the five years is in the top 20% of the donor file, “4” if total giving is in the next 20%, and so on.

This post follows naturally on my previous post, in which I showed how the PIVOT operator in an SQL statement can be used on donor data to arrange yearly giving totals into columns instead of rows, right at the level of the database query. (See Really swell SQL: Why you must know PIVOT and WITH.)

This Python script includes some sample SQL to pull giving data and pivot it — you will need to modify this SQL to match your schema and table names. If your database does not support PIVOT, then just pull the raw data and research how to pivot the data in Python. In fact, pivots and other data manipulations are much easier to do in Python than in SQL if you use a Python code library called pandas, designed specifically for working with data for analysis. Pandas has structures called DataFrames that are like database tables and are very intuitive to work with.

If you can’t connect directly to your database with this script, then you can just as easily read a .csv file into a DataFrame and work with that instead. The source file has to have one column for IDs, and five columns of yearly giving totals, with nulls for zero totals.

I am not going into a lot of detail about how each pandas function works. You can find as much information as you want via internet searches. (See another previous CoolData post, Getting bitten by Python.) I don’t promise that my code is elegant. Play with it, improve on it, and extend it to fit your needs.

In addition to comments in the body of the script below (anything after a “#”), here’s a limited explanation of what each section of the RFM calculation does:

RECENCY: The script creates five new columns. For each row of data, the field evaluates to True if the field is not null (i.e., if the donor has giving in the year being referenced). In Python, ‘True’ is the same as 1, and ‘False’ is the same as zero. If the donor had giving in the most recent year, the first new column will evaluate to True, and when I multiply ‘True’ by 5, the answer is 5. If no giving, the result is zero, and multiplying by 5 equals zero. The same goes for the next column: If the donor had giving in the year before, the result is True, and when I multiply by 4, the answer is 4. And so on, down to five years ago (‘1’). Then I create a sixth new column, which is populated by the maximum value found in the previous five columns — this ends up being the Recency portion of the score. Finally, I delete (or “drop”) the previous five columns, as they are no longer needed.

FREQUENCY: This part is way simpler … just a count of the number of non-null values in the five columns of yearly giving totals.

MONETARY VALUE: First, we sum on giving for all five years. Then these totals are chopped up into quintiles, ignoring any null totals. (Quintiles means five equal-sized groups, each making up 20% of the total number of records.) Each quintile is given a label (1 to 5), with the top quintile being a “5”.

That’s it. The results are saved to a .csv file, which looks like this:

output

 

 

# import required code modules.
import pandas as pd
import pandas.io.sql as psql
import cx_Oracle   # for querying an Oracle database
import os  # for reading and writing files to your hard drive

# Prompt user for database username and password.
username = raw_input('Enter username: ')
password = raw_input('Enter password: ')

# Pass text of SQL statement to variable called SQL, enclosed in triple quotes.
# (Note the pivot ... five years of giving data.)

SQL = '''
SELECT *

FROM (
 SELECT TABLE_A.ID,
 SUM ( TABLE_B.GIFT_AMT ) AS GIVING,
 TABLE_B.GIFT_FISC_CODE AS FY

 FROM
 TABLE_A,
 TABLE_B

 WHERE
 (TABLE_A.ID = TABLE_B.ID
  AND TABLE_B.GIFT_FISC_CODE IN (2010, 2011, 2012, 2013, 2014))

 GROUP BY
 TABLE_A.ID,
 TABLE_B.GIFT_FISC_CODE
 )

PIVOT (
  SUM ( GIVING ) AS total FOR ( FY )
  IN (2010 as FY2010, 2011 as FY2011, 2012 as FY2012, 2013 as FY2013, 2014 as FY2014)
  )

'''

# Connect, execute the SQL, and put data into a pandas DataFrame (df)
# "@dsn" (data source name) is the TNS entry (from the Oracle names server or tnsnames.ora file)
# "0000" is the port number

connectstr = username + '/' + password + '@dsn:0000'
connection = cx_Oracle.connect(connectstr)
df = psql.frame_query(SQL, con=connection) 

## RFM score

# RECENCY
# Create five new columns.
df['Recency5'] = df.FY2014_TOTAL.notnull() * 5
df['Recency4'] = df.FY2013_TOTAL.notnull() * 4
df['Recency3'] = df.FY2012_TOTAL.notnull() * 3
df['Recency2'] = df.FY2011_TOTAL.notnull() * 2
df['Recency1'] = df.FY2010_TOTAL.notnull() * 1

# Create a sixth column, setting it to the max value of the previous five columns.
df['Recency'] = df[['Recency5', 'Recency4', 'Recency3', 'Recency2', 'Recency1']].max(axis=1)

# Delete the five original columns - no longer needed.
df = df.drop(['Recency5', 'Recency4', 'Recency3', 'Recency2', 'Recency1'], axis=1)

# FREQUENCY

# Create new column, and set it to the number of non-null values in the giving totals columns.
df['Frequency'] = df[['FY2010_TOTAL', 'FY2011_TOTAL', 'FY2012_TOTAL', 'FY2013_TOTAL', 'FY2014_TOTAL']].count(axis=1)

# MONETARY VALUE

# Create new column, the sum of all giving years.
df['Giving_Total'] = df[['FY2010_TOTAL', 'FY2011_TOTAL', 'FY2012_TOTAL', 'FY2013_TOTAL', 'FY2014_TOTAL']].sum(axis=1, skipna=True)

# Break all giving totals into five quintiles (ignoring any null values), and store the label (1 to 5) in another new column.
quintile_bins = [1, 2, 3, 4, 5]
df['Monetary'] = pd.qcut(df['Giving_Total'], 5, labels = quintile_bins)

# Write results to a .csv file, and display number of saved records to user.
df.to_csv('RFM_results.csv')
print(str(len(df)) + " records output")
Advertisements

Create a free website or blog at WordPress.com.

%d bloggers like this: