CoolData blog

25 March 2014

An all-SQL way to automate RFM scoring

Filed under: RFM, SQL — kevinmacdonell @ 8:44 pm

In my last post I described how to automate RFM scoring using the scripting language Python and pandas. Automating RFM scoring is a great way to learn a bit of Python — but on its own it’s a lousy reason for learning Python. It was pointed out to me that you don’t really need to call on the power of Python if you’re already comfortable with (or learning to use) SQL, and I agree.

Shortly after publishing the Python post, I got an email from Bruce Higgins, who works in the Alumni Affairs & Development office at Cornell University. His all-SQL solution is shown below. I leave it to readers to figure out how to make it work for them. In the wake of my post there was some great discussion and code-sharing on the Prospect-DMM discussion list, and I encourage you to look up that thread.

You’ll notice that Bruce has hard-coded the dollar values used for scoring “Monetary value” instead of dynamically dividing the file into quintiles. As he points out, if you’re scoring the same database over and over, it’s not likely these dollar thresholds are going to change much over time. You will of course need to make adjustments for your own donor data.

WITH Five_years AS
  (SELECT id,
    SUM(gift_amount) AS cash,
    MAX(DECODE(gift_fiscalyear, 2014, 5, 2013, 4, 2012, 3, 2011, 2, 2010, 1)) AS recency,
    COUNT(DISTINCT gift_fiscalyear) AS frequency
   FROM
    (SQL specifying the last five years cash transactions from our warehouse)
   GROUP BY id)
SELECT Five_years.id,
  CASE
    WHEN Five_years.cash >= 10000 THEN 5
    WHEN Five_years.cash >= 2500 THEN 4
    WHEN Five_years.cash >= 1000 THEN 3
    WHEN Five_years.cash >= 500 THEN 2
    ELSE 1
  END + Five_years.recency + Five_years.frequency AS rfm_score
FROM Five_years
ORDER BY Five_years.id
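If you would rather derive the Monetary cutoffs from the data instead of hard-coding dollar thresholds, here is a minimal sketch of that dynamic alternative in pandas (toy totals, assumed data), using qcut to assign quintile labels:

```python
import pandas as pd

# Toy five-year giving totals, one per donor (assumed data).
cash = pd.Series([15000, 3000, 1200, 600, 50, 75, 800, 2600, 11000, 90])

# Divide the totals into five equal-sized groups; the top quintile gets label 5.
monetary = pd.qcut(cash, 5, labels=[1, 2, 3, 4, 5])
```

Because the thresholds come from the data, they adjust automatically as the donor file changes over time.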

26 February 2014

Automate RFM scoring of your donors with this Python script

Filed under: Python, RFM — kevinmacdonell @ 9:20 pm

The next time you’re asked to refresh RFM scores for your donor base, tell them it’ll take you the better part of the day. Then fire up this script, output the scores in 30 seconds, and use the rest of your morning to work on something more interesting.

Unlike predictive modeling, which ideally requires an analyst’s good judgement in order to do well, RFM scoring lends itself to being completely automated. Based on a donor’s most recent five years of giving history, an RFM score is composed of three simple elements, which can each be scored 1 to 5. The highest score combination is 5-5-5.

  1. Recency: “5” if the donor’s most recent gift was last year, “4” if it was the year before that, and so on.
  2. Frequency: “5” if the donor gave in five out of five past years, “4” if the donor gave in four out of five, and so on.
  3. Monetary value: “5” if the donor’s total giving in the five years is in the top 20% of the donor file, “4” if total giving is in the next 20%, and so on.
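As a quick sanity check, the first two components can be computed by hand for a single hypothetical donor (Monetary needs the whole file, since it depends on quintiles):

```python
# Yearly totals FY2010..FY2014 for one hypothetical donor; None = no gift.
giving = [None, None, 100, None, 250]

# Recency: 5 if the latest gift was in the most recent year, 4 the year before, etc.
recency = max(i + 1 for i, amt in enumerate(giving) if amt is not None)

# Frequency: count of years with a gift.
frequency = sum(1 for amt in giving if amt is not None)

print(recency, frequency)  # → 5 2
```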

This post follows naturally on my previous post, in which I showed how the PIVOT operator in an SQL statement can be used on donor data to arrange yearly giving totals into columns instead of rows, right at the level of the database query. (See Really swell SQL: Why you must know PIVOT and WITH.)

This Python script includes some sample SQL to pull giving data and pivot it — you will need to modify this SQL to match your schema and table names. If your database does not support PIVOT, then just pull the raw data and research how to pivot the data in Python. In fact, pivots and other data manipulations are much easier to do in Python than in SQL if you use a Python code library called pandas, designed specifically for working with data for analysis. Pandas has structures called DataFrames that are like database tables and are very intuitive to work with.

If you can’t connect directly to your database with this script, then you can just as easily read a .csv file into a DataFrame and work with that instead. The source file has to have one column for IDs, and five columns of yearly giving totals, with nulls for zero totals.
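For instance, here is a minimal sketch of the .csv route (the file contents are shown inline for illustration; in practice you would point pd.read_csv at your own export, with column headers matching your file):

```python
import io
import pandas as pd

# Hypothetical export: an ID column plus five yearly-total columns.
# Empty cells become NaN, which is what the script's notnull()/count() logic expects.
csv_text = """ID,FY2010_TOTAL,FY2011_TOTAL,FY2012_TOTAL,FY2013_TOTAL,FY2014_TOTAL
1001,50,,100,,250
1002,,,,25,25
"""

# In practice: df = pd.read_csv('your_export.csv')
df = pd.read_csv(io.StringIO(csv_text))
```

From here the DataFrame can be fed into the RFM steps below exactly as if it had come from the database query.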

I am not going into a lot of detail about how each pandas function works. You can find as much information as you want via internet searches. (See another previous CoolData post, Getting bitten by Python.) I don’t promise that my code is elegant. Play with it, improve on it, and extend it to fit your needs.

In addition to comments in the body of the script below (anything after a “#”), here’s a limited explanation of what each section of the RFM calculation does:

RECENCY: The script creates five new columns, one per year. For each row of data, a column evaluates to True if the donor has giving in the year being referenced (i.e., the yearly total is not null). In Python, True is equivalent to 1 and False to zero, so if the donor had giving in the most recent year, the first new column is True, and multiplying it by 5 gives 5; with no giving, the result is zero. The same goes for the next column: giving in the year before evaluates to True, and multiplying by 4 gives 4, and so on down to five years ago (multiplied by 1). A sixth new column then takes the maximum value found in the previous five columns, which becomes the Recency portion of the score. Finally, the five intermediate columns are deleted (or "dropped"), as they are no longer needed.

FREQUENCY: This part is way simpler … just a count of the number of non-null values in the five columns of yearly giving totals.

MONETARY VALUE: First, we sum giving over all five years. Then these totals are divided into quintiles, ignoring any null totals. (Quintiles are five equal-sized groups, each making up 20% of the total number of records.) Each quintile is given a label (1 to 5), with the top quintile being a "5".

That’s it. The results are saved to a .csv file, which looks like this:

[screenshot: sample rows of the output .csv]

# import required code modules.
import pandas as pd
import pandas.io.sql as psql
import cx_Oracle   # for querying an Oracle database
import os  # for reading and writing files to your hard drive

# Prompt user for database username and password.
# (raw_input is Python 2; in Python 3, use input() instead.)
username = raw_input('Enter username: ')
password = raw_input('Enter password: ')

# Pass text of SQL statement to variable called SQL, enclosed in triple quotes.
# (Note the pivot ... five years of giving data.)

SQL = '''
SELECT *

FROM (
 SELECT TABLE_A.ID,
 SUM ( TABLE_B.GIFT_AMT ) AS GIVING,
 TABLE_B.GIFT_FISC_CODE AS FY

 FROM
 TABLE_A,
 TABLE_B

 WHERE
 (TABLE_A.ID = TABLE_B.ID
  AND TABLE_B.GIFT_FISC_CODE IN (2010, 2011, 2012, 2013, 2014))

 GROUP BY
 TABLE_A.ID,
 TABLE_B.GIFT_FISC_CODE
 )

PIVOT (
  SUM ( GIVING ) AS total FOR ( FY )
  IN (2010 as FY2010, 2011 as FY2011, 2012 as FY2012, 2013 as FY2013, 2014 as FY2014)
  )

'''

# Connect, execute the SQL, and put data into a pandas DataFrame (df)
# "@dsn" (data source name) is the TNS entry (from the Oracle names server or tnsnames.ora file)
# "0000" is the port number

connectstr = username + '/' + password + '@dsn:0000'
connection = cx_Oracle.connect(connectstr)
# frame_query was the pandas API at the time of writing; newer pandas versions use pd.read_sql.
df = psql.frame_query(SQL, con=connection)

## RFM score

# RECENCY
# Create five new columns.
df['Recency5'] = df.FY2014_TOTAL.notnull() * 5
df['Recency4'] = df.FY2013_TOTAL.notnull() * 4
df['Recency3'] = df.FY2012_TOTAL.notnull() * 3
df['Recency2'] = df.FY2011_TOTAL.notnull() * 2
df['Recency1'] = df.FY2010_TOTAL.notnull() * 1

# Create a sixth column, setting it to the max value of the previous five columns.
df['Recency'] = df[['Recency5', 'Recency4', 'Recency3', 'Recency2', 'Recency1']].max(axis=1)

# Delete the five original columns - no longer needed.
df = df.drop(['Recency5', 'Recency4', 'Recency3', 'Recency2', 'Recency1'], axis=1)

# FREQUENCY

# Create new column, and set it to the number of non-null values in the giving totals columns.
df['Frequency'] = df[['FY2010_TOTAL', 'FY2011_TOTAL', 'FY2012_TOTAL', 'FY2013_TOTAL', 'FY2014_TOTAL']].count(axis=1)

# MONETARY VALUE

# Create new column, the sum of all giving years.
df['Giving_Total'] = df[['FY2010_TOTAL', 'FY2011_TOTAL', 'FY2012_TOTAL', 'FY2013_TOTAL', 'FY2014_TOTAL']].sum(axis=1, skipna=True)

# Break all giving totals into five quintiles (ignoring any null values), and store the label (1 to 5) in another new column.
quintile_labels = [1, 2, 3, 4, 5]
df['Monetary'] = pd.qcut(df['Giving_Total'], 5, labels=quintile_labels)

# Write results to a .csv file, and display number of saved records to user.
df.to_csv('RFM_results.csv')
print(str(len(df)) + " records output")

16 July 2013

Alumni engagement scoring vs. predictive modelling

Filed under: Alumni, engagement, predictive modeling — kevinmacdonell @ 8:06 am

Alumni engagement scoring has an undeniable appeal. What could be simpler? Just add up how many events an alum has attended, add more points for volunteering, add more points for supporting the Annual Fund, and maybe some points for other factors that seem related to engagement, and there you have your score. If you want to get more sophisticated, you can try weighting each score input, but generally engagement scoring doesn’t involve any advanced statistics and is easily grasped.

Not so with predictive modelling, which does involve advanced stats and isn’t nearly as intuitive; often it’s not possible to really say how an input variable is related to the outcome. It’s tempting, too, to think of an engagement score as being a predictor of giving and therefore a good replacement for modelling. Actually, it should be predictive — if it isn’t, your score is not measuring the right things — but an engagement score is not the same thing as a predictive model score. They are different tools for different jobs.

Not only are engagement scoring schemes different from predictive models, their simplicity is deceptive. Engagement scoring is incomplete without some plan for acting on observed trends with targeted programming. This implies the ability to establish causal drivers of engagement, which is a tricky thing.

Scoring, acting on the results, and scoring again is a sequence of events, not a one-time thing. In fact, engagement scoring is like checking the temperature at regular intervals over a long period of time, looking for up and down trends not just for the group as a whole but also in comparisons of important subgroups defined by age, sex, class year, college, degree program, geography or other divisions. This requires discipline: taking measurements in exactly the same way every year (or quarter, or what-have-you). If the score is fed by a survey component, you must survey constantly and consistently.

Predictive models and engagement scores have some surface similarities. They share variables in common, the output of both is a numerical score applied to every individual, and both require database work and math in order to calculate them. Beyond that, however, they are built in different ways and for different purposes. To summarize:

  • Predictive models are collections of potentially dozens of database variables weighted according to strength of correlation with a well-defined behaviour one is trying to predict (e.g., making a gift), in order to rank individuals by likelihood to engage in that behaviour. Both Alumni Relations and Development can benefit from the use of predictive models.
  • Engagement scores are collections of a very few selectively-chosen database variables, either not weighted or weighted according to common sense and intuition, in order to roughly quantify the quality of “engagement”, however one wishes to define that term, for each individual. The purpose is to allow comparison of groups (faculties, age bands, geographical regions, etc.) with each other. Comparisons may be made at one point in time, but it is more useful to compare relative changes over time. The main user of scores is Alumni Relations, in order to identify segments requiring targeted programming, for example, and to assess the impact of programming on targeted segments over time.

Let’s explore key differences in more depth:

The purpose of modelling is prediction, for ranking or segmentation. The purpose of engagement scoring is comparison.

Predictive modelling scores are not usually included in reports. Used immediately in decision making, they may never be seen by more than one or two people. Engagement scores are included in reports and dashboards, and influence decision-making over a long span of time.

The target variable of a predictive model is quantifiable (e.g., giving, measurable in dollars). In engagement scoring, there is no target variable, only an output: a construct called "engagement", which itself is not directly measurable.

Potential input variables for predictive models are numerous (100+) and vary from model to model. Input variables for engagement scores are limited to a handful of easily measured attributes (giving, event attendance, volunteering) which must remain consistent over time.

Variables for predictive models are chosen primarily using statistical methods (correlation) and only secondarily using judgment and “common sense.” For example, if the presence of a business phone number is highly correlated with being a donor, it may be included in the model. For engagement scores, variables are chosen by consensus of stakeholders, primarily according to subjective standards. For example, event attendance and giving would probably be deemed by the committee to indicate engagement, and would therefore be included in the score. Advanced statistics rarely come into play. (For more thoughts on this, read How you measure alumni engagement is up to you.)

In predictive models, giving and variables related to the activity of giving are usually excluded as variables (if ‘giving’ is what we are trying to predict). Using any aspect of the target variable as an input is bad practice in predictive modelling and is carefully avoided. You wouldn’t, for example, use attendance at a donor recognition event to predict likelihood to give. In engagement scoring, though, giving history is usually a key input, as it is common sense to believe that being a donor is an indication of engagement. (It might be excluded or reported separately if the aim is to demonstrate the causal link between engagement indicators and giving.)

Modelling variables are weighted using multiple linear regression or another statistical method that calculates the relative influence of each variable while simultaneously controlling for the influence of all other variables in the model. Engagement score variables are usually weighted according to gut feel. For example, coming to campus for Homecoming seems to carry more weight than showing up for a pub night in one's own city, therefore we give it more weight.

The quality of a predictive model is testable, first against a validation data set, and later against actual results. But there is no right or wrong way to estimate engagement, therefore the quality of scores cannot be evaluated conclusively.

The variables in a predictive model have complex relationships with each other that are difficult or impossible to explain except very generally. Usually there is no reason to explain a model in detail. The components in an engagement score, on the other hand, have plausible (although not verifiable) connections to engagement. For example, volunteering is indicative of engagement, while Name Prefix is irrelevant.

Predictive models are built for a single, time-limited purpose and then thrown away. They evolve iteratively and are ever-changing. On the other hand, once established, the method for calculating an engagement score must not change if comparisons are to be made over time. Consistency is key.

Which is all to say: alumni engagement scoring is not predictive modelling. (And neither is RFM analysis.) Only predictive modelling is predictive modelling.

Blog at WordPress.com.