CoolData blog

5 December 2016

Amazing things with matching strings

Filed under: Coolness, Data integrity, SQL — Tags: , , , — kevinmacdonell @ 7:44 am

 

I had an occasion recently when it would have been really helpful to know that a new address added to the database was a duplicate of an older, inactivated address. The addition wasn’t identified as a duplicate because it wasn’t a perfect match — a difference similar to that between 13 Anywhere Road and 13 Anywhere Drive. 

 

After the fact, I did a Google search and discovered some easy-to-use functionality in Oracle SQL that might have saved us some trouble. Today I want to talk about how to use UTL_MATCH and  suggest some cool applications for it in Advancement data work.

 

“Fuzzy matching” is the term used for identifying pairs of character strings that may not be exactly the same, but are so close that they could be. For example, “Washignton” is one small typo away from “Washington,” but the equivalence is very difficult to detect by any means other than an alert pair of human eyes scanning a sorted list. When the variation occurs at the beginning of a string — “Unit 3, 13 Elm St.” instead of “Apmt 3, 13 Elm St.” — then even a sorted list is of no use.

 

According to this page, the UTL_MATCH package was introduced in Oracle 10g Release 2, but first documented and supported in Oracle 11g Release 2. The package includes two functions for testing the level of similarity or difference between strings.

 

The first function is called EDIT_DISTANCE, which is a count of the number of “edits” to get from one string to a second string. For example, the edit distance from “Kevin” to “Kelvin” is 1, for “New York” to “new york” is 2, and from “Hello” to “Hello” is 0. (A related function, EDIT_DISTANCE_SIMILARITY, expresses the distance as a normalized value between 0 and 100 — 100 being a perfect match.)

 

The second method, the one I’ve been experimenting with, is called JARO_WINKLER, named for an algorithm that measures the degree of similarity between two strings. The result ranges between 0 (no similarity) to 1 (perfect similarity). It was designed specifically for detecting duplicate records, and its formula seems aimed at the kind of character transpositions you’d expect to encounter in data entry errors. (More info here: Jaro-Winkler distance.)

 

Like EDIT_DISTANCE, it has a related function called JARO_WINKLER_SIMILARITY. Again, this ranges from 0 (no match) to 100 (perfect match). This is the function I will refer to for the rest of this post.

 

Here is a simple example of UTL_MATCH in action. The following SQL scores constituents in your database according to how similar their first name is to their last name, with the results sorted in descending order by degree of similarity. (Obviously, you’ll need to replace “schema”, “persons,” and field names with the proper references from your own database.)

 

SELECT

t1.ID,

t1.first_name,

t1.last_name,

UTL_MATCH.jaro_winkler_similarity(t1.first_name, t1.last_name) AS jw

FROM schema.persons t1

ORDER BY jw DESC

 

Someone named “Donald MacDonald” would get a fairly high value for JW, while “Kevin MacDonell” would score much lower. “Thomas Thomas” would score a perfect 100.

 

Let’s turn to a more useful case: Finding potential duplicate persons in your database. This entails comparing a person’s full name with the full name of everyone else in the database. To do that, you’ll need a self-join.

 

In the example below, I join the “persons” table to itself. I concatenate first_name and last_name to make a single string for the purpose of matching. In the join conditions, I exclude records that have the same ID, and select records that are a close or perfect match (according to Jaro-Winkler). To do this, I set the match level at some arbitrary high level, in this case greater than or equal to 98.

 

SELECT

t1.ID,

t1.first_name,

t1.last_name,

t2.ID,

t2.first_name,

t2.last_name,

UTL_MATCH.jaro_winkler_similarity ( t1.first_name || ' ' || t1.last_name, t2.first_name || ' ' || t2.last_name ) AS jw

FROM schema.persons t1

INNER JOIN schema.persons t2 ON t1.ID != t2.ID AND UTL_MATCH.jaro_winkler_similarity ( t1.first_name || ' ' || t1.last_name, t2.first_name || ' ' || t2.last_name ) >= 98

ORDER BY jw DESC

 

I would suggest reading this entire post before trying to implement the example above! UTL_MATCH presents some practical issues which limit what you can do. But before I share the bad news, here are some exciting possible Advancement-related applications:

 

  • Detecting duplicate records via address matching.
  • Matching external name lists against your database. (Which would require the external data be loaded into a temporary table in your data warehouse, perhaps.)
  • Screening current and incoming students against prospect, donor, and alumni records for likely matches (on address primarily, then perhaps also last name).
  • Data integrity audits. An example: If the postal code or ZIP is the same, but the city name is similar (but not perfectly similar), then there may be an error in the spelling or capitalization of the city name.
  • Searches on a particular name. If the user isn’t sure about spelling, this might be one way to get suggestions back that are similar to the guessed spelling.

 

Now back to reality … When you run the two code examples above, you will probably find that the first executes relatively quickly, while the second takes a very long time or fails to execute at all. That is due to the fact that you’re evaluating each record in the database against every other record. This is what’s known as a cross-join or Cartesian product — a very costly join which is rarely used. If you try to search for matches across 100,000 records, that’s 10 billion evaluations! The length of the strings themselves contributes to the complexity, and therefore the runtime, of each evaluation — but the real issue is the 10,000,000,000 operations.

 

As intriguing as UTL_MATCH is, then, its usage will cause performance issues. I am still in the early days of playing with this, but here are a few things I’ve learned about avoiding problems while using UTL_MATCH.

 

Limit matching records. Trying to compare the entire database with itself is going to get you in trouble. Limit the number of records retrieved for comparison. A query searching for duplicates might focus solely on the records that have been added or modified in the past day or two, for example. Even so, those few records have to be checked against all existing records, so it’s still a big job — consider not checking against records that are marked deceased, that are non-person entities, and so on. Anything to cut down on the number of evaluations the database has to perform.

 

Keep strings short. Matching works best when working with short strings. Give some thought to what you really want to match on. When comparing address records, it might make sense to limit the comparison to Street Line 1 only, not an entire address string which could be quite lengthy.

 

Pre-screen for perfect matches: A Jaro-Winkler similarity of 100 means that two strings are exactly equal. I haven’t tested this, but I’m guessing that checking for A = B is a lot faster than calculating the JW similarity between A and B. It might make sense to have one query to audit for perfect matches (without the use of UTL_MATCH) and exclude those records from a second query that audits for JW similarities that are high but less than a perfect 100.

 

Pre-screen for impossible matches. If a given ID_1 has a street address than is 60 characters long and a given ID_2 has a street address that is only 20 characters long, there is no possibility of a high Jaro-Winkler score and therefore no need to calculate it. Find a way to limit the data set to match before invoking UTL_MATCH, possibly through the use of a WITH clause that limits potential matching pairs by excluding any that differ in length by more than, say, five characters. (Another “pre-match” to use would check if the initial letter in a name is the same; if it isn’t, good chance it isn’t going to be a match.)

 

Keep match queries simple. Don’t ask for fields other than ID and the fields you’re trying to match on. Yes, it does make sense to bring down birthdate and additional address information so that the user can decide if a probable match is a true duplicate or not, but keep that part of the query separate from the match itself. You can do this by putting the match in a WITH clause, and then left-joining additional data to the results of that clause.

 

Truth be told, I have not yet written a query that does something useful while still executing in a reasonable amount of time, simply due to the sheer number of comparisons being made. I haven’t given up on SQL, but it could be that duplicate detection is better accomplished via a purpose-built script running on a standalone computer that is not making demands on an overburdened database or warehouse (aside from the initial pull of raw data for analysis).

 

The best I’ve done so far is a query that selects address records that were recently modified and matches them against other records in the database. Before it applies Jaro-Winkler, the query severely limits the data by pairing up IDs that have name strings and address strings that are nearly the same number of characters long. The query has generated a few records to investigate and, if necessary, de-dupe — but it takes more than an hour and half to run.

 

Have any additional tips for making use of UTL_MATCH? I’d love to hear and share. Email me at kevin.macdonell@gmail.com.

 

Advertisements

13 June 2016

Nifty SQL regression to calculate donors’ giving trends

Filed under: Coolness, Predictor variables, regression, SQL — Tags: , , , — kevinmacdonell @ 8:28 pm

 

Here’s a nifty bit of SQL that calculates a best-fit line through a donor’s years of cash-in giving by fiscal year (ignoring years with no giving), and classifies that donor in terms of how steeply they are “rising” or “falling”.

 

I’ll show you the sample code, which you will obviously have to modify for your own database, and then talk a little bit about how I tested it. (I know this works in Oracle version 11g. Not sure about earlier versions, or other database systems.)

 

with sums AS (
 select t1.id, t1.fiscal_year, log(10, sum(t1.amount)) AS yr_sum
 from gifts t1
 group by t1.id, t1.fiscal_year),

slopes AS (
 select distinct
 sums.id,
 regr_slope(sums.yr_sum,sums.fiscal_year) OVER (partition by sums.id) AS slope

from sums
 )

select
 slopes.id,
 slopes.slope,
 CASE
 when slopes.slope is null then 'Null'
 when slopes.slope >=0.1 then 'Steeply Rising'
 when slopes.slope >=0.05 then 'Moderately Rising'
 when slopes.slope >=0.01 then 'Slightly Rising'
 when slopes.slope >-0.01 then 'Flat'
 when slopes.slope >-0.05 then 'Slightly Falling'
 when slopes.slope >-0.1 then 'Moderately Falling'
 else 'Steeply Falling' end AS description

from slopes
That’s it. Not a lot of SQL, and it runs very quickly (for me). But does it actually tell us anything?

 

I devised a simple test. Adapting this query, I calculated the “slope of giving” for all donors over a five-year period in the past: FY 2007 to FY 2011. I wanted to see if this slope could predict whether, and by how much, a donor’s giving would rise or fall in the next five-year period: FY 2012 to FY 2016. (Note that the sum of a donor’s giving in each year is log-transformed, in order to better handle outlier donors with very large giving totals.)

 

I assembled a data file with each donor’s sum of cash giving for the first five-year period, the slope of their giving in that period, and the sum of their cash giving for the five-year period after that.

 

The first test was to see how the categories of slope, from Steeply Rising to Steeply Falling, translated into subsequent rises and falls. In Data Desk, I compared the two five-year periods. If the second period’s giving was greater than the first, I called that a “rise.” If it was less, I called it a “fall.” And if it was exactly the same, I called it “Same.”

 

The table below summarizes the results. Note that these numbers are all percentages, summed horizontally. (I will explain the colour highlighting later on.)

 

contingency1

 

For Steeply Rising, 60.6% of donors actually FELL from the first period to the next. Only 37.8 percent rose, and just 1.6% stayed exactly the same. Not terribly impressive. Look at Steeply Falling, though: More than three-quarters actually did fall. That’s a better result, but then again, “Falling” dominates for every category; in the whole file, close to 70% of all donors reduced their giving in the next period. If a donor has no giving in the second period of five years, that’s zero dollars given, and this is called a “Fall” — more on that aspect in just a sec.

 

(I’ve left out donors with a FY2007-11 slope of Null — they’re the ones who gave in only one year and therefore don’t have a “slope”.)

 

Let’s not give up just yet, however. The colour highlighting indicates how high each percentage value is in relation to those above and below it. For example, the highest percentages in the Falling column are found in the Slightly, Moderately, and especially Steeply Falling slope categories. The highest percentages in the Rising column are in the Slightly, Moderately, and Steeply Rising slope categories. And in the Same column, the Flat slope wins hands-down — as we would hope.

 

So a rising slope “sort of” predicts increased giving, a falling slope “sort of” predicts decreased giving. Unfortunately, many donors are not retained into the second five-year period, so there’s not a lot to be confident about.

 

But what if a donor IS retained? What if we exclude the lapsed donors entirely? Let’s do that:

 

contingency2

 

Excluding non-donors seems to lead to an improvement … The slope does a better job sorting between the risers and fallers when a donor is actually retained. Again, the colour highlighting is referencing columns, not rows. But notice now that, across the rows, Rising has a slight majority for the Rising slope categories, and Falling has a slight majority for the Falling slope categories. (The bar is set too high for Flat, however, given that a donor’s giving in the first five years has to be exactly equal to her giving in the second five years to be called Same.)

 

Admittedly, these majorities are not generous. If I calculated a donor’s slope of giving as Steeply Rising and that donor was retained, I have only a 56.4% chance of actually being right. And of course there’s no guarantee that donor won’t lapse.

 

(Note that these are donors of all types — alumni, non-alumni individuals, and entities such as corporations and foundations. Non-alumni donors tend not to have patterns in their giving that are repeated, not to the extent that alumni do. However, when I limit the data file to alumni donors only, the improvement in this method is only very slight.)

 

Pressing on … I did a regression analysis using total giving in the second five-year period as the dependent variable, then entered total giving in the prior five-year period as an independent variable. (Naturally, R-squared was very high.) This allowed me to see if Slope provides any explanatory power when it is added as the second independent variable — the effect of giving in the first five-year period already being accounted for.

 

And the answer is, yes, it does. But only under specific conditions: Both five-year giving totals were log-transformed and, most significantly, donors who did not give in the second period were excluded from the regression.

 

There are other way to assess the usefulness of “slope” which might lead to an application, and I encourage you to give this a try with your own data. From past experience I know that donors who make big upgrades in giving don’t have any neat universal pattern such as an upward slope in their giving history. (The concept of volatility is explored here and here.) “Slope” is probably too simple a characteristic to employ on its own.

 

But as I’ve said before, if it were easy, obvious, or intuitive, it wouldn’t be data analysis.

 

20 July 2014

Eliminate your duplicate data row problems with simple SQL

Filed under: Data, SQL — Tags: , , , , , , , — kevinmacdonell @ 1:38 pm

We’ve all faced this problem: Duplicate rows in our data. It messes up our reports and causes general confusion and consternation for people we supply data to. The good news is you can take care of this problem fairly easily in an SQL query using a handy function called LISTAGG. I discovered this almost by accident just this week, and I’m delighted to share.

Duplicate rows are a result of what we choose to include in our database query. If we are expecting to get only one row for each constituent ID, then we should be safe in asking for Gender, or Name Prefix, or Preferred Address. There should be only one of each of these things for each ID, and therefore there won’t be duplicate rows. These are examples of one-to-one relationships.

We get into trouble when we ask for something that can have multiple values for a single ID. That’s a one-to-many relationship. Examples: Any address (not just Preferred), Degree (some alumni will have more than one), Category Code (a person can be an alum and a parent and a staff member, all at the same time) … and so on. Below, the first constituent is duplicated on Category, and the second constituent is duplicated on Degree.

dup1

That’s not the worst of it; when one ID is duplicated on two or more elements, things get really messy. If A0001 above had three Category codes and two degrees, it would be appear six times (3 x 2) in the result set. Notice that we are really talking about duplicate IDs — not duplicate rows. Each row is, in fact, unique. This is perfectly correct behaviour on the part of the database and query. Try explaining that to your boss, however! You’ve got to do something to clean this up.

What isn’t going to work is pivoting. In SQL, the PIVOT function creates a new column for each unique category of some data element, plus an aggregation of the values for that category. I’ve written about this here: Really swell SQL: Why you must know PIVOT and WITH. Using PIVOT is going to give you something like the result below. (I’ve put in 1’s and nulls to indicate the presence or absence of data.)

dup2

 

What we really want is THIS:

dup3

The answer is this: String aggregation. You may already be familiar with aggregating quantitative values, usually by counting or summing. When we query on a donor’s giving, we usually don’t want a row for each gift — we want the sum of all giving. Easy. For strings — categorical data such as Degree or Category Code — it’s a different situation. Sure, we can aggregate by using counting. For example, I can ask for a count of Degree, and that will limit the data to one row per ID. But for most reporting scenarios, we want to display the data itself. We want to know what the degrees are, not just how many of them there are.

The following simplified SQL will get you the result you see above for ID and Category — one row per ID, and a series of category codes separated by commas. This code works in Oracle 11g — I can’t speak to other systems’ implementations of SQL. (For other Oracle-based methods, check out this page. To accomplish the same thing in a non-Oracle variant of SQL called T-SQL, scroll down to the Postscript below.)

Obviously you will need to replace the sample text with your own schema name, table names, and field identifiers.

SELECT
SCHEMA.ENTITY.ID,
LISTAGG ( SCHEMA.ENTITY.CATG_CODE, ', ' )
WITHIN GROUP ( ORDER BY SCHEMA.ENTITY.CATG_CODE ) AS CATEGORY

FROM SCHEMA.ENTITY

GROUP BY SCHEMA.ENTITY.ID

The two arguments given to LISTAGG are the field CATG_CODE and a string consisting of a comma and a space. The string, which can be anything you want, is inserted between each Category Code. On the next line, ORDER BY sorts the category codes in alphabetical order.

LISTAGG accepts only two arguments, but if you want to include two or more fields and/or strings in the aggregation, you can just concatenate them inside LISTAGG using “pipe” characters (||). The following is an example using Degree Code and Degree Code Description, with a space between them. To get even fancier, I have replaced the comma separator with the character code for “new line,” which puts each degree on its own line — handy for readability in a list or report. (This works fine when used in Custom SQL in Tableau. For display in Excel, you may have to use a carriage return — char(13) — instead of a line feed.) You will also notice that I have specified a join to a table containing the Degree data.

SELECT
SCHEMA.ENTITY.ID,
LISTAGG ( SCHEMA.DEGREE.DEGC_CODE || ' ' || SCHEMA.DEGREE.DEGC_DESC, chr(10) )
WITHIN GROUP ( ORDER BY SCHEMA.DEGREE.DEGC_CODE, SCHEMA.DEGREE.DEGC_DESC ) AS DEGREES

FROM SCHEMA.ENTITY

INNER JOIN 
SCHEMA.DEGREE 
ON ( SCHEMA.ENTITY.ID = SCHEMA.DEGREE.ID )

GROUP BY SCHEMA.ENTITY.ID

dup_final

Before I discovered this capability, the only other automated way I knew how to handle it was in Python. All the previous methods I’ve used were at least partially manual or just very cumbersome. I don’t know how long string aggregation in Oracle has been around, but I’m grateful that a data problem I have struggled with for years has finally been dispensed with. Hurrah!

~~~~~~

POSTSCRIPT

After publishing this I had an email response from David Logan, Director of Philanthropic Analytics at Children’s Mercy in Kansas City, Missouri. He uses TSQL, a proprietary procedural language used by Microsoft in SQL Server. Some readers might find this helpful, so I will quote from his email:

I was reading your post this morning about using the LISTAGG function in Oracle SQL to concatenate values from duplicate rows into a list. A couple of months ago I was grappling with this very problem so that I could create a query of Constituents that included a list of all their phone numbers in a single field. LISTAGG is not available for those of us using T-SQL … the solution for T-SQL is to use XML PATH().

Here’s a link to a source that demonstrates how to use it: http://sqlandme.com/2011/04/27/tsql-concatenate-rows-using-for-xml-path/.

One key difference in this method from LISTAGG is that the delimiter character (such as the comma—or characters as in your example where you have comma space) is added before each value returned so the raw string returned will have this character before the first value. You can use the STUFF() function to eliminate this first delimiter.

Here’s my solution using XML PATH() to return a list of phone numbers from our donor database:

SELECT
CA.ID,
STUFF
(
  (
  SELECT
   '; ' + PH.NUM + ' [' + re7_ReportingTools.dbo.GetLongDescription(PH.PHONETYPEID) + ']'
  FROM
   re7_db.dbo.CONSTIT_ADDRESS_PHONES CAP
    INNER JOIN
     re7_db.dbo.PHONES PH
     ON CAP.PHONESID=PH.PHONESID
  WHERE
     CAP.CONSTITADDRESSID=CA.ID
    AND
     PH.DO_NOT_CALL=0
  ORDER BY CAP.SEQUENCE
  FOR XML PATH('')
  ),1,2,''
) PHONENUMS
FROM
re7_db.dbo.CONSTIT_ADDRESS CA
 
GROUP BY CA.ID

I concatenated my list using a semi-colon followed by a space ‘; ‘. I used the STUFF() function to remove the leading semi-colon and space.

Here are a few sample rows that are returned (note I’ve “greeked” the prefixes for donor privacy):

table

16 May 2014

Find out “who knows who” in your database with SQL and Python

Filed under: Fun, Prospect identification, Python, SQL — Tags: , , , , — kevinmacdonell @ 12:30 pm

Figuring out connections is a big part of major gift fundraising. A development office might have a great list of prospective donors in hand, but those prospects will languish untouched if the right people who can make introductions are never found. While cultivating relationships with well-connected people is clearly important, research also plays a role in suggesting possible go-betweens. And to the extent that research involves sifting through data to find connections, it’s possible to automate some of the work.

Wouldn’t it be great to have software that asks you to name a prospective donor, and then spits out a list of people in your database who are most likely to be associates of that person? Sure it would!

Social media applications do a version of this, suggesting “people you may know,” based on shared connections and attributes such as schools attended, professional group memberships, employment information, and stated interests. I first encountered a higher-ed advancement application of this concept via Erich Preisendorfer of the University of New Hampshire. A while back, Erich explained his approach in a guest post for CoolData. (Read it here: Finding connections to your major gift prospects in your data.) I was recently in touch with Erich and he reports that the tool is still in use by prospect researchers at UNH and is still turning up valuable connections.

I tried to build something similar using our reporting tools, but it was a slow-lumbering beast, and not very user-friendly for the people most likely to be interesting in using it — prospect researchers, mainly. I never gave up on the idea of building something better, and finally on a quiet weekend a few months ago, I found two or three hours to mess around with a Python script. Given the ID of a person, this script makes multiple queries of the database and joins the results together to produce a ranked list of other people who might be associated with the input ID.

When the script is run, it prompts the user for an ID, then an Oracle database password. I’ll get into more detail later, but here is what the output looks like in the Python shell of IDLE, the development tool that I used to write the script, when I use my own ID as the one to find connections for. (I have blanked out the ID and password.)

connections_output copy

The script goes through these steps:

  1. It takes the ID provided by the user and queries the database to display the name associated with that ID (that would be me, in this example).
  2. It checks to see what events I attended, and produces a list (in memory) of all the IDs of constituents who attended those same events.
  3. It checks to see what activity codes I have attached to my record — varsity sports teams, student clubs and so on. I have none, but if I did, the script would return a list of all the people who shared the same activity codes. Again, this is saved in memory for use later on.
  4. The same routine is followed for employers (past and present), all class years, degrees, referring high school, cities extracted from mailing addresses (again, all past and present addresses), age, and database cross-references.
  5. All of these lists are combined into a single “object” in memory. The number of instances of each distinct ID is counted up. Every ID retrieved is a “hit”. The idea is, the more hits for any one retrieved ID, the more likely that ID belongs to someone who knows me (or knows the person who we are searching on).
  6. Because the resulting list is huge (more than 31,000 records just for me — and I have very little data attached to my record), the script asks the user to limit by a minimum number of hits. I’ve entered 3 here.
  7. The result is a file with 47 IDs. The script then does a final query on the database to retrieve the names and other details for these people, and saves the results as a .csv file that can be opened in Excel.

Here are the top few rows of the output. I have removed some columns for simplicity, and blanked out some personal information. The count_sum column is the “hit count”, and the list is sorted in descending order by count_sum. The person being searched on is always at the top of the list. This is not an especially impressive example, because I am not an alumnus. My list of likely connections is made up of co-workers and people from my hometown who happen to be the same age as me. The woman in the third row of the data was someone I had a crush on in junior high school but haven’t seen in 27 years … If I were a prospective donor, this search would be a bust. Clearly this method requires quality alumni data in order to be useful.

 

result2

That’s how it functions. We will dig into the actual code shortly. But first, let me say that my primary aim is to share this with you as an idea. I can’t help you with installing Python or debugging or any of that. All I can do is show you my code and hope you can modify it to make it work for you. If you think this is cool and useful but don’t want to learn any code, I suggest you team up with someone from IT or your Advancement Services team, or a student or prof in the Computing Science department.

As I said, I’m using Python and a code library for handling data called pandas. (For some background, see my earlier posts: Getting bitten by Python and Automate RFM scoring of your donors with this Python script.) The great thing about pandas is that you can load data into in-memory structures called DataFrames, which are like data tables in that you can perform joins and sort, rank, de-dupe, add calculated columns, and so on.

A few closing comments before I show you the code:

  • Erich at UNH goes the extra step of matching on date range along with employer. I think a shared corporate connection is worth a point, but being at the same place of employment at the same time is obviously worth a lot more. If you have other ideas for things to match on, I’d love to hear them.
  • Those of you who aren’t into Python but know SQL can probably make an all-SQL version of this. I just like Python.
  • In the code, I have inserted comments to help guide you through. Everything to the right of a “#” is a comment.
  • The script includes multiple individual queries. You may wish to lump them together using UNION ALL, appending all the results sets together right in the body of the SQL. Up to you.
  • The queries that pull records based on shared events, degrees and so on all have the same basic structure: First, a WITH clause gathers the codes for the events (or degrees or cities) that the target ID has attached to his or her record, then a SELECT statement finds all the other IDs who have those same codes attached to their records.
  • In order to break the SQL up into multiple lines, I have enclosed the string between groups of three single quotes. You’ll see that whenever I have to insert a variable into the SQL (such as ‘target_pidm’), I have to use what looks like a confusing jumble of single and double quotes — but the ordering is very precise.
  • I was going to simplify the SQL for you, but didn’t see the point. One way or the other, you will still have to modify it. So it’s much the same as what I actually use. You’ll also notice that I tend not to use aliases for tables and views, which makes the queries text-heavy. Sorry.
  • One limitation of this tool is that it will run only on a machine that has a Python installation. For that reason I haven’t been able to simply give it to someone (in Prospect Research, say) and have them run it themselves. The next step might be bundling it as a stand-alone executable file (an .exe) that could be shared with others, but I haven’t gotten that to work yet.

AND, one final, final comment: This is probably the last piece on Python and database querying I’m going to write for a while. In future posts I will tone down the technical stuff and get back to talking about data and predictive modelling.

 

# Import required code libraries

 

import pandas as pd

import numpy as np

import pandas.io.sql as psql

import cx_Oracle, os

 

 

# User prompts, and connect to Oracle database

 

ID_input = raw_input('Enter ID to find connectors for: ')

 

user = raw_input('Enter username for database: ')

 

password = raw_input('Enter password for database: ')

 

connectstr = user + '/' + password + '@database:0000'

 

connection = cx_Oracle.connect(connectstr)

 

 

# Query to get the database PIDM and name for the entered ID.

 

SQL = '''

SELECT

BANINST1.AA_ENTITY.PIDM,

BANINST1.AA_ENTITY.ID,

BANINST1.AA_ENTITY.FIRST_NAME,

BANINST1.AA_ENTITY.MI,

BANINST1.AA_ENTITY.LAST_NAME

 

FROM BANINST1.AA_ENTITY

 

WHERE

(BANINST1.AA_ENTITY.CHANGE_IND IS NULL

AND BANINST1.AA_ENTITY.ID=''' + "'" + target + "'" + ''')'''

 

df = psql.frame_query(SQL, con=connection)

 

name = str(df['FIRST_NAME']+' '+df['LAST_NAME'])

 

print('Searching for connections to: '+ name)

 

target_pidm = max(df['PIDM'])    # PIDM, not ID, is the primary key in my database.

 

 

# Find events

 

SQL = '''

WITH events AS (

SELECT

BANINST1.AA_EVENTS.EVNT_CRN

 

FROM BANINST1.AA_EVENTS

 

WHERE BANINST1.AA_EVENTS.PIDM = ''' +str(target_pidm)+ ''')

 

SELECT

events.EVNT_CRN,

BANINST1.AA_EVENTS.PIDM

 

FROM

events,

BANINST1.AA_EVENTS

 

WHERE

events.EVNT_CRN = AA_EVENTS.EVNT_CRN (+)'''

 

events = psql.frame_query(SQL, con=connection)

 

print("Found " + str(len(events)) + " people who attended the same events.")

 

 

# Find activities

 

SQL = '''WITH activities AS (

SELECT

BANINST1.AA_ACTIVITY_AND_LEADERS.ACTP_CODE

 

FROM

BANINST1.AA_ACTIVITY_AND_LEADERS

 

WHERE

BANINST1.AA_ACTIVITY_AND_LEADERS.PIDM = '''+str(target_pidm)+''')

 

SELECT

activities.ACTP_CODE,

BANINST1.AA_ACTIVITY_AND_LEADERS.PIDM

 

FROM activities,

BANINST1.AA_ACTIVITY_AND_LEADERS

 

WHERE

activities.ACTP_CODE = BANINST1.AA_ACTIVITY_AND_LEADERS.ACTP_CODE (+)

'''

 

activities = psql.frame_query(SQL, con=connection)

 

print("Found " + str(len(activities)) + " people with the same Activity codes.")

 

 

# Find employers

 

SQL = '''WITH employers AS (

SELECT

BANINST1.AA_ALL_EMPLOYMENT.EMPR_NAME

 

FROM

BANINST1.AA_ALL_EMPLOYMENT

 

WHERE

BANINST1.AA_ALL_EMPLOYMENT.PIDM = '''+str(target_pidm)+''')

 

SELECT

employers.EMPR_NAME,

BANINST1.AA_ALL_EMPLOYMENT.PIDM

 

FROM

employers,

BANINST1.AA_ALL_EMPLOYMENT

 

WHERE

employers.EMPR_NAME = BANINST1.AA_ALL_EMPLOYMENT.EMPR_NAME (+)'''

 

employers = psql.frame_query(SQL, con=connection)

 

print("Found " + str(len(employers)) + " people with the same Employers.")

 

 

# Find class years

 

SQL = '''WITH classes AS (

SELECT

BANINST1.AA_DEGREE.ACYR_CODE

 

FROM

BANINST1.AA_DEGREE

 

WHERE

BANINST1.AA_DEGREE.PIDM = '''+str(target_pidm)+''' )

 

SELECT

classes.ACYR_CODE,

BANINST1.AA_DEGREE.PIDM

 

FROM

classes,

BANINST1.AA_DEGREE

 

WHERE

classes.ACYR_CODE = BANINST1.AA_DEGREE.ACYR_CODE (+)'''

 

classes = psql.frame_query(SQL, con=connection)

 

print("Found " + str(len(classes)) + " people with the same class year(s).")

 

 

# Find degrees

 

SQL = '''

WITH degrees AS (

SELECT

BANINST1.AA_DEGREE.DEGC_CODE

 

FROM

BANINST1.AA_DEGREE

 

WHERE

BANINST1.AA_DEGREE.PIDM = '''+str(target_pidm)+'''

)

 

SELECT

degrees.DEGC_CODE,

BANINST1.AA_DEGREE.PIDM

 

FROM

degrees,

BANINST1.AA_DEGREE

 

WHERE

degrees.DEGC_CODE = BANINST1.AA_DEGREE.DEGC_CODE (+)'''

 

degrees = psql.frame_query(SQL, con=connection)

 

print("Found " + str(len(degrees)) + " people with the same degree(s).")

 

 

# Find high school

 

SQL = '''

WITH high_school AS (

SELECT

BANINST1.AA_HIGH_SCHOOL.HIGH_SCHOOL_CODE

 

FROM

BANINST1.AA_HIGH_SCHOOL

 

WHERE

BANINST1.AA_HIGH_SCHOOL.PIDM = '''+str(target_pidm)+'''

)

 

SELECT

high_school.HIGH_SCHOOL_CODE,

BANINST1.AA_HIGH_SCHOOL.PIDM

 

FROM

high_school,

BANINST1.AA_HIGH_SCHOOL

 

WHERE

high_school.HIGH_SCHOOL_CODE = BANINST1.AA_HIGH_SCHOOL.HIGH_SCHOOL_CODE (+)'''

 

 

hs = psql.frame_query(SQL, con=connection)

 

print("Found " + str(len(hs)) + " people from the same high school.")

 

 

 

# Find cities

 

SQL = '''

WITH cities AS (

SELECT

SATURN.SPRADDR.SPRADDR_CITY

 

FROM

SATURN.SPRADDR

 

WHERE

SATURN.SPRADDR.SPRADDR_PIDM = '''+str(target_pidm)+'''

)

 

SELECT DISTINCT

cities.SPRADDR_CITY,

SATURN.SPRADDR.SPRADDR_PIDM AS PIDM

 

FROM

cities,

SATURN.SPRADDR

 

WHERE

cities.SPRADDR_CITY = SATURN.SPRADDR.SPRADDR_CITY (+)'''

 

 

cities = psql.frame_query(SQL, con=connection)

 

print("Found " + str(len(cities)) + " people with addresses in the same city/cities.")

 

 

# Find age

 

SQL = '''

WITH age AS (

SELECT

BANINST1.AA_ENTITY.AGE

 

FROM

BANINST1.AA_ENTITY

 

WHERE

BANINST1.AA_ENTITY.PIDM = '''+str(target_pidm)+'''

)

 

SELECT

age.AGE,

BANINST1.AA_ENTITY.PIDM

 

FROM

age,

BANINST1.AA_ENTITY

 

WHERE

age.AGE = BANINST1.AA_ENTITY.AGE (+)'''

 

 

age = psql.frame_query(SQL, con=connection)

 

print("Found " + str(len(age)) + " people of the same age.")

 

 

# Find cross-references

 

SQL = '''

WITH xref AS (

SELECT

ALUMNI.APRXREF.APRXREF_XREF_PIDM

 

FROM

ALUMNI.APRXREF

 

WHERE

ALUMNI.APRXREF.APRXREF_PIDM = '''+str(target_pidm)+'''

)

 

SELECT

ALUMNI.APRXREF.APRXREF_XREF_PIDM,

ALUMNI.APRXREF.APRXREF_PIDM AS PIDM

 

FROM

xref,

ALUMNI.APRXREF

 

WHERE

xref.APRXREF_XREF_PIDM = ALUMNI.APRXREF.APRXREF_XREF_PIDM (+)'''

 

 

xref = psql.frame_query(SQL, con=connection)

 

print("Found " + str(len(xref)) + " people who share the same cross-references.")

 

 

 

# Concatenate all results

 

print("Joining results.")

 

results = pd.concat([events, activities, employers, classes, degrees, hs, cities, age, xref])

 

# Remove unneeded columns

 

results = results.drop(['ACTP_CODE', 'ACYR_CODE', 'AGE', 'DEGC_CODE', 'EMPR_NAME', 'EVNT_CRN', 'HIGH_SCHOOL_CODE', 'SPRADDR_CITY', 'APRXREF_XREF_PIDM'], axis=1)

 

print("Found " + str(len(results)) + " total results.")

 

## Count up the number of times each PIDM appears in the results. (A few steps here.)

 

# Add a column called ‘count’ and populate every row with a ‘1’

 

results['count'] = 1

 

# Group by PIDM and sum on ‘count’ to get ‘hits per PIDM’.

 

results = results.join(results.groupby(results['PIDM'])['count'].sum(), on=['PIDM'], rsuffix='_sum')

 

# De-dupe rows in order to get one result per PIDM.

 

results = results.drop_duplicates()

 

# Sort results in descending order by the number of hits.

 

results.sort(columns='count_sum', axis=0, ascending=False, inplace=True)

 

# The file is potentially huge, so prompt the user to limit it by hit count.

 

cutoff = raw_input('How many hits would you like to limit by? ')

 

# Keep rows with minimum number of hits.

 

results = results[results['count_sum'] >= int(cutoff)]

 

# Get rid of the ‘count’ column - not needed.

 

results = results.drop(['count'], axis=1)

 

print("Found " + str(len(results)) + " unique results, with at least "+cutoff+" hits.")

 

 

## Join names to results

 

print('Preparing results and saving.')

 

SQL = '''

SELECT

BANINST1.AA_ENTITY.PIDM,

BANINST1.AA_ENTITY.ID,

BANINST1.AA_ENTITY.NAME_PREFIX,

BANINST1.AA_ENTITY.FIRST_NAME,

BANINST1.AA_ENTITY.MI,

BANINST1.AA_ENTITY.LAST_NAME,

BANINST1.AA_ENTITY.NAME_SUFFIX,

BANINST1.AA_ENTITY.AGE,

BANINST1.AA_ENTITY.DEAD_IND,

BANINST1.AA_DONOR_TYPE_HIGHEST.DONR_CODE

 

FROM

BANINST1.AA_ENTITY,

BANINST1.AA_DONOR_TYPE_HIGHEST

 

WHERE

BANINST1.AA_ENTITY.PIDM = BANINST1.AA_DONOR_TYPE_HIGHEST.PIDM (+)

AND BANINST1.AA_ENTITY.CHANGE_IND IS NULL

'''

 

results = pd.merge(results, psql.frame_query(SQL, con=connection), how='left', on=['PIDM'], sort=False)

 

results.to_csv(target + ' - ' + name + '.csv')

 

print('All done!')

 

 

connection.close()

 

25 March 2014

An all-SQL way to automate RFM scoring

Filed under: RFM, SQL — Tags: , , , — kevinmacdonell @ 8:44 pm

In my last post I described how to automate RFM scoring using the scripting language Python and pandas. Automating RFM scoring is a great way to learn a bit of Python — but on its own it’s a lousy reason for learning Python. It was pointed out to me that you don’t really need to call on the power of Python if you’re already comfortable with (or learning to use) SQL, and I agree.

Shortly after publishing the Python post, I got an email from Bruce Higgins, who works in the Alumni Affairs & Development office at Cornell University. His all-SQL solution is shown below. I leave it to readers to figure out how to make it work for them. In the wake of my post there was some great discussion and code-sharing on the Prospect-DMM discussion list, and I encourage you to look up that thread.

You’ll notice that Bruce has hard-coded the dollar values used for scoring “Monetary value” instead of dynamically dividing the file into quintiles. As he points out, if you’re scoring the same database over and over, it’s not likely these dollar thresholds are going to change much over time. You will of course need to make adjustments for your own donor data.

WITH Five_years AS
(SELECT id,
 SUM((gift_amount)) AS cash,
 MAX(decode(.gift_fiscalyear, 2014, 5, 2013, 4, 2012, 3, 2011, 2, 2010, 1)) AS recency,
 COUNT(DISTINCT(gift_fiscalyear)) AS frequency
 FROM
(SQL specifying the last five years cash transactions from our warehouse)
GROUP BY id
SELECT Five_years.id,
CASE
WHEN Five_years.cash >= 10000 THEN 5
WHEN Five_years.cash >= 2500 THEN 4
WHEN Five_years.cash >= 1000 THEN 3
WHEN Five_years.cash >= 500 THEN 2
ELSE 1
END + Five_years.recency + Five_years.frequency AS rfm_score
FROM Five_years
ORDER BY five_years.name

23 February 2014

Really swell SQL: Why you must know PIVOT and WITH

Filed under: Data, SQL — Tags: , , , — kevinmacdonell @ 2:37 pm

Some data manipulation needs are so common, and their solutions so elusive, that when I find an answer, I just have to share. Today I  will show you two problems and their solutions, wrapped up into one.

The first problem: Given a database query pulling any kind of transactional data (involving dollars, say), how do you go about aggregating the data in more than one way, within the confines of a single query, so that the various aggregations appear on one row of data?

The second problem: Given a database query that returns rows that are duplicates (by constituent ID, for example) due to some category that appears multiple times per ID (category code, gift year or whatever), how do you get rid of those duplicate rows without losing any of the data? In other words, how do you pivot the data so that the categories are added as columns rather than rows, while preserving the structure of all the other data that you don’t want to pivot?

That’s as clear as I can describe it … believe me, if you work with data, you encounter these situations all the time. An illustration might be helpful. Here is some donor gift data, limited to just Donor ID, Fiscal Year Code, and Giving Total, which is the sum of Gift Amount by ID and fiscal year. (For the sake of simplicity, I am using a code for fiscal year instead of extracting the fiscal year from actual gift dates, and I am limiting the data to three fiscal years of donation history.)

fig1

This is fine, but notice that some IDs appear on multiple rows. What if we want only one row of data per donor? And what if we want an overall giving total for the donor in one column, and then yearly totals for the three fiscal years in our data — one column and total per year? Like this:

fig2

Today I’ll show you a couple of techniques to combine different types of aggregations into a single, short SQL statement that isn’t difficult to read. (All of this works in Oracle 11g — I can’t speak to other systems’ implementations of SQL.) My primary need for writing such a query is for embedding custom SQL in a Tableau data source for the purpose of reporting. I don’t really require fancy SQL to render this precise example in Tableau. This is just a very simple example which you can adapt to more complex situations.

The SQL statement that produced our first set of results looks like this. Again for simplicity, let’s assume we require only one table, called GIFTS, so there are no joins to consider:

SELECT GIFTS.ID,
 SUM ( GIFTS.GIFT_AMT ) AS GIVING_TOTAL,
 GIFTS.FISC_YR_CODE

 FROM
 GIFTS

 WHERE 
 GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13')

 GROUP BY
 GIFTS.ID,
 GIFTS.FISC_YR_CODE

We end up with one type of aggregation: The SUM of all gifts, grouped by ID and Fiscal Year. But there is no way to include a SUM by ID only, because if we ask for FISC_YR_CODE in the SELECT, we have to include it in the GROUP BY as well. We will deal with that issue a little later on.

First let’s deal with the duplicate IDs. Each donor ID appears once for every fiscal year out of the three that the donor made a gift. We want to pivot the data, so that the rows become columns instead. Introducing … the PIVOT operator! Have a look at the following SQL:

SELECT *

 FROM (

  SELECT GIFTS.ID,
  GIFTS.GIFT_AMT,
  GIFTS.FISC_YR_CODE

  FROM
  GIFTS

  WHERE 
  GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13')

 )

PIVOT (
  SUM ( GIFT_AMT ) AS year FOR ( FISC_YR_CODE )
  IN ( 'FY11' AS FY11, 'FY12' AS FY12, 'FY13' AS FY13 )
  )

The inner SELECT statement gets all the data and hands it to the PIVOT operator, and the outer SELECT asks for all the resulting columns from that pivot. The three categories we want to appear as columns (that is, each fiscal year) are specified with IN, and the word “year” is appended to each column name. (This text can be anything you want.) I have added aliases (using AS) in order to prevent the single quote marks, which are required, from appearing in the results.

The content of the “cells” will be the SUM of gift amounts for each ID and fiscal year. The result will look like this:

fig3

Useful, eh? PIVOT does have some limitations, which I will discuss later. Let’s press on. We’re definitely getting close to realizing our goal. Now we just need to have one other aggregation (total giving by donor) appear on the same line. The problem, noted earlier, is that to SUM on giving by ID only, we are forced to leave out all other columns, in this case the fiscal year code. The SQL for giving by donor looks like this:

SELECT GIFTS.ID,
 SUM ( GIFTS.GIFT_AMT ) AS GIVING_TOTAL

 FROM
 GIFTS

 WHERE 
 GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13')

 GROUP BY
 GIFTS.ID

Clearly we need not one but two queries, the first to get total giving by donor, and the second to get giving by ID and year (pivoted), and then join the two result sets as if they were tables. Normally this would call for creating temporary tables in the database to store the results for further querying, but data analysts are not DBAs: We may not have sufficient permissions to perform this operation on the database. (As far as I am aware, by default Tableau disallows table creation, perhaps to reassure database admins?)

Fortunately there is a solution. It’s not well-known — my fat Oracle 11g SQL reference book doesn’t even mention it — so you’ll need to do some online searches to find out more. Introducing … the WITH clause!

It works like this. Each WITH clause contains a SELECT statement that returns a result set that behaves like a temporary table, and which can be referenced in other SELECT statements. This example uses two clauses, aliased as ‘total_giving’ and ‘yearly_giving’. A final SELECT statement joins the two as if they were tables.

WITH
 total_giving AS
 (

SELECT GIFTS.ID,
 SUM ( GIFTS.GIFT_AMT ) AS GIVING_TOTAL

 FROM
 GIFTS

 WHERE 
 GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13')

 GROUP BY
 GIFTS.ID

 ),

 yearly_giving AS
 (

SELECT *

 FROM (

  SELECT GIFTS.ID,
  GIFTS.GIFT_AMT,
  GIFTS.FISC_YR_CODE

  FROM
  GIFTS

  WHERE 
  GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13')

  GROUP BY
  GIFTS.ID,
  GIFTS.FISC_YR_CODE

 )

PIVOT (
  SUM ( GIFT_AMT ) AS year FOR (FISC_YR_CODE)
  IN ('FY11' AS FY11, 'FY12' AS FY12, 'FY13' AS FY13)
  )
 )

SELECT
 total_giving.ID,
 total_giving.GIVING_TOTAL,
 yearly_giving.FY11_YEAR,
 yearly_giving.FY12_YEAR,
 yearly_giving.FY13_YEAR

 FROM
 total_giving,
 yearly_giving

 WHERE
 total_giving.ID = yearly_giving.ID

You can accomplish what I’ve done in this example without using the WITH clause, but as an SQL statement gets more complex, it gets harder to read. As far as I know, you can use as many WITH clauses as you like, allowing you to build complex queries while preserving the neat organization and therefore readability of the SQL statement. Readability is important when you must modify or debug the statement weeks later. Oracle error messages are maddeningly uninformative!

I encourage you to research WITH and PIVOT on your own, but here are a few pointers:

  • Type ‘WITH’ only once. Multiple WITH clauses are separated by a comma. (No comma following the last clause.)
  • A WITH clause can reference any previously-defined WITH clause as if it were an already-existing table. Very useful.
  • When using PIVOT, notice that we must specify all the categories in the pivot (fiscal year code, in this case). Unfortunately, PIVOT is not capable of dynamically pivoting whatever data it happens to find. This is just the way SQL works. If you want to pivot data without knowing in advance what the categories are, you may have to use a true programming language. I use Python, but that is not always an option.
  • You can use any aggregation you like in the PIVOT — COUNT, SUM, MAX, etc.
  • Don’t include the schema and table names in the identifiers named in PIVOT, or you will get an error.

I hope this has been of some help. The nuts and bolts of pulling data for analysis may not be the sexiest part of a data analyst’s job, but brushing up on these skills (whatever your database system is) will pay off in the long run.

Blog at WordPress.com.