SELECT t1.ID, t1.first_name, t1.last_name,
       UTL_MATCH.jaro_winkler_similarity(t1.first_name, t1.last_name) AS jw
FROM schema.persons t1
ORDER BY jw DESC
SELECT t1.ID, t1.first_name, t1.last_name,
       t2.ID, t2.first_name, t2.last_name,
       UTL_MATCH.jaro_winkler_similarity(
           t1.first_name || ' ' || t1.last_name,
           t2.first_name || ' ' || t2.last_name ) AS jw
FROM schema.persons t1
INNER JOIN schema.persons t2
    ON t1.ID != t2.ID
   AND UTL_MATCH.jaro_winkler_similarity(
           t1.first_name || ' ' || t1.last_name,
           t2.first_name || ' ' || t2.last_name ) >= 98
ORDER BY jw DESC
WITH sums AS (
    SELECT t1.id, t1.fiscal_year, LOG(10, SUM(t1.amount)) AS yr_sum
    FROM gifts t1
    GROUP BY t1.id, t1.fiscal_year),
slopes AS (
    SELECT DISTINCT sums.id,
           REGR_SLOPE(sums.yr_sum, sums.fiscal_year)
               OVER (PARTITION BY sums.id) AS slope
    FROM sums)
SELECT slopes.id, slopes.slope,
       CASE WHEN slopes.slope IS NULL  THEN 'Null'
            WHEN slopes.slope >=  0.1  THEN 'Steeply Rising'
            WHEN slopes.slope >=  0.05 THEN 'Moderately Rising'
            WHEN slopes.slope >=  0.01 THEN 'Slightly Rising'
            WHEN slopes.slope >  -0.01 THEN 'Flat'
            WHEN slopes.slope >  -0.05 THEN 'Slightly Falling'
            WHEN slopes.slope >  -0.1  THEN 'Moderately Falling'
            ELSE 'Steeply Falling'
       END AS description
FROM slopes
We’ve all faced this problem: Duplicate rows in our data. It messes up our reports and causes general confusion and consternation for people we supply data to. The good news is you can take care of this problem fairly easily in an SQL query using a handy function called LISTAGG. I discovered this almost by accident just this week, and I’m delighted to share.
Duplicate rows are a result of what we choose to include in our database query. If we are expecting to get only one row for each constituent ID, then we should be safe in asking for Gender, or Name Prefix, or Preferred Address. There should be only one of each of these things for each ID, and therefore there won’t be duplicate rows. These are examples of one-to-one relationships.
We get into trouble when we ask for something that can have multiple values for a single ID. That’s a one-to-many relationship. Examples: Any address (not just Preferred), Degree (some alumni will have more than one), Category Code (a person can be an alum and a parent and a staff member, all at the same time) … and so on. Below, the first constituent is duplicated on Category, and the second constituent is duplicated on Degree.
That’s not the worst of it; when one ID is duplicated on two or more elements, things get really messy. If A0001 above had three Category codes and two degrees, it would appear six times (3 x 2) in the result set. Notice that we are really talking about duplicate IDs — not duplicate rows. Each row is, in fact, unique. This is perfectly correct behaviour on the part of the database and query. Try explaining that to your boss, however! You’ve got to do something to clean this up.
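The multiplication effect is easy to reproduce on a toy example. Here is a minimal sketch using Python's built-in sqlite3 module (the table names, category codes, and degree codes are all made up for illustration), showing how three category rows joined to two degree rows yield six rows for a single ID:

```python
import sqlite3

# Hypothetical miniature of the scenario above: one constituent (A0001)
# with three category codes and two degrees.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE category (id TEXT, catg_code TEXT)")
cur.execute("CREATE TABLE degree (id TEXT, degc_code TEXT)")
cur.executemany("INSERT INTO category VALUES (?, ?)",
                [("A0001", "ALUM"), ("A0001", "PRNT"), ("A0001", "STAF")])
cur.executemany("INSERT INTO degree VALUES (?, ?)",
                [("A0001", "BA"), ("A0001", "MBA")])

# Joining both one-to-many tables multiplies the rows: 3 x 2 = 6.
rows = cur.execute("""
    SELECT c.id, c.catg_code, d.degc_code
    FROM category c
    INNER JOIN degree d ON c.id = d.id
""").fetchall()
print(len(rows))  # 6 -- each row unique, but the ID appears six times
conn.close()
```

Every row really is distinct (a different category/degree combination), which is exactly why the database is behaving correctly even though the output looks wrong.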
What isn’t going to work is pivoting. In SQL, the PIVOT clause creates a new column for each unique category of some data element, plus an aggregation of the values for that category. I’ve written about this here: Really swell SQL: Why you must know PIVOT and WITH. Using PIVOT is going to give you something like the result below. (I’ve put in 1’s and nulls to indicate the presence or absence of data.)
What we really want is THIS:
The answer is this: String aggregation. You may already be familiar with aggregating quantitative values, usually by counting or summing. When we query on a donor’s giving, we usually don’t want a row for each gift — we want the sum of all giving. Easy. For strings — categorical data such as Degree or Category Code — it’s a different situation. Sure, we can aggregate by counting. For example, I can ask for a count of Degree, and that will limit the data to one row per ID. But for most reporting scenarios, we want to display the data itself. We want to know what the degrees are, not just how many of them there are.
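String aggregation isn't unique to Oracle, either. SQLite, for instance, has GROUP_CONCAT, which plays roughly the role LISTAGG plays in Oracle (though SQLite does not guarantee the order of the concatenated values the way WITHIN GROUP does). A minimal sketch via Python's sqlite3, with a made-up table and category codes, just to show the one-row-per-ID effect described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE entity (id TEXT, catg_code TEXT)")
cur.executemany("INSERT INTO entity VALUES (?, ?)",
                [("A0001", "ALUM"), ("A0001", "PRNT"), ("A0001", "STAF"),
                 ("A0002", "ALUM")])

# GROUP_CONCAT collapses the duplicate IDs into one row each, with the
# category codes joined by ", " (SQLite's analog of Oracle's LISTAGG).
rows = cur.execute("""
    SELECT id, GROUP_CONCAT(catg_code, ', ') AS category
    FROM entity
    GROUP BY id
    ORDER BY id
""").fetchall()
for r in rows:
    print(r)
conn.close()
```

Two rows come back — one per ID — with all the codes packed into a single field.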
The following simplified SQL will get you the result you see above for ID and Category — one row per ID, and a series of category codes separated by commas. This code works in Oracle 11g — I can’t speak to other systems’ implementations of SQL. (For other Oracle-based methods, check out this page. To accomplish the same thing in a non-Oracle variant of SQL called T-SQL, scroll down to the Postscript below.)
Obviously you will need to replace the sample text with your own schema name, table names, and field identifiers.
SELECT
SCHEMA.ENTITY.ID,
LISTAGG ( SCHEMA.ENTITY.CATG_CODE, ', ' )
    WITHIN GROUP ( ORDER BY SCHEMA.ENTITY.CATG_CODE ) AS CATEGORY
FROM
SCHEMA.ENTITY
GROUP BY
SCHEMA.ENTITY.ID
The two arguments given to LISTAGG are the field CATG_CODE and a string consisting of a comma and a space. The string, which can be anything you want, is inserted between each pair of Category Codes. The ORDER BY inside the WITHIN GROUP clause sorts the category codes in alphabetical order.
LISTAGG accepts only two arguments, but if you want to include two or more fields and/or strings in the aggregation, you can just concatenate them inside LISTAGG using “pipe” characters (||). The following is an example using Degree Code and Degree Code Description, with a space between them. To get even fancier, I have replaced the comma separator with the character code for “new line,” which puts each degree on its own line — handy for readability in a list or report. (This works fine when used in Custom SQL in Tableau. For display in Excel, you may have to use a carriage return — char(13) — instead of a line feed.) You will also notice that I have specified a join to a table containing the Degree data.
SELECT
SCHEMA.ENTITY.ID,
LISTAGG ( SCHEMA.DEGREE.DEGC_CODE || ' ' || SCHEMA.DEGREE.DEGC_DESC, chr(10) )
    WITHIN GROUP ( ORDER BY SCHEMA.DEGREE.DEGC_CODE, SCHEMA.DEGREE.DEGC_DESC ) AS DEGREES
FROM
SCHEMA.ENTITY
INNER JOIN SCHEMA.DEGREE ON ( SCHEMA.ENTITY.ID = SCHEMA.DEGREE.ID )
GROUP BY
SCHEMA.ENTITY.ID
Before I discovered this capability, the only other automated way I knew how to handle it was in Python. All the previous methods I’ve used were at least partially manual or just very cumbersome. I don’t know how long string aggregation in Oracle has been around, but I’m grateful that a data problem I have struggled with for years has finally been dispensed with. Hurrah!
~~~~~~
POSTSCRIPT
After publishing this I had an email response from David Logan, Director of Philanthropic Analytics at Children’s Mercy in Kansas City, Missouri. He uses T-SQL, the proprietary SQL dialect used by Microsoft in SQL Server. Some readers might find this helpful, so I will quote from his email:
I was reading your post this morning about using the LISTAGG function in Oracle SQL to concatenate values from duplicate rows into a list. A couple of months ago I was grappling with this very problem so that I could create a query of Constituents that included a list of all their phone numbers in a single field. LISTAGG is not available for those of us using T-SQL … the solution for T-SQL is to use XML PATH().
Here’s a link to a source that demonstrates how to use it: http://sqlandme.com/2011/04/27/tsql-concatenate-rows-using-for-xml-path/.
One key difference in this method from LISTAGG is that the delimiter character (such as the comma—or characters as in your example where you have comma space) is added before each value returned so the raw string returned will have this character before the first value. You can use the STUFF() function to eliminate this first delimiter.
Here’s my solution using XML PATH() to return a list of phone numbers from our donor database:
SELECT CA.ID,
       STUFF(
           ( SELECT '; ' + PH.NUM + ' [' +
                    re7_ReportingTools.dbo.GetLongDescription(PH.PHONETYPEID) + ']'
             FROM re7_db.dbo.CONSTIT_ADDRESS_PHONES CAP
             INNER JOIN re7_db.dbo.PHONES PH
                 ON CAP.PHONESID = PH.PHONESID
             WHERE CAP.CONSTITADDRESSID = CA.ID
               AND PH.DO_NOT_CALL = 0
             ORDER BY CAP.SEQUENCE
             FOR XML PATH('') ),
           1, 2, '' ) PHONENUMS
FROM re7_db.dbo.CONSTIT_ADDRESS CA
GROUP BY CA.ID
I concatenated my list using a semi-colon followed by a space ‘; ‘. I used the STUFF() function to remove the leading semi-colon and space.
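The cleanup step David describes can be sketched in plain Python (the phone numbers here are made up): FOR XML PATH('') prepends the delimiter to every value, including the first, so the leading delimiter has to be sliced off, which is all STUFF(..., 1, 2, '') is doing.

```python
def strip_leading_delimiter(aggregated, delimiter="; "):
    """Mimic T-SQL STUFF(s, 1, len(delimiter), ''): drop the delimiter
    that FOR XML PATH('') prepends before the first value."""
    if aggregated.startswith(delimiter):
        return aggregated[len(delimiter):]
    return aggregated

# FOR XML PATH('') effectively builds the string by prefixing every value:
raw = "".join("; " + n for n in ["555-0100", "555-0199"])
print(raw)                          # ; 555-0100; 555-0199
print(strip_leading_delimiter(raw)) # 555-0100; 555-0199
```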
Here are a few sample rows that are returned (note I’ve “greeked” the prefixes for donor privacy):
Figuring out connections is a big part of major gift fundraising. A development office might have a great list of prospective donors in hand, but those prospects will languish untouched if the right people who can make introductions are never found. While cultivating relationships with well-connected people is clearly important, research also plays a role in suggesting possible go-betweens. And to the extent that research involves sifting through data to find connections, it’s possible to automate some of the work.
Wouldn’t it be great to have software that asks you to name a prospective donor, and then spits out a list of people in your database who are most likely to be associates of that person? Sure it would!
Social media applications do a version of this, suggesting “people you may know,” based on shared connections and attributes such as schools attended, professional group memberships, employment information, and stated interests. I first encountered a higher-ed advancement application of this concept via Erich Preisendorfer of the University of New Hampshire. A while back, Erich explained his approach in a guest post for CoolData. (Read it here: Finding connections to your major gift prospects in your data.) I was recently in touch with Erich and he reports that the tool is still in use by prospect researchers at UNH and is still turning up valuable connections.
I tried to build something similar using our reporting tools, but it was a slow, lumbering beast, and not very user-friendly for the people most likely to be interested in using it — prospect researchers, mainly. I never gave up on the idea of building something better, and finally on a quiet weekend a few months ago, I found two or three hours to mess around with a Python script. Given the ID of a person, this script makes multiple queries of the database and joins the results together to produce a ranked list of other people who might be associated with the input ID.
When the script is run, it prompts the user for an ID, then an Oracle database password. I’ll get into more detail later, but here is what the output looks like in the Python shell of IDLE, the development tool that I used to write the script, when I use my own ID as the one to find connections for. (I have blanked out the ID and password.)
The script goes through these steps: It prompts for a target ID and database credentials, and looks up the target’s internal key (PIDM) and name. It then queries, one shared attribute at a time, for people who have the same events, activities, employers, class years, degrees, high school, city, age, or cross-references as the target. It concatenates all those results, counts how many times each person turns up (the “hit count”), de-dupes, and sorts in descending order. Finally it prompts for a minimum hit count, joins names to the surviving rows, and saves the output to a file.
Here are the top few rows of the output. I have removed some columns for simplicity, and blanked out some personal information. The count_sum column is the “hit count”, and the list is sorted in descending order by count_sum. The person being searched on is always at the top of the list. This is not an especially impressive example, because I am not an alumnus. My list of likely connections is made up of co-workers and people from my hometown who happen to be the same age as me. The woman in the third row of the data was someone I had a crush on in junior high school but haven’t seen in 27 years … If I were a prospective donor, this search would be a bust. Clearly this method requires quality alumni data in order to be useful.
That’s how it functions. We will dig into the actual code shortly. But first, let me say that my primary aim is to share this with you as an idea. I can’t help you with installing Python or debugging or any of that. All I can do is show you my code and hope you can modify it to make it work for you. If you think this is cool and useful but don’t want to learn any code, I suggest you team up with someone from IT or your Advancement Services team, or a student or prof in the Computing Science department.
As I said, I’m using Python and a code library for handling data called pandas. (For some background, see my earlier posts: Getting bitten by Python and Automate RFM scoring of your donors with this Python script.) The great thing about pandas is that you can load data into in-memory structures called DataFrames, which are like data tables in that you can perform joins and sort, rank, de-dupe, add calculated columns, and so on.
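The heart of the ranking — pool all the per-attribute match lists, then count how many times each person turns up — doesn’t actually depend on pandas. Here is a stripped-down sketch of that idea using only the standard library’s Counter; the IDs and attribute lists are invented for illustration:

```python
from collections import Counter

# Hypothetical per-attribute match lists: each list holds the IDs of
# people who share that attribute (event, activity, city, ...) with
# the target person.
events     = [101, 102, 103]
activities = [102, 104]
cities     = [101, 102, 105]

# Pooling the lists and counting occurrences gives each ID's "hit
# count" -- the same effect the script achieves with pandas concat,
# groupby, and sum.
hits = Counter(events + activities + cities)
ranked = hits.most_common()  # sorted by hit count, descending
print(ranked[0])  # (102, 3): ID 102 shares three attributes with the target
```

pandas earns its keep once the lists are real query results with extra columns to drop, names to join on, and a CSV to write, but the counting logic is this simple.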
A few closing comments before I show you the code:
AND, one final, final comment: This is probably the last piece on Python and database querying I’m going to write for a while. In future posts I will tone down the technical stuff and get back to talking about data and predictive modelling.
# Import required code libraries
import pandas as pd
import numpy as np
import pandas.io.sql as psql
import cx_Oracle, os

# User prompts, and connect to Oracle database
target = raw_input('Enter ID to find connectors for: ')
user = raw_input('Enter username for database: ')
password = raw_input('Enter password for database: ')
connectstr = user + '/' + password + '@database:0000'
connection = cx_Oracle.connect(connectstr)

# Query to get the database PIDM and name for the entered ID.
SQL = '''
SELECT
BANINST1.AA_ENTITY.PIDM,
BANINST1.AA_ENTITY.ID,
BANINST1.AA_ENTITY.FIRST_NAME,
BANINST1.AA_ENTITY.MI,
BANINST1.AA_ENTITY.LAST_NAME
FROM
BANINST1.AA_ENTITY
WHERE
(BANINST1.AA_ENTITY.CHANGE_IND IS NULL
AND BANINST1.AA_ENTITY.ID = ''' + "'" + target + "'" + ''')'''

df = psql.frame_query(SQL, con=connection)
name = str(df['FIRST_NAME'] + ' ' + df['LAST_NAME'])
print('Searching for connections to: ' + name)
target_pidm = max(df['PIDM'])  # PIDM, not ID, is the primary key in my database.

# Find events
SQL = '''
WITH events AS (
SELECT BANINST1.AA_EVENTS.EVNT_CRN
FROM BANINST1.AA_EVENTS
WHERE BANINST1.AA_EVENTS.PIDM = ''' + str(target_pidm) + ''')
SELECT events.EVNT_CRN, BANINST1.AA_EVENTS.PIDM
FROM events, BANINST1.AA_EVENTS
WHERE events.EVNT_CRN = BANINST1.AA_EVENTS.EVNT_CRN (+)'''

events = psql.frame_query(SQL, con=connection)
print("Found " + str(len(events)) + " people who attended the same events.")

# Find activities
SQL = '''
WITH activities AS (
SELECT BANINST1.AA_ACTIVITY_AND_LEADERS.ACTP_CODE
FROM BANINST1.AA_ACTIVITY_AND_LEADERS
WHERE BANINST1.AA_ACTIVITY_AND_LEADERS.PIDM = ''' + str(target_pidm) + ''')
SELECT activities.ACTP_CODE, BANINST1.AA_ACTIVITY_AND_LEADERS.PIDM
FROM activities, BANINST1.AA_ACTIVITY_AND_LEADERS
WHERE activities.ACTP_CODE = BANINST1.AA_ACTIVITY_AND_LEADERS.ACTP_CODE (+)'''

activities = psql.frame_query(SQL, con=connection)
print("Found " + str(len(activities)) + " people with the same Activity codes.")

# Find employers
SQL = '''
WITH employers AS (
SELECT BANINST1.AA_ALL_EMPLOYMENT.EMPR_NAME
FROM BANINST1.AA_ALL_EMPLOYMENT
WHERE BANINST1.AA_ALL_EMPLOYMENT.PIDM = ''' + str(target_pidm) + ''')
SELECT employers.EMPR_NAME, BANINST1.AA_ALL_EMPLOYMENT.PIDM
FROM employers, BANINST1.AA_ALL_EMPLOYMENT
WHERE employers.EMPR_NAME = BANINST1.AA_ALL_EMPLOYMENT.EMPR_NAME (+)'''

employers = psql.frame_query(SQL, con=connection)
print("Found " + str(len(employers)) + " people with the same Employers.")

# Find class years
SQL = '''
WITH classes AS (
SELECT BANINST1.AA_DEGREE.ACYR_CODE
FROM BANINST1.AA_DEGREE
WHERE BANINST1.AA_DEGREE.PIDM = ''' + str(target_pidm) + ''')
SELECT classes.ACYR_CODE, BANINST1.AA_DEGREE.PIDM
FROM classes, BANINST1.AA_DEGREE
WHERE classes.ACYR_CODE = BANINST1.AA_DEGREE.ACYR_CODE (+)'''

classes = psql.frame_query(SQL, con=connection)
print("Found " + str(len(classes)) + " people with the same class year(s).")

# Find degrees
SQL = '''
WITH degrees AS (
SELECT BANINST1.AA_DEGREE.DEGC_CODE
FROM BANINST1.AA_DEGREE
WHERE BANINST1.AA_DEGREE.PIDM = ''' + str(target_pidm) + ''')
SELECT degrees.DEGC_CODE, BANINST1.AA_DEGREE.PIDM
FROM degrees, BANINST1.AA_DEGREE
WHERE degrees.DEGC_CODE = BANINST1.AA_DEGREE.DEGC_CODE (+)'''

degrees = psql.frame_query(SQL, con=connection)
print("Found " + str(len(degrees)) + " people with the same degree(s).")

# Find high school
SQL = '''
WITH high_school AS (
SELECT BANINST1.AA_HIGH_SCHOOL.HIGH_SCHOOL_CODE
FROM BANINST1.AA_HIGH_SCHOOL
WHERE BANINST1.AA_HIGH_SCHOOL.PIDM = ''' + str(target_pidm) + ''')
SELECT high_school.HIGH_SCHOOL_CODE, BANINST1.AA_HIGH_SCHOOL.PIDM
FROM high_school, BANINST1.AA_HIGH_SCHOOL
WHERE high_school.HIGH_SCHOOL_CODE = BANINST1.AA_HIGH_SCHOOL.HIGH_SCHOOL_CODE (+)'''

hs = psql.frame_query(SQL, con=connection)
print("Found " + str(len(hs)) + " people from the same high school.")

# Find cities
SQL = '''
WITH cities AS (
SELECT SATURN.SPRADDR.SPRADDR_CITY
FROM SATURN.SPRADDR
WHERE SATURN.SPRADDR.SPRADDR_PIDM = ''' + str(target_pidm) + ''')
SELECT DISTINCT cities.SPRADDR_CITY, SATURN.SPRADDR.SPRADDR_PIDM AS PIDM
FROM cities, SATURN.SPRADDR
WHERE cities.SPRADDR_CITY = SATURN.SPRADDR.SPRADDR_CITY (+)'''

cities = psql.frame_query(SQL, con=connection)
print("Found " + str(len(cities)) + " people with addresses in the same city/cities.")

# Find age
SQL = '''
WITH age AS (
SELECT BANINST1.AA_ENTITY.AGE
FROM BANINST1.AA_ENTITY
WHERE BANINST1.AA_ENTITY.PIDM = ''' + str(target_pidm) + ''')
SELECT age.AGE, BANINST1.AA_ENTITY.PIDM
FROM age, BANINST1.AA_ENTITY
WHERE age.AGE = BANINST1.AA_ENTITY.AGE (+)'''

age = psql.frame_query(SQL, con=connection)
print("Found " + str(len(age)) + " people of the same age.")

# Find cross-references
SQL = '''
WITH xref AS (
SELECT ALUMNI.APRXREF.APRXREF_XREF_PIDM
FROM ALUMNI.APRXREF
WHERE ALUMNI.APRXREF.APRXREF_PIDM = ''' + str(target_pidm) + ''')
SELECT ALUMNI.APRXREF.APRXREF_XREF_PIDM, ALUMNI.APRXREF.APRXREF_PIDM AS PIDM
FROM xref, ALUMNI.APRXREF
WHERE xref.APRXREF_XREF_PIDM = ALUMNI.APRXREF.APRXREF_XREF_PIDM (+)'''

xref = psql.frame_query(SQL, con=connection)
print("Found " + str(len(xref)) + " people who share the same cross-references.")

# Concatenate all results
print("Joining results.")
results = pd.concat([events, activities, employers, classes, degrees,
                     hs, cities, age, xref])

# Remove unneeded columns
results = results.drop(['ACTP_CODE', 'ACYR_CODE', 'AGE', 'DEGC_CODE',
                        'EMPR_NAME', 'EVNT_CRN', 'HIGH_SCHOOL_CODE',
                        'SPRADDR_CITY', 'APRXREF_XREF_PIDM'], axis=1)
print("Found " + str(len(results)) + " total results.")

## Count up the number of times each PIDM appears in the results. (A few steps here.)
# Add a column called 'count' and populate every row with a '1'
results['count'] = 1
# Group by PIDM and sum on 'count' to get 'hits per PIDM'.
results = results.join(results.groupby(results['PIDM'])['count'].sum(),
                       on=['PIDM'], rsuffix='_sum')
# De-dupe rows in order to get one result per PIDM.
results = results.drop_duplicates()
# Sort results in descending order by the number of hits.
results.sort(columns='count_sum', axis=0, ascending=False, inplace=True)

# The file is potentially huge, so prompt the user to limit it by hit count.
cutoff = raw_input('How many hits would you like to limit by? ')
# Keep rows with minimum number of hits.
results = results[results['count_sum'] >= int(cutoff)]
# Get rid of the 'count' column - not needed.
results = results.drop(['count'], axis=1)
print("Found " + str(len(results)) + " unique results, with at least " + cutoff + " hits.")

## Join names to results
print('Preparing results and saving.')
SQL = '''
SELECT
BANINST1.AA_ENTITY.PIDM,
BANINST1.AA_ENTITY.ID,
BANINST1.AA_ENTITY.NAME_PREFIX,
BANINST1.AA_ENTITY.FIRST_NAME,
BANINST1.AA_ENTITY.MI,
BANINST1.AA_ENTITY.LAST_NAME,
BANINST1.AA_ENTITY.NAME_SUFFIX,
BANINST1.AA_ENTITY.AGE,
BANINST1.AA_ENTITY.DEAD_IND,
BANINST1.AA_DONOR_TYPE_HIGHEST.DONR_CODE
FROM
BANINST1.AA_ENTITY,
BANINST1.AA_DONOR_TYPE_HIGHEST
WHERE
BANINST1.AA_ENTITY.PIDM = BANINST1.AA_DONOR_TYPE_HIGHEST.PIDM (+)
AND BANINST1.AA_ENTITY.CHANGE_IND IS NULL'''

results = pd.merge(results, psql.frame_query(SQL, con=connection),
                   how='left', on=['PIDM'], sort=False)
results.to_csv(target + ' - ' + name + '.csv')
print('All done!')

connection.close()
In my last post I described how to automate RFM scoring using the scripting language Python and pandas. Automating RFM scoring is a great way to learn a bit of Python — but on its own it’s a lousy reason for learning Python. It was pointed out to me that you don’t really need to call on the power of Python if you’re already comfortable with (or learning to use) SQL, and I agree.
Shortly after publishing the Python post, I got an email from Bruce Higgins, who works in the Alumni Affairs & Development office at Cornell University. His all-SQL solution is shown below. I leave it to readers to figure out how to make it work for them. In the wake of my post there was some great discussion and code-sharing on the Prospect-DMM discussion list, and I encourage you to look up that thread.
You’ll notice that Bruce has hard-coded the dollar values used for scoring “Monetary value” instead of dynamically dividing the file into quintiles. As he points out, if you’re scoring the same database over and over, it’s not likely these dollar thresholds are going to change much over time. You will of course need to make adjustments for your own donor data.
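If you did want dynamic quintiles rather than fixed dollar thresholds, the cut points are straightforward to compute. Here is a sketch using Python's standard-library statistics module, with made-up five-year giving totals standing in for the warehouse data:

```python
import statistics

# Hypothetical five-year giving totals; in practice these would come
# from a query like Bruce's "Five_years" CTE.
cash = [50, 120, 300, 450, 800, 1200, 2600, 5000, 9000, 20000]

# Cut the distribution into quintiles: four cut points separate the
# five buckets, replacing the hard-coded 500/1000/2500/10000 values.
cutoffs = statistics.quantiles(cash, n=5)

def monetary_score(amount):
    # Score 1-5 by how many cut points the donor's total meets or exceeds.
    return 1 + sum(amount >= c for c in cutoffs)

print(monetary_score(20000))  # 5 -- top quintile
print(monetary_score(50))     # 1 -- bottom quintile
```

The trade-off Bruce points out still applies: recomputing the cut points on every run means a donor's score can change even when their giving hasn't.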
WITH Five_years AS
(SELECT id,
        SUM(gift_amount) AS cash,
        MAX(DECODE(gift_fiscalyear, 2014, 5, 2013, 4, 2012, 3, 2011, 2, 2010, 1)) AS recency,
        COUNT(DISTINCT gift_fiscalyear) AS frequency
 FROM
 (SQL specifying the last five years cash transactions from our warehouse)
 GROUP BY id)
SELECT Five_years.id,
       CASE WHEN Five_years.cash >= 10000 THEN 5
            WHEN Five_years.cash >= 2500  THEN 4
            WHEN Five_years.cash >= 1000  THEN 3
            WHEN Five_years.cash >= 500   THEN 2
            ELSE 1
       END + Five_years.recency + Five_years.frequency AS rfm_score
FROM Five_years
ORDER BY Five_years.id