CoolData blog

5 December 2016

Amazing things with matching strings

Filed under: Coolness, Data integrity, SQL — kevinmacdonell @ 7:44 am

 

I had an occasion recently when it would have been really helpful to know that a new address added to the database was a duplicate of an older, inactivated address. The addition wasn’t identified as a duplicate because it wasn’t a perfect match — a difference similar to that between 13 Anywhere Road and 13 Anywhere Drive. 

 

After the fact, I did a Google search and discovered some easy-to-use functionality in Oracle SQL that might have saved us some trouble. Today I want to talk about how to use UTL_MATCH and  suggest some cool applications for it in Advancement data work.

 

“Fuzzy matching” is the term used for identifying pairs of character strings that may not be exactly the same, but are so close that they could be. For example, “Washignton” is one small typo away from “Washington,” but the equivalence is very difficult to detect by any means other than an alert pair of human eyes scanning a sorted list. When the variation occurs at the beginning of a string — “Unit 3, 13 Elm St.” instead of “Apmt 3, 13 Elm St.” — then even a sorted list is of no use.

 

According to this page, the UTL_MATCH package was introduced in Oracle 10g Release 2, but first documented and supported in Oracle 11g Release 2. The package includes two functions for testing the level of similarity or difference between strings.

 

The first function is called EDIT_DISTANCE, which counts the number of single-character “edits” (insertions, deletions, and substitutions) needed to get from one string to a second string. For example, the edit distance from “Kevin” to “Kelvin” is 1, from “New York” to “new york” is 2, and from “Hello” to “Hello” is 0. (A related function, EDIT_DISTANCE_SIMILARITY, expresses the distance as a normalized value between 0 and 100 — 100 being a perfect match.)

 

The second method, the one I’ve been experimenting with, is called JARO_WINKLER, named for an algorithm that measures the degree of similarity between two strings. The result ranges from 0 (no similarity) to 1 (perfect similarity). It was designed specifically for detecting duplicate records, and its formula seems aimed at the kind of character transpositions you’d expect to encounter in data entry errors. (More info here: Jaro-Winkler distance.)

 

Like EDIT_DISTANCE, it has a related function called JARO_WINKLER_SIMILARITY. Again, this ranges from 0 (no match) to 100 (perfect match). This is the function I will refer to for the rest of this post.
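To get a feel for how these functions behave, you can call them on literal strings. Here is a minimal sketch (Oracle 11g Release 2 or later), reusing the examples above; the column aliases are mine.

SELECT
  UTL_MATCH.edit_distance('Kevin', 'Kelvin') AS ed,                          -- returns 1
  UTL_MATCH.edit_distance_similarity('Kevin', 'Kelvin') AS ed_sim,           -- normalized, 0 to 100
  UTL_MATCH.jaro_winkler('Washignton', 'Washington') AS jw,                  -- ranges 0 to 1
  UTL_MATCH.jaro_winkler_similarity('Washignton', 'Washington') AS jw_sim    -- ranges 0 to 100
FROM dual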

 

Here is a simple example of UTL_MATCH in action. The following SQL scores constituents in your database according to how similar their first name is to their last name, with the results sorted in descending order by degree of similarity. (Obviously, you’ll need to replace “schema”, “persons,” and field names with the proper references from your own database.)

 

SELECT
  t1.ID,
  t1.first_name,
  t1.last_name,
  UTL_MATCH.jaro_winkler_similarity(t1.first_name, t1.last_name) AS jw
FROM schema.persons t1
ORDER BY jw DESC

 

Someone named “Donald MacDonald” would get a fairly high value for JW, while “Kevin MacDonell” would score much lower. “Thomas Thomas” would score a perfect 100.

 

Let’s turn to a more useful case: Finding potential duplicate persons in your database. This entails comparing a person’s full name with the full name of everyone else in the database. To do that, you’ll need a self-join.

 

In the example below, I join the “persons” table to itself. I concatenate first_name and last_name to make a single string for the purpose of matching. In the join conditions, I exclude records that have the same ID, and select records that are a close or perfect match (according to Jaro-Winkler). To do this, I set the similarity threshold at some arbitrarily high value, in this case greater than or equal to 98.

 

SELECT
  t1.ID,
  t1.first_name,
  t1.last_name,
  t2.ID,
  t2.first_name,
  t2.last_name,
  UTL_MATCH.jaro_winkler_similarity ( t1.first_name || ' ' || t1.last_name, t2.first_name || ' ' || t2.last_name ) AS jw
FROM schema.persons t1
INNER JOIN schema.persons t2
  ON t1.ID != t2.ID
  AND UTL_MATCH.jaro_winkler_similarity ( t1.first_name || ' ' || t1.last_name, t2.first_name || ' ' || t2.last_name ) >= 98
ORDER BY jw DESC

 

I would suggest reading this entire post before trying to implement the example above! UTL_MATCH presents some practical issues which limit what you can do. But before I share the bad news, here are some exciting possible Advancement-related applications:

 

  • Detecting duplicate records via address matching.
  • Matching external name lists against your database. (Which would require the external data be loaded into a temporary table in your data warehouse, perhaps.)
  • Screening current and incoming students against prospect, donor, and alumni records for likely matches (on address primarily, then perhaps also last name).
  • Data integrity audits. An example: If the postal code or ZIP is the same, but the city name is similar (but not perfectly similar), then there may be an error in the spelling or capitalization of the city name.
  • Searches on a particular name. If the user isn’t sure about spelling, this might be one way to get suggestions back that are similar to the guessed spelling.

 

Now back to reality … When you run the two code examples above, you will probably find that the first executes relatively quickly, while the second takes a very long time or fails to execute at all. That is because you’re evaluating each record in the database against every other record, which is effectively a cross-join, or Cartesian product — a very costly join that is rarely used. If you try to search for matches across 100,000 records, that’s 10 billion evaluations! The length of the strings themselves contributes to the complexity, and therefore the runtime, of each evaluation — but the real issue is the 10,000,000,000 operations.

 

As intriguing as UTL_MATCH is, then, its usage will cause performance issues. I am still in the early days of playing with this, but here are a few things I’ve learned about avoiding problems while using UTL_MATCH.

 

Limit matching records. Trying to compare the entire database with itself is going to get you in trouble. Limit the number of records retrieved for comparison. A query searching for duplicates might focus solely on the records that have been added or modified in the past day or two, for example. Even so, those few records have to be checked against all existing records, so it’s still a big job — consider not checking against records that are marked deceased, that are non-person entities, and so on. Anything to cut down on the number of evaluations the database has to perform.
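As a sketch of that idea, with every table and field name below (addresses, persons, date_modified, deceased_ind, record_type, street1) being a placeholder for your own schema, the new side of the comparison is restricted to addresses touched in the last two days, and the comparison pool excludes deceased and non-person records before UTL_MATCH is ever invoked:

WITH recent AS (
  SELECT a.id, a.street1
  FROM schema.addresses a
  WHERE a.date_modified >= SYSDATE - 2        -- only records added or changed in the last two days
  ),
pool AS (
  SELECT a.id, a.street1
  FROM schema.addresses a
  INNER JOIN schema.persons p ON p.id = a.id
  WHERE p.deceased_ind = 'N'                  -- skip deceased records
    AND p.record_type = 'P'                   -- persons only, not organizations
  )
SELECT
  recent.id, recent.street1, pool.id AS match_id, pool.street1 AS match_street1,
  UTL_MATCH.jaro_winkler_similarity(recent.street1, pool.street1) AS jw
FROM recent
INNER JOIN pool ON recent.id != pool.id
  AND UTL_MATCH.jaro_winkler_similarity(recent.street1, pool.street1) >= 90
ORDER BY jw DESC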

 

Keep strings short. Matching works best when working with short strings. Give some thought to what you really want to match on. When comparing address records, it might make sense to limit the comparison to Street Line 1 only, not an entire address string which could be quite lengthy.

 

Pre-screen for perfect matches. A Jaro-Winkler similarity of 100 means that two strings are exactly equal. I haven’t tested this, but I’m guessing that checking for A = B is a lot faster than calculating the JW similarity between A and B. It might make sense to have one query to audit for perfect matches (without the use of UTL_MATCH) and exclude those records from a second query that audits for JW similarities that are high but less than a perfect 100.
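A sketch of that two-pass approach, with the same placeholder names as before: the first query catches exact duplicates with plain equality, and the second applies Jaro-Winkler only to pairs that are not already exact matches.

-- Pass 1: exact duplicates, no UTL_MATCH required
SELECT t1.id, t2.id AS match_id, t1.street1
FROM schema.addresses t1
INNER JOIN schema.addresses t2
  ON t1.id != t2.id
  AND t1.street1 = t2.street1;

-- Pass 2: near matches only, with the exact matches from pass 1 excluded
SELECT t1.id, t2.id AS match_id,
  UTL_MATCH.jaro_winkler_similarity(t1.street1, t2.street1) AS jw
FROM schema.addresses t1
INNER JOIN schema.addresses t2
  ON t1.id != t2.id
  AND t1.street1 != t2.street1
  AND UTL_MATCH.jaro_winkler_similarity(t1.street1, t2.street1) >= 90;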

 

Pre-screen for impossible matches. If a given ID_1 has a street address that is 60 characters long and a given ID_2 has a street address that is only 20 characters long, there is no possibility of a high Jaro-Winkler score and therefore no need to calculate it. Find a way to limit the data set to match before invoking UTL_MATCH, possibly through the use of a WITH clause that limits potential matching pairs by excluding any that differ in length by more than, say, five characters. (Another “pre-match” to use would check whether the initial letter in a name is the same; if it isn’t, there’s a good chance it isn’t going to be a match.)
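Here is roughly what that might look like, again as a sketch with assumed names: the WITH clause builds candidate pairs using only cheap checks (a length difference of five characters or less, and the same initial character), and Jaro-Winkler is calculated only for the pairs that survive.

WITH pairs AS (
  SELECT t1.id AS id_1, t1.street1 AS street_1,
         t2.id AS id_2, t2.street1 AS street_2
  FROM schema.addresses t1
  INNER JOIN schema.addresses t2
    ON t1.id != t2.id
    AND ABS(LENGTH(t1.street1) - LENGTH(t2.street1)) <= 5       -- lengths must be close
    AND SUBSTR(t1.street1, 1, 1) = SUBSTR(t2.street1, 1, 1)     -- same first character
  )
SELECT
  pairs.id_1, pairs.street_1, pairs.id_2, pairs.street_2,
  UTL_MATCH.jaro_winkler_similarity(pairs.street_1, pairs.street_2) AS jw
FROM pairs
WHERE UTL_MATCH.jaro_winkler_similarity(pairs.street_1, pairs.street_2) >= 90
ORDER BY jw DESC

There is no guarantee the optimizer evaluates the cheap conditions before the expensive one, but structuring the query this way at least makes the intent explicit.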

 

Keep match queries simple. Don’t ask for fields other than ID and the fields you’re trying to match on. Yes, it does make sense to bring down birthdate and additional address information so that the user can decide if a probable match is a true duplicate or not, but keep that part of the query separate from the match itself. You can do this by putting the match in a WITH clause, and then left-joining additional data to the results of that clause.
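A rough sketch of that structure (names assumed as before): the WITH clause does nothing but the match, and birth dates for both IDs are left-joined to its results afterwards, so the expensive step stays as lean as possible.

WITH matches AS (
  SELECT t1.id AS id_1, t2.id AS id_2,
         UTL_MATCH.jaro_winkler_similarity(t1.street1, t2.street1) AS jw
  FROM schema.addresses t1
  INNER JOIN schema.addresses t2
    ON t1.id != t2.id
    AND UTL_MATCH.jaro_winkler_similarity(t1.street1, t2.street1) >= 90
  )
SELECT
  matches.id_1, p1.birth_date AS birth_date_1,
  matches.id_2, p2.birth_date AS birth_date_2,
  matches.jw
FROM matches
LEFT JOIN schema.persons p1 ON p1.id = matches.id_1    -- extra context joined after the match is done
LEFT JOIN schema.persons p2 ON p2.id = matches.id_2
ORDER BY matches.jw DESC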

 

Truth be told, I have not yet written a query that does something useful while still executing in a reasonable amount of time, simply due to the sheer number of comparisons being made. I haven’t given up on SQL, but it could be that duplicate detection is better accomplished via a purpose-built script running on a standalone computer that is not making demands on an overburdened database or warehouse (aside from the initial pull of raw data for analysis).

 

The best I’ve done so far is a query that selects address records that were recently modified and matches them against other records in the database. Before it applies Jaro-Winkler, the query severely limits the data by pairing up IDs that have name strings and address strings that are nearly the same number of characters long. The query has generated a few records to investigate and, if necessary, de-dupe — but it takes more than an hour and a half to run.

 

Have any additional tips for making use of UTL_MATCH? I’d love to hear and share. Email me at kevin.macdonell@gmail.com.

 


13 June 2016

Nifty SQL regression to calculate donors’ giving trends

Filed under: Coolness, Predictor variables, regression, SQL — kevinmacdonell @ 8:28 pm

 

Here’s a nifty bit of SQL that calculates a best-fit line through a donor’s years of cash-in giving by fiscal year (ignoring years with no giving), and classifies that donor in terms of how steeply they are “rising” or “falling”.

 

I’ll show you the sample code, which you will obviously have to modify for your own database, and then talk a little bit about how I tested it. (I know this works in Oracle version 11g. Not sure about earlier versions, or other database systems.)

 

with sums AS (
  select t1.id, t1.fiscal_year, log(10, sum(t1.amount)) AS yr_sum
  from gifts t1
  group by t1.id, t1.fiscal_year),

slopes AS (
  select distinct
    sums.id,
    regr_slope(sums.yr_sum, sums.fiscal_year) OVER (partition by sums.id) AS slope
  from sums
  )

select
  slopes.id,
  slopes.slope,
  CASE
    when slopes.slope is null then 'Null'
    when slopes.slope >= 0.1 then 'Steeply Rising'
    when slopes.slope >= 0.05 then 'Moderately Rising'
    when slopes.slope >= 0.01 then 'Slightly Rising'
    when slopes.slope > -0.01 then 'Flat'
    when slopes.slope > -0.05 then 'Slightly Falling'
    when slopes.slope > -0.1 then 'Moderately Falling'
    else 'Steeply Falling' end AS description
from slopes
That’s it. Not a lot of SQL, and it runs very quickly (for me). But does it actually tell us anything?

 

I devised a simple test. Adapting this query, I calculated the “slope of giving” for all donors over a five-year period in the past: FY 2007 to FY 2011. I wanted to see if this slope could predict whether, and by how much, a donor’s giving would rise or fall in the next five-year period: FY 2012 to FY 2016. (Note that the sum of a donor’s giving in each year is log-transformed, in order to better handle outlier donors with very large giving totals.)
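The adaptation is minor. A sketch of the restricted version, using the same placeholder “gifts” table as above, just adds a fiscal-year filter to the first WITH clause:

with sums AS (
  select t1.id, t1.fiscal_year, log(10, sum(t1.amount)) AS yr_sum
  from gifts t1
  where t1.fiscal_year between 2007 and 2011    -- first five-year period only
  group by t1.id, t1.fiscal_year),

slopes AS (
  select distinct
    sums.id,
    regr_slope(sums.yr_sum, sums.fiscal_year) OVER (partition by sums.id) AS slope
  from sums
  )

select slopes.id, slopes.slope
from slopes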

 

I assembled a data file with each donor’s sum of cash giving for the first five-year period, the slope of their giving in that period, and the sum of their cash giving for the five-year period after that.

 

The first test was to see how the categories of slope, from Steeply Rising to Steeply Falling, translated into subsequent rises and falls. In Data Desk, I compared the two five-year periods. If the second period’s giving was greater than the first, I called that a “rise.” If it was less, I called it a “fall.” And if it was exactly the same, I called it “Same.”

 

The table below summarizes the results. Note that these numbers are all percentages, summed horizontally. (I will explain the colour highlighting later on.)

 

contingency1

 

For Steeply Rising, 60.6% of donors actually FELL from the first period to the next. Only 37.8% rose, and just 1.6% stayed exactly the same. Not terribly impressive. Look at Steeply Falling, though: More than three-quarters actually did fall. That’s a better result, but then again, “Falling” dominates for every category; in the whole file, close to 70% of all donors reduced their giving in the next period. If a donor has no giving in the second period of five years, that’s zero dollars given, and this is called a “Fall” — more on that aspect in just a sec.

 

(I’ve left out donors with a FY2007-11 slope of Null — they’re the ones who gave in only one year and therefore don’t have a “slope”.)

 

Let’s not give up just yet, however. The colour highlighting indicates how high each percentage value is in relation to those above and below it. For example, the highest percentages in the Falling column are found in the Slightly, Moderately, and especially Steeply Falling slope categories. The highest percentages in the Rising column are in the Slightly, Moderately, and Steeply Rising slope categories. And in the Same column, the Flat slope wins hands-down — as we would hope.

 

So a rising slope “sort of” predicts increased giving, and a falling slope “sort of” predicts decreased giving. Unfortunately, many donors are not retained into the second five-year period, so there’s not a lot to be confident about.

 

But what if a donor IS retained? What if we exclude the lapsed donors entirely? Let’s do that:

 

contingency2

 

Excluding non-donors seems to lead to an improvement … The slope does a better job sorting between the risers and fallers when a donor is actually retained. Again, the colour highlighting is referencing columns, not rows. But notice now that, across the rows, Rising has a slight majority for the Rising slope categories, and Falling has a slight majority for the Falling slope categories. (The bar is set too high for Flat, however, given that a donor’s giving in the first five years has to be exactly equal to her giving in the second five years to be called Same.)

 

Admittedly, these majorities are not generous. If I calculated a donor’s slope of giving as Steeply Rising and that donor was retained, I have only a 56.4% chance of actually being right. And of course there’s no guarantee that donor won’t lapse.

 

(Note that these are donors of all types — alumni, non-alumni individuals, and entities such as corporations and foundations. Non-alumni donors tend not to show repeated patterns in their giving, at least not to the extent that alumni do. However, when I limit the data file to alumni donors only, the improvement in this method is only very slight.)

 

Pressing on … I did a regression analysis using total giving in the second five-year period as the dependent variable, then entered total giving in the prior five-year period as an independent variable. (Naturally, R-squared was very high.) This allowed me to see if Slope provides any explanatory power when it is added as the second independent variable — the effect of giving in the first five-year period already being accounted for.

 

And the answer is, yes, it does. But only under specific conditions: Both five-year giving totals were log-transformed and, most significantly, donors who did not give in the second period were excluded from the regression.

 

There are other ways to assess the usefulness of “slope” which might lead to an application, and I encourage you to give this a try with your own data. From past experience I know that donors who make big upgrades in giving don’t have any neat universal pattern such as an upward slope in their giving history. (The concept of volatility is explored here and here.) “Slope” is probably too simple a characteristic to employ on its own.

 

But as I’ve said before, if it were easy, obvious, or intuitive, it wouldn’t be data analysis.

 

30 May 2016

Donor volatility: testing years of non-giving as a predictor for the next big gift

Filed under: Annual Giving, Coolness — kevinmacdonell @ 5:02 am

Guest post by Jessica Kostuck, Data Analyst, Annual Giving, Queen’s University

 

During my first few weeks on the job, my AD set me up on several calls with colleagues in similar, data-driven roles, at universities across the country. One such call was with Kevin MacDonell, keeper of CoolData, with whom I had a delightfully geeked-out conversation about predictive modeling. We ran the gamut of weird and wonderful data points, ending on the concept of donor volatility.

 

When a lapsed high-end donor has no discernible annual giving pattern, is it possible to use their years of non-giving to predict and influence their next big gift?

 

Our goal for our Annual Giving program was to identify these “volatile” donors (lapsed high-end donors with an erratic giving history), and reactivate (ideally, upgrade) them, through a targeted solicitation with an aggressive ask string.

 

(For more on volatility, see Odd but true findings? Upgrading annual donors are “erratic” and “volatile”, which describes findings that suggest the best prospects for a big upgrade in giving are those who are “erratic”, i.e. have prior giving but are not loyal, every-year donors, and “volatile”, i.e. are inconsistent about the amounts they give.)

 

I did some stock market research (see footnote), decided on a minimum value for the entry point into our volatility matrix ($500), and, together with Senior Programmer Analyst Kim Wilkinson, got cracking on writing a program to identify volatile donors.

 

volatile sql clip

 

 

Our ideal volatile donors had given ≥ $500 at least twice in the last 10 years, without any consecutive (“stable”) periods. Year over year, our ideal volatile donor would act in one of three ways – increase their giving by at least 60%, decrease their giving by at least 60%, or not give at all. Given the capacity level displayed by these volatile donors, we replaced years of very low-end giving (<$99) with null values (“throwaway gifts”).

 

We had strict conditions for what would remove a donor from our table. If a donor had two years of consecutive giving within a ±60% differential from their previous highest giving point (v_value), we considered this a natural (or, at least, for this test, not sufficiently irregular) fluctuation in giving, and they were removed from the table. If the donor had two consecutive years of low-end (but not null) giving ($99-$499), this was considered a deliberate decrease, and they, too, were removed. Conversely, if a donor had two consecutive years of greatly increased giving, this was considered a deliberate increase, and they were also removed.
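The full program had more conditions than a blog post can show, but as a hedged sketch only (invented table and column names, and ignoring the v_value and removal logic described above), the core year-over-year test could be expressed with LAG, flagging any giving year that is at least 60% above or below the previous giving year:

WITH yearly AS (
  SELECT g.id, g.fiscal_year, SUM(g.amount) AS yr_sum
  FROM gifts g
  GROUP BY g.id, g.fiscal_year
  HAVING SUM(g.amount) >= 99                   -- drop years of throwaway giving
  )
SELECT
  y.id, y.fiscal_year, y.yr_sum,
  LAG(y.yr_sum) OVER (PARTITION BY y.id ORDER BY y.fiscal_year) AS prior_sum,
  CASE
    WHEN LAG(y.yr_sum) OVER (PARTITION BY y.id ORDER BY y.fiscal_year) IS NULL THEN 'No prior giving year'
    WHEN y.yr_sum >= 1.6 * LAG(y.yr_sum) OVER (PARTITION BY y.id ORDER BY y.fiscal_year) THEN 'Volatile increase'
    WHEN y.yr_sum <= 0.4 * LAG(y.yr_sum) OVER (PARTITION BY y.id ORDER BY y.fiscal_year) THEN 'Volatile decrease'
    ELSE 'Stable'
  END AS yoy_flag
FROM yearly y
ORDER BY y.id, y.fiscal_year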

 

At any point, a donor could be admitted, or readmitted into our volatility matrix, by establishing, or re-establishing, a v_value and subsequent valid volatility point.

 

The difference between a lapsed donor and a volatile donor

 

Below is a sample pool of donors we examined.

 

volatile donor history image

 

Donor 1 is volatile all the way through, with greatly varying levels of giving, culminating in two years of non-giving. Donor 1 is currently volatile, and thus enters our test group.

 

Donor 2 is volatile for two years – FY07-08 and FY08-09 (v_value of $5,000 in FY07-08, followed by a valid volatile point in FY08-09 with a decrease of -80%), but then is removed from the table in FY09-10 with only a -50% decrease in giving. They do not establish a new v_value, even though their FY09-10 giving meets the minimum giving threshold for this test, because of their consecutive, only marginally decreased giving in FY10-11. This excludes Donor 2 from our test.

 

Donor 3 enters our volatility matrix in FY04-05, leaves in FY07-08, reenters in FY10-11, and maintains volatility to current day, and, thus, enters into our test solicitation.

 

While all three of these donors are lapsed, and are all SYBUNTs, only Donor 1 and Donor 3 are, by our definition, volatile.

 

Solicitation strategy and results

 

We now had a pool of constituents who were at least two years lapsed in giving, who all had a history of inconsistent, but not unsubstantial, contributions to the university. In an email solicitation, we presented constituents with both upgrade language and an aggressive ask matrix, beginning at a minimum of +60% of their highest ever v_value, regardless of where they were in the ebb and flow of their volatility cycle. Again, the goal of this test was to (1) identify donors with high capacity (2) whose giving to the university was erratic in frequency and loyalty and (3) encourage these donors to reactivate at greater than their previously-established high-end giving.

 

In our results analysis, we broadened our examination to include any gifts received from our testing pool within the subsequent four weeks, not just gifts linked to this particular solicitation code, to verify the legitimacy of tagging these donors as volatile – that is, having a higher-than-average probability to reactivate at a high-end giving level.

 

An important part of our analysis included comparing our testing pool to a control pool, pairing each of our volatile donors with a non-volatile twin who shared as many points of fiscal and biographic information as was possible.

 

Within the four-week time frame, our test group had about a 7% activity rate, whereas our control group had an activity rate of about 5% (average for the institution during this timeframe). Within our volatility test group, 50% of donors gave an amount that would plot a valid point on our volatility matrix.

 

Conclusion and next steps

 

Through our experiment, we sought to identify volatile donors, and test if we could trigger a reactivation in giving, ideally at, or greater than, their highest level on record.

 

Since not all of the donors within our test group made their gifts to the coded solicitation with the volatile ask matrix, we cannot tell whether being presented with language and ask amounts that reflected their elusive giving behavior prompted a gift – volatile or otherwise. However, we do feel confident that we’re onto something when it comes to identifying and predicting the behavior of a particular, valuable set of donors to our institution.

 

Our above-average response rate (both versus the control group, and institution-wide) supports our “theory of volatility”, insofar as volatile donors are an existing pool with shared behaviors within our donor population. We plan to re-run this test again at the same time next year, continuing our search to find a pattern within the instability.

 

Were we able to gather definitive results that will define and shape future annual giving strategy? Not exactly. But as far as data goes, this was definitely cool.

 

Jessica Kostuck is the Data Analyst, Annual Giving at Queen’s University in Kingston, Ontario. She can be reached at jessica.kostuck@queensu.ca.

 

————-

1. Varadi, David. “Volatility Differentials: High/Low Volatility versus Close/Close Volatility (HVL-CCV).” CSS Analytics. 29 Mar. 2011. Web. Winter 2015.

30 July 2013

Getting bitten by Python

When I was first learning to build predictive models, preparing the data was part of the adventure. In time, though, many operations on the data became standard instead of exploratory. Eventually they became simply repetitive and tedious. When any task becomes repetitive, I think of ways to automate it. Given that data prep makes up 80 percent of the work of building a model (according to some authors), the benefits of automation are obvious.

I can think of only two ways to replicate the manual operations you need to perform on a large data set to make it ready for modelling: Use software specially designed for the task, or code your own data-handling scripts. I am lazy and drawn to software solutions that make hard things easy, and I’m not a programmer. Yet I have veered away from a ready-made software solution to pursue an interest in the scripting language called Python, and in particular the Python code library called pandas, written specifically for working with data.

Maybe it’s because Python is open-source and free, or because it is powerful, or because it is flexible and widely adaptable to multiple uses on the job. I don’t know. But for the past few months I’ve been obsessed with learning to use it, and that’s what I’d like to talk about today.

I’m guessing very few CoolData readers have experience writing scripts for handling data. I know some people who do most of their stats work in the R language or manipulate data in Excel using VBA. But the majority of readers probably consider themselves severely allergic to coding of any kind. I concede that it isn’t for everyone, but look: Just as we don’t need to be professional statisticians to use statistical tools to create value for the business, we don’t need to be computer scientists to write useful scripts that can free up large chunks of time we now spend on routine tasks that bore us.

(If you work with someone in IT or Advancement Services who pulls and reshapes your data for you, they might be especially interested in the idea of learning how to automate your requests. They might also be familiar with Python already.)

I should say here that my aim is not to automate predictive modelling itself. There are Python modules for modelling, too, from the venerable classics such as regression to the latest advanced techniques. But I’m not so much interested in them, not yet at least. Building predictive models is best done hands-on, guided by a human modeler’s expertise and domain knowledge. My main interest is in eliminating a big chunk of the standard rote work so that I can apply the freshest version of myself to the more interesting and creative elements of data exploration and model creation.

So what is Python (and more specifically, pandas) good for?

  • A script or program can execute a series of database queries and join the results in exactly the way you want, allowing you to build very complex structures and incorporate custom aggregations that might be harder to do using your existing querying/reporting tools. For example, let’s say you want to build a file of donors and include columns for date of first and last gift, amount of highest gift, total cash gifts for the past five fiscal years, and percentage of total giving devoted to student financial assistance. Unless IT has built some advanced views for you from the base tables in your database, many of these variables will require applying some calculations to the raw transactional data. I could certainly build a query to get the results for this modest example, but it would involve a few sub-queries and calculated fields. Multiply that by a hundred and you’ve got an idea of how complex a query you’d have to build to deliver a modelling-ready data set. In fact it may be technically impossible, or at least difficult, to build such a single massive query. In Python, however, you can build your data file up in an orderly series of steps. Adding, removing or editing those steps is not a big deal.
  • Python also makes it simple to read data from .csv and Excel files, and merge it painlessly with the data you’ve extracted from your database. This is important to me because not all of my modelling data comes from our database. I’ve got eight years of call centre data results by alumni ID, wealth-related census data by Canadian postal code, capacity data by American ZIP code, and other standalone data sets. Adding these variables to my file used to be a tedious, manual process. In Python, left-joining 20 columns of census data to a file of 100,000 alumni records using Postal Code as the join key takes a single line of code and executes faster than a knight can say “Ni!” (Inside Python joke.)
  • Many other common operations also take only one or two lines of code, including conversion of categorical variables to 0/1 dummy variables, performing transformations and mathematical operations on variables, filling in or imputing missing data with constants or calculated values, pivoting data, and creating new variables from existing ones via concatenation (for strings) or math (for numbers).
  • With a script, you can also iterate over the rows of a data file and perform different operations based on conditional statements.

I’m not going to provide a Python tutorial today (although I’m tempted to do so in the future), but here is a sample line of code from a script, with a description of what it does. This doesn’t give you enough information to do anything useful, but you’ll at least see how compact and powerful the language is.

Skipping some necessary preliminaries, let’s say you’ve just used Python to query your Oracle database to read into memory a data set containing the variables ID, Constituent Category, Sex, and Age for all living constituent persons. (An operation that itself takes little more than two or three lines of code.) Obviously it depends on your database and code structure, but let’s say “Constituent Category” includes codes for such categories as Alumnus/na (ALUM), Non-degreed alumni (ALND), Parent (PRNT), Friend (FRND), Faculty (FCTY), Staff (STAF), and so on. And let’s further assume that a constituent can belong to multiple categories. Most people will have only one code, but it’s possible that a person can simultaneously be an alum, a parent, and a faculty member.

In our script, the data is read into a structure called a DataFrame (a tool provided by the pandas code library). This should sound familiar to users of R in particular. For the rest of us, a DataFrame is very much like a database table, with named columns and numbered (“indexed”) rows. Had we pasted the data into Excel instead, it might look like this:

pivot1

Right away we see that William and Janet are represented by multiple rows because they have multiple constituent codes. This won’t work for predictive modelling, which requires that we have just one row per individual – otherwise certain individuals would carry more weight in the model than they should. You could say that multiple records for Janet means that 60-year-old females are over-represented in the data. We could delete the extra rows, but we don’t want to do that because we’d be throwing away important information that is almost certainly informative of our modelling target, eg. likelihood to make a donation.

In order to keep this information while avoiding duplicate IDs, we need to pivot the data so that each category of Constituent Code (ALUM, PRNT, etc.) becomes its own column. The result we want would look like this in Excel:

pivot2

The Con_Code column is gone, and replaced with a series of columns, each a former category of Con_Code. In each column is either a 0 or 1, a “dummy variable” indicating whether an individual belongs to that constituency or not.

Getting the data from the first state to the final state requires just three lines of code in Python/pandas:

df = pd.merge(df, pd.crosstab(df.ID, df.Con_Code), how='left', left_on='ID', right_index=True)

df = df.drop(['Con_Code'], axis=1)

df = df.drop_duplicates()

This snippet of code may look invitingly simple or simply terrifying – it depends on your background. Whatever – it doesn’t matter, because my point is only that these three lines are very short, requiring very little typing, yet they elegantly handle a common data prep task that I have spent many hours performing manually.

Here’s a brief summary of what each line does:

Line 1: There’s a lot going on here … First, “df” is just the name of the DataFrame object. I could have called it anything. On the right-hand side, you see “pd” (which is shorthand for pandas, the module of code that is doing the work), then “crosstab,” (a function that performs the actual pivot). In the parentheses after pd.crosstab, we have specified the two columns to use in the pivot: df.ID is the data we want for the rows, and df.Con_Code is the column of categories that we want to expand into as many columns as there are categories. You don’t have to know in advance how many categories exist in your data, or what they are – Python just does it.

Pd.crosstab creates a new table containing only ID and all the new columns. That entity (or “object”) is just sitting out there, invisible, in your computer’s memory. We need to join it back to our original data set so that it is reunited with Age, Sex and whatever other stuff you’ve got. That’s what “pd.merge” does. Again, “pd” is just referencing the pandas module that is providing the “merge” function. The operation is called “merge,” but it’s much the same thing as an SQL-type join, familiar to anyone who queries a database. The merge takes two inputs, our original DataFrame (“df”), and the result from the crosstab operation that I described above. The argument called “how” specifies that we want to perform the equivalent of a left-join. A couple of other optional arguments explicitly tell Python which column to use as a join key (‘ID’).

The crosstab operation is enclosed within the merge operation. I could have separated these into multiple lines, which would have been less confusing, but my point is not to teach Python but to demonstrate how much you can accomplish with a trivial amount of typing. (Or copying-and-pasting, which works too!)

We’re not quite done, though. Our merged data is still full of duplicate IDs, because the Con_Code column is still present in our original data.

Line 2 deletes (“drops”) the entire column named Con_Code, and reassigns the altered DataFrame to “df” – essentially, replacing the original df with the new df created by the drop operation.

Now that Con_Code is gone, the “extra” rows are not just duplicates by ID, they are perfect duplicates across the entire row – there is nothing left to make two rows with the same ID unique. We are ready for the final step …

Line 3 deletes (or “drops”) every row that is a duplicate of a previous row.

Having accomplished this, another couple of lines near the end of the script (not shown) will write the data row by row into a new .csv file, which you can then import into your stats package of choice. If you had two dozen different constituent codes in your data, your new file will be wider by two dozen columns … all in the blink of an eye, without any need for Excel or any manual manipulation of the data.

Excel is perfectly capable of pivoting data like we see in the example, but for working with very large data sets and seamlessly merging the pivoted data back into the larger data file, I can’t think of a better tool than Python/pandas. As the data set gets bigger and bigger, the more need there is to stop working with it in tools that go to the extra work of DISPLAYING it. I suppose one of the beauties of Excel is that you can see the data as you are working on it. In fact, as I slowly built up my script, I repeatedly opened the .csv file in Excel just to have that visual inspection of the data to see that I was doing the right thing. But I inevitably reached the point at which the file was just too large for Excel to function smoothly. At 120,000 rows and 185 columns in a 90MB file, it was hardly Big Data – Excel could open the file no problem – but it was large enough that I wouldn’t want to do much filtering or messing with formulas.

On a quick first read, the code in the example above may seem impenetrable to a non-programmer (like me), but you don’t need to memorize a lot of functions and methods to write scripts in Python. Combing the Web for examples of what you want to do, using a lot of cut-and-paste, perhaps referring to a good book now and again – that’s all it takes, really.

That said, it does require time and patience. It took me many hours to cobble together my first script. I re-ran it a hundred times before I tracked down all the errors I made. I think it was worth it, though – every working piece of code is a step in the direction of saving untold hours. A script that works for one task often does not require much modification to work for another. (This cartoon says it all: Geeks and repetitive tasks.)

Beyond data preparation for predictive modelling, there are a number of directions I would like to go with Python, some of which I’ve made progress on already:

  • Merging data from multiple sources into data extract files for use in Tableau … With version 8.0 of the software comes the new Tableau API for building .tde files in Python. This was actually my first experiment with Python scripting. Using the TDE module and a combination of database queries and pandas DataFrames, you can achieve a high degree of automation for refreshing the most complex data sets behind your views and dashboards.
  • Exploring other modelling techniques besides my regular mainstay (regression) … I’ve long been intrigued by stuff such as neural networks, Random Forest, and so on, but I’ve been held back by a lack of time as well as some doubt that these techniques offer a significant improvement over what I’m doing now. Python gives ready access to many of these methods, allowing me to indulge my casual interest without investing a great deal of time. I am not a fan of the idea of automated modelling – the analyst should grasp what is going on in that black box. But I don’t see any harm in some quick-and-dirty experimentation, which could lead to solutions for problems I’m not even thinking of yet.
  • Taking advantage of APIs …. I’d like to try tapping into whatever social networking sites offer in the way of interfaces, and also programmatically access web services such as geocoding via Google.
  • Working with data sets that are too large for high-level applications such as Excel … I recently tried playing with two days’ worth of downloaded geocoded Twitter data. That’s MILLIONS of rows. You aren’t going to be using Excel for that.

I hope I’ve been able to transfer to you some of my enthusiasm for the power and potential of Python. I guess now you’ll be wondering how to get started. That’s not an easy question to answer. I could tell you how to download and install Python and an IDE (an integrated development environment, a user interface in which you can write, run, and debug your scripts), but beyond that, so much depends on what you want to do. Python has been extended in a great many directions – pandas for data analysis being just one of them.

However, it wouldn’t hurt to get a feel for how “core Python” works – that is, the central code base of the language along with its data types, object types, and basic operations such as “for” loops. Even before you bother installing anything, go to Codecademy.com and try a couple of the simple tutorials there.

For specific questions Google is your friend, but if you want a reference that covers all the basics in more or less plain English, I like “Learning Python” (4th Edition, but I see there’s a 5th Edition now) by Mark Lutz, published by O’Reilly. Another O’Reilly book, “Python for Data Analysis,” by Wes McKinney, describes how to crunch data with pandas and other related code libraries. (McKinney is the main author of the pandas library.)

I think readers new to programming (like me) will feel some frustration while learning to write their first scripts using any one book or resource. The Lutz book might seem too fine-grained in its survey of the basics for some readers, and McKinney is somewhat terse when offering examples of how various methods work. The problem is not with the books themselves – they’re wonderful. Consider that Python is used in web interfaces, robotics, database programming, gaming, financial markets, GIS, scientific programming, and probably every academic discipline that uses data – you must understand that core texts are perforce very general and abstract. (Think of grammar books for spoken languages.) It’s up to coders themselves to combine the basic building blocks in creative and powerful ways.

That said, after many, many hours spent hopping back and forth between these books, plus online tutorials and Python discussion forums – and just messing around on my own – I have figured out a few useful ways to accomplish some of the more common data preparation tasks that are specific to predictive modelling. Someday I would be happy to share – and, as always, to learn from the experience of others.

2 May 2013

New twists on inferring age from first name

Filed under: Analytics, Coolness, Data Desk, Fun — kevinmacdonell @ 6:14 am

Not quite three years ago I blogged about a technique for estimating the age of your database constituents when you don’t have any relevant data such as birth date or class year. It was based on the idea that many first names are typically “young” or “old.” I expanded on the topic in a followup post: Putting an age-guessing trick to the test. Until now, I’ve never had a reason to guess someone’s age — alumni data is pretty well supplied in that department. This very month, though, I have not one but two major modeling projects to work on that involve constituents with very little age data present. I’ve worked out a few improvements to the technique which I will share today.

First, here’s the gist of the basic idea. Picture two women, named Freda and Katelyn. Do you imagine one of them as older than the other? I’m guessing you do. From your own experience, you know that a lot of young women and girls are named Katelyn, and that few if any older women are. Even if you aren’t sure about Freda, you would probably guess she’s older. If you plug these names into babynamewizard.com, you’ll see that Freda was a very popular baby name in the early 1900s, but fell out of the Top 1000 list sometime in the 1980s. On the other hand, Katelyn didn’t enter the Top 1000 until the 1970s and is still popular.

To make use of this information you need to turn it into data. You need to acquire a lot of data on the frequency of first names and how young or old they tend to be. If you work for a university or other school, you’re probably in luck: You might have a lot of birth dates for your alumni or, failing that, you have class years which in most cases will be a good proxy for age. This will be the source you’ll use for guessing the age of everyone else in your database — friends, parents and other person constituents — who don’t have ages. If you have a donor database that contains no age data, you might be able to source age-by-first name data somewhere else.

Back to Freda and Katelyn … when I query our database I find that the average age of constituents named Freda is 69, while the average age for Katelyn is 25. For the purpose of building a model, for anyone named Freda without an age, I will just assume she is 69, and for anyone named Katelyn, 25. It’s as simple as creating a table with two columns (First name and Average age), and matching this to your data file via First Name. My table has more than 13,500 unique first names. Some of these are single initials, and not every person goes by their first name, but that doesn’t necessarily invalidate the average age associated with them.
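If your constituent data is reachable with SQL, a hedged sketch of building that lookup table might look like this (table and field names are placeholders; your source of ages or class years will differ):

SELECT
  UPPER(p.first_name) AS first_name,
  ROUND(AVG(p.age)) AS avg_age,
  COUNT(*) AS n                        -- how many records the average is based on
FROM schema.persons p
WHERE p.age IS NOT NULL
  AND p.first_name IS NOT NULL
GROUP BY UPPER(p.first_name)
ORDER BY first_name

Keeping the count alongside the average lets you decide whether a given name is common enough to trust.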

I’ve tested this method, and it’s an improvement over plugging missing values with an all-database average or median age. For a data set that has no age data at all, it should provide new information that wasn’t there before — information that is probably correlated with behaviours such as giving.

Now here’s a new wrinkle.

In my first post on this subject, I noted that some of the youngest names in our database are “gender flips.” Some of the more recent popular names used to be associated with the opposite gender decades ago. This seems to be most prevalent with young female names: Ainslie, Isadore, Sydney, Shelly, Brooke. It’s harder to find examples going in the other direction, but there are a few, some of them perhaps having to do with differences in ethnic origin: Kori, Dian, Karen, Shaune, Mina, Marian. In my data I have close to 600 first names that belong to members of both sexes. When I calculate average age by First Name separately for each sex, some names end up with the exact same age for male and female. These names have an androgynous quality to them: Lyndsay, Riley, Jayme, Jesse, Jody. At the other extreme are the names that have definitely flipped gender, which I’ve already given examples of … one of the largest differences being for Ainslie. The average male named Ainslie is 54 years older than the average female of the same name. (In my data, that is.)

These differences suggest an improvement to our age-inferring method: Matching on not just First Name, but Sex as well. Although only 600 of my names are double-gendered, they include many popular names, so that they actually represent almost one-quarter of all constituents.

Now here’s another wrinkle.

When we’re dealing with constituents who aren’t alumni, we may be missing certain personal information such as Sex. If we plan to match on Sex as well as First Name, we’ve got a problem. If Name Prefix is present, we can infer Sex from whether it’s Mr., Ms., etc., but unless the person doing the data entry was having an off day, this shouldn’t be an avenue available to us — it should already be filled in. (If you know it’s “Mrs.,” then why not put in F for Sex?) For those records without a Sex recorded (or with a Sex of ‘N’), we need to make a guess. To do so, we return to our First Names query and the Sex data we do have.

In my list of 600 first names that are double-gendered, not many are actually androgynous. We have females named John and Peter, and we have males named Mary and Laura, but we all know that given any one person named John, chances are we’re talking about a male person. Mary is probably female. These may be coding errors or they may be genuine, but in any case we can use majority usage to help us decide. We’ll sometimes get it wrong — there are indeed boys named Sue — but if you have 7,000 Johns in your database and only five of them are female, then let’s assume (just for the convenience of data mining*) that all Johns are male.

So: Query your database to retrieve every first name that has a Sex code, and count up the instances of each. The default sex for each first name is decided by the highest count, male or female. To get a single variable for this, I subtract the number of females from the number of males for each first name. Since the result is positive for males and negative for females, I call it a “Maleness Score” — but you can do the reverse and call it a Femaleness Score if you wish! Results of zero are considered ties, or ‘N’.
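My own version was built in our querying tool, but expressed as a SQL sketch (names assumed, as before), the count-and-subtract step might look like this:

SELECT
  UPPER(p.first_name) AS first_name,
  SUM(CASE WHEN p.sex = 'M' THEN 1 ELSE 0 END)
    - SUM(CASE WHEN p.sex = 'F' THEN 1 ELSE 0 END) AS maleness_score   -- positive = male, negative = female, 0 = tie
FROM schema.persons p
WHERE p.first_name IS NOT NULL
  AND p.sex IN ('M', 'F')
GROUP BY UPPER(p.first_name)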

At this point we’ve introduced a bit of circularity. For any person missing Age and Sex, first we have to guess their sex based on the majority code assigned to that person’s first name, and then go back to the same data to grab the Age that matches up with Name and Sex. Clearly we are going to get it very wrong for a lot of records. You can’t expect these guesses to hold up as well as true age data. Overall, though, there should be some signal in all that noise … if your model believes that “Edgar” is male and 72 years of age, and that “Brittany” is female and 26, well, that’s not unreasonable and it’s probably not far from the truth.

How do we put this all together? I build my models in Data Desk, so I need to get all these elements into my data file as individual variables. You can do this any way that works for you, but I use our database querying software (Hyperion Brio). I import the data into Brio as locally-saved tab-delimited files and join them up as you see below. The left table is my modeling data (or at least the part of it that holds First Name), and the two tables on the right hold the name-specific ages and sexes from all the database records that have this information available. I left-join each of these tables on the First Name field.

age_tables

When I process the query, I get one row per ID with the fields from the left-hand table, plus the fields I need from the two tables on the right: the so-called Maleness Score, Female Avg Age by FName, Male Avg Age by Fname, and N Avg Age by Fname. I can now paste these as new variables into Data Desk. I still have work to do, though: I do have a small amount of “real” age data that I don’t want to overwrite, and not every First Name has a match in the alumni database. I have to figure out what I have, what I don’t have, and what I’m going to do to get a real or estimated age plugged in for every single record. I write an expression called Age Estimated to choose an age based on a hierarchical set of IF statements. The text of my expression is below — I will explain it in plain English following the expression.

if len('AGE')>0 then 'AGE'

else if textof('SEX')="M" and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="M" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="M" and len('F avg age by Fname')>0 then 'F avg age by Fname'

else if textof('SEX')="F" and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="F" and len('N avg age by Fname')>0 then 'N avg age by Fname'
else if textof('SEX')="F" and len('M avg age by Fname')>0 then 'M avg age by Fname'

else if textof('SEX')="N" and 'Maleness score'>0 and len('M avg age by Fname')>0 then 'M avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'<0 and len('F avg age by Fname')>0 then 'F avg age by Fname'
else if textof('SEX')="N" and 'Maleness score'=0 and len('N avg age by Fname')>0 then 'N avg age by Fname'

else if len('N avg age by Fname')>0 then 'N avg age by Fname'
else if len('F avg age by Fname')>0 then 'F avg age by Fname'
else if len('M avg age by Fname')>0 then 'M avg age by Fname'

else 49

Okay … here’s what the expression actually does, going block by block through the statements:

  1. If Age is already present, then use that — done.
  2. Otherwise, if Sex is male, and the average male age is available, then use that. If there’s no average male age, then use the ‘N’ age, and if that’s not available, use the female average age … we can hope it’s better than no age at all.
  3. Otherwise if Sex is female, and the average female age is available, then use that. Again, go with any other age that’s available.
  4. Otherwise if Sex is ‘N’, and the Fname is likely male (according to the so-called Maleness Score), then use the male average age, if it’s available. Or if the first name is probably female, use the female average age. Or if the name is tied male-female, use the ‘N’ average age.
  5. Otherwise, as it appears we don’t have anything much to go on, just use any available average age associated with that first name: ‘N’, female, or male.
  6. And finally, if all else fails (which it does for about 6% of my file, or 7,000 records), just plug in the average age of every constituent in the database who has an age, which in our case is 49. This number will vary depending on the composition of your actual data file — if it’s all Parents, for example, then calculate the average of Parents’ known ages, excluding other constituent types.

When I bin the cases into 20 roughly equal groups by Estimated Age, I see that the percentage of cases that have some giving history starts very low (about 3 percent for the youngest group), rises rapidly to more than 10 percent, and then gradually rises to almost 18 percent for the oldest group. That’s heading in the right direction at least. As well, being in the oldest 5% is also very highly correlated with Lifetime Giving, which is what we would expect from a donor data set containing true ages.

est_age_vingt

This is a bit of work, and probably the gain will be marginal a lot of the time. Data on real interactions that showed evidence of engagement would be superior to age-guessing, but when data is scarce a bit of added lift can’t hurt. If you’re concerned about introducing too much noise, then build models with and without Estimated Age, and evaluate them against each other. If your software offers multiple imputation for missing data as a feature, try checking that out … what I’m doing here is just a very manual form of multiple imputation — calculating plausible values for missing data based on the values of other variables. Be careful, though: A good predictor of Age happens to be Lifetime Giving, and if your aim is to predict Giving, I should think there’s a risk your model will suffer from feedback.

* One final note …

Earlier on I mentioned assuming someone is male or female “just for the convenience of data mining.”  In our databases (and in a conventional, everyday sense too), we group people in various ways — sex, race, creed. But these categories are truly imperfect summaries of reality. (Some more imperfect than others!) A lot of human diversity is not captured in data, including things we formerly thought of as clear-cut. Sex seems conveniently binary, but in reality it is multi-category, or maybe it’s a continuous variable. (Or maybe it’s too complex for a single variable.) In real life I don’t assume that when someone in the Registrar’s Office enters ‘N’ for Sex that the student’s data is merely missing. Because the N category is still such a small slice of the population I might treat it as missing, or reapportion it to either Male or Female as I do here. But that’s strictly for predictive modeling. It’s not a statement about transgendered or differently gendered people nor an opinion about where they “belong.”

13 December 2011

Finding connections to your major gift prospects in your data

Guest post by Erich Preisendorfer, Associate Director, Business Intelligence, Advancement Services, University of New Hampshire

(Thanks to Erich for this guest post, which touches on something a lot of prospect researchers are interested in: mapping relationships to prospects in their database. Actually, this work is more exciting than that, because it actually helps people find connections they may not have known about, via database queries and a simple scoring system. Is your Advancement Services department working on something like this? Why not ask them? — Kevin.)

Data miners often have an objective of exploring sets of data to determine meaningful patterns which can then be modeled for predictive patterning, hopefully to help meet their organization’s end goal(s).  However, there may be a time when the end behavior is not inherent in your database. Such a situation recently came up for my Advancement organization.

Our prospecting team recently started a program wrapped around peer recommendations: A prospect recommends new suspects to us based on the prospect’s interactions with the suspects. The question then became, what can we provide to the prospect to help get them thinking about potential suspects?

We currently do not have any type of data which would allow us to say, “Yes, this is what a relationship looks like,” outside of family relationships. We had to find a different way to identify potential acquaintances. I looked back at my own relationships to determine how I know the people I know. My friends and acquaintances largely come from some basic areas: school, work, places I’ve gone, etc.

Transforming my experience with relationships into what we have for useable data, I saw three key areas where relationships may exist: work history, education history, and extracurricular activities including one-time events. Fortunately, I was able to pinpoint our constituents’ time in each of the above areas to help isolate meaningful, shared experiences amongst constituents. Our work and extracurricular history includes to/from dates, and we have loads of educational history data that includes specific dates. Using this data, I am able to come up with potential relationships from a single prospect.

Prospect Profile (generated by entering a single prospect’s ID):

  • John Adams
  • Widget Factory, Employee 01/05/1971 – 06/16/1996
  • Student Activities: Football, Student Senate 09/1965-05/1966
  • Bachelor of Arts, Botany 1966

Potential Relationships (each item below is a separate query, using the Prospect Profile results):

  • Those employed by the Widget Factory who started before John ended, and ended after John began.
  • Those students who participated in Football and had a class year within +/-3 years of John.
  • Those students in Student Senate at the same time as John, similar to the Widget Factory example.
  • Those students who were in the same class year as John.
  • Those students who share John’s major.

Currently, since I have no way of proving the value of one point of contact over the other, each row returned in the potential relationships earns the constituent one point. Since my database stores historical records, I may get more than one row per constituent in any one category if they match more than one of John’s associated records – say they participated in Student Senate and played Football. This is great, because I want to give those particular constituents two points since they have more than one touch point in common with John.
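Purely as an illustration (every table, column, and bind-variable name below is invented, not our actual schema), two of the potential-relationship queries and the one-point-per-row scoring could be stitched together roughly like this:

WITH shared_employer AS (
  -- constituents whose employment at the same employer overlapped the prospect's dates
  SELECT e2.constituent_id
  FROM employment e1
  INNER JOIN employment e2
    ON e1.employer = e2.employer
    AND e2.start_date <= e1.end_date
    AND e2.end_date >= e1.start_date
  WHERE e1.constituent_id = :prospect_id
    AND e2.constituent_id != :prospect_id
  ),
shared_activity AS (
  -- constituents who shared a student activity with the prospect at the same time
  SELECT a2.constituent_id
  FROM activities a1
  INNER JOIN activities a2
    ON a1.activity_code = a2.activity_code
    AND a2.start_date <= a1.end_date
    AND a2.end_date >= a1.start_date
  WHERE a1.constituent_id = :prospect_id
    AND a2.constituent_id != :prospect_id
  )
SELECT constituent_id, COUNT(*) AS points    -- one point per shared touch point
FROM (
  SELECT constituent_id FROM shared_employer
  UNION ALL
  SELECT constituent_id FROM shared_activity
  )
GROUP BY constituent_id
ORDER BY points DESC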

I end up with a ranked list of constituents who share potential relationship contacts with my main prospect. The relationship lists provide our prospect researchers a starting point in putting together a solid list of high capacity constituents a single person may have some sort of relationship with, thus a greater insight into potential giving.

As of now, the report is in its infancy but looks to have high potential. As we grow the concept, there are multiple data points where further exploration could result in a higher level of functioning. As prospects use the lists to identify people they know, we can then deconstruct those choices to determine what is more likely a relationship. Should shared employment be ranked higher than shared class year? Should Football rank higher than Student Senate? I would guess yes, but I currently do not have supporting data to make that decision.

Another interesting concept, raised at the recent DRIVE 2011 conference, would be: “How are these two prospects potentially related by a third constituent?”  The result could mean the difference between two separate, forced conversations and one single conversation with three prospects shared over nostalgic conversations, drinks and, hopefully, money in the door!

Erich Preisendorfer is Associate Director, Business Intelligence, working in Advancement Services at the University of New Hampshire.
