CoolData blog

28 June 2015

Data mining in the archives

Filed under: Data, Predictor variables — kevinmacdonell @ 6:24 pm

 

When I was a student, I worked in a university archives to earn a little money. I spent many hours penciling consecutive index numbers onto acid-free paper folders, on the ultra-quiet top floor of the library. It was as dull a job as one can imagine.

 

Today’s post is not about that kind of archive. I’m talking about database archive views, also called snapshots. They’re useful for reporting and business intelligence, but they can also play a role in predictive modelling.

 

What is an archive view?

 

Think of a basic stat such as “number of living alumni”. This number changes constantly as new alumni join the fold and others are identified as deceased. A straightforward query will tell you how many living alumni there are, but that number will be out of date tomorrow. What if someone asks you how many living alumni you had a year ago? Then it’s necessary to take grad dates and death dates into account in order to generate an estimate. Or, you look the number up in previously-reported statistics.

 

A database archive view makes such reporting relatively easy by preserving the exact status of a record at regular points in time. The ideal archive is a materialized view in a data warehouse. On a given schedule (yearly, quarterly, or even monthly), an automated process adds fresh rows to an archive table that keeps getting longer and longer. You’re likely reliant on central IT services to set it up.

 

“Number of living alumni” is an important denominator for such key ratios as the percentage of alumni for whom you have contact information (mail, phone, email) and participation rates (the proportion of alumni who give). Every gift is entered as an individual transaction record with a specific date, which enables reporting on historical giving activity. This tends not to be true of contact information. Even though mailing addresses may be added one after another, without overwriting older addresses, the key piece of information is whether the address is coded ‘valid’ or ‘invalid’. This changes all the time, and your database may not preserve a history of those changes. Contact information records may have “To” and “From” dates associated with them, but your query will need to do a lot of relative-date calculations to determine if someone was both alive and had a valid address for any given point in time in the past.
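To give a sense of the relative-date logic involved, here is a rough pandas sketch of a point-in-time check. It is purely illustrative, not code from our own process, and the file and column names (id, valid_from, valid_to, death_date) are made up:

import pandas as pd

# Hypothetical extracts: one row per address with its validity window,
# and one row per constituent (id) with a death date (null if living)
addresses = pd.read_csv("address_history.csv", parse_dates=["valid_from", "valid_to"])
constituents = pd.read_csv("constituents.csv", parse_dates=["death_date"])

as_of = pd.Timestamp("2014-06-30")

# Alive as of the date: no death date, or a death date after it
alive = constituents[
    constituents["death_date"].isna() | (constituents["death_date"] > as_of)
]

# Address valid as of the date: started on or before it, and not yet ended
valid = addresses[
    (addresses["valid_from"] <= as_of)
    & (addresses["valid_to"].isna() | (addresses["valid_to"] >= as_of))
]

addressable = alive.merge(valid[["id"]].drop_duplicates(), on="id", how="inner")
print(len(addressable), "of", len(alive), "living alumni addressable as of", as_of.date())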

 

An archive table obviates the need for this complex logic, and ours looks like the example below. There’s the unique ID of each individual, the archive date, and a series of binary indicator variables — ‘1’ for “yes, this data is present” or ‘0’ for “this data is absent”.

 

[Image: example archive table showing ID, Archive Date, and 0/1 indicator columns]

 

Here we see three individuals and how their data has changed over three months in 2015. This is sorted by ID rather than by the order in which the records were added to the archive, so that you can see the journey each person has taken in that time:

 

  • A00001 had no valid email in the database in February and March, but we obtained it in time for the April 1 snapshot.
  • A00002 had no contact information at all until just before March, when a phone append supplied us with a new number. The number proved to be invalid, however, and when we coded it as such in the database, the indicator reverted to zero.
  • A00003 appeared in our data in February and March, but that person was coded deceased in the database before April 1, and was excluded from the April snapshot.

 

That last bullet point is important. Once someone has died, continuing to include them as a row in the archive every month would be a waste of resources. In your reporting software, a simple count of records by archive date will give you the number of living alumni. A simple count of ‘Address Indicator’ will give you the number of alumni with valid addresses. Dividing the number of valid addresses by the number of living alumni (and multiplying by 100) will give you the percentage of living alumni that are addressable for that month. (Reporting software such as Tableau will make very quick work of all this.)
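In pandas terms (if, like me, you sometimes bypass the reporting tool), the calculation might look something like the sketch below. The column names id, archive_date and address_ind are stand-ins for whatever your archive actually uses:

import pandas as pd

# Hypothetical extract of the archive: one row per living constituent per snapshot
archive = pd.read_csv("archive_snapshots.csv", parse_dates=["archive_date"])

monthly = archive.groupby("archive_date").agg(
    living_alumni=("id", "count"),           # every row is a living constituent
    valid_addresses=("address_ind", "sum"),  # 1 = valid address at snapshot time
)
monthly["pct_addressable"] = 100 * monthly["valid_addresses"] / monthly["living_alumni"]
print(monthly)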

 

Because an archive view preserves changing statuses over time at the level of the individual constituent, it can be used for reporting trends along any slice you choose (age bracket, geography, school, etc.), and can play a role in staff activity/performance reporting and alumni engagement scoring.

 

But enough about archive views themselves. Let’s talk about using them for predictive modelling.

 

In the archive example above, you see a bunch of 0/1 indicator variables. Indicator variables are common in predictive modelling. For example, “Mailing address present” can have one of two states: Present or not present. It’s binary. A frequency breakdown of my data at this point in time looks like this (in Data Desk):

 

[Image: Data Desk frequency breakdown of the Address Indicator (0/1)]

 

About 78% of living alumni have a valid address in the database today — the records with an address indicator of ‘1’. As you might expect, alumni with a good address are more likely to have given than alumni without, and they have much higher lifetime giving on average. In the models I build to predict likelihood to give (and give at higher levels), I almost always make use of this association between contact information and giving.

 

But what about using the archive view data instead? The ‘Address Indicator’ variable breakdown above shows me the current situation, but the archive view adds depth by going back in time. Our own archive has been taking monthly snapshots since December of last year — seven distinct points in time. Summing on “Address Indicator” for each ID shows that large numbers of alumni have either never had a valid address during that time (0 out of 7 months), or always did (7 out of 7). The rest had a change of status during the period, and therefore fall between 0 and 7:

 

[Image: frequency breakdown of months with a valid address (0 to 7)]

 

A few hundred alumni (387) had a valid address in one out of seven months, 143 had a valid address in two out of seven — and so on. Our archive is still very young; only about 1% of alumni have a count that is not 0 or 7. A year from now, we can expect to see far more constituents populating the middle ground.
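The per-ID counts themselves take only a couple of lines to produce. Here is a rough sketch, again with stand-in column names rather than anything from our actual schema:

import pandas as pd

archive = pd.read_csv("archive_snapshots.csv")  # hypothetical extract, as above

# For each constituent, count the snapshots in which they had a valid address
address_counts = archive.groupby("id")["address_ind"].sum()

# Frequency breakdown: how many alumni sit at 0, 1, 2 ... 7 months
print(address_counts.value_counts().sort_index())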

 

What is most interesting to me is an apparent relationship between “number of months with valid address” (x-axis) and average lifetime giving (y-axis), even with the relative scarcity of data:

 

[Image: chart of average lifetime giving by number of months with a valid address]

 

My real question, of course, is whether these summed, continuous indicators really make much of a difference in a model over simply using the more familiar binary variables. The answer is “not yet — but someday.” As I noted earlier, only about 1% of living alumni have changed status in the past seven months, so even though this relationship seems linear, the numbers aren’t there to influence the strength of correlation. The Pearson correlation for “Address Indicator” (0/1) and “Lifetime Giving” is 0.186, which is identical to the Pearson correlation for “Address Count” (0 to 7) and “Lifetime Giving.” For all other variables except one, the archive counts have only very slightly higher correlations with Lifetime Giving than the straight indicator variables. (Email is slightly lower.)

 

It’s early days yet. All I can say is that there is potential. Have a look at this pair of regression analyses, both using Lifetime Giving (log-transformed) as the dependent variable. (Click on image for larger view.) In the window on the left, all the independent variables are the regular binary indicator variables. On the right, the independent variables are counts from our archive view. The difference in R-squared from one model to the other is very slight, but headed in the right direction: From 12.7% to 13.0%.

 

[Image: side-by-side regression outputs, binary indicators vs. archive counts]
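If you want to run the same comparison on your own data, a bare-bones version in Python might look like the sketch below. The real models here were built in Data Desk, and the column names are hypothetical:

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("modelling_file.csv")  # hypothetical: one row per living alum

# Log-transform lifetime giving; adding 1 keeps non-donors ($0) defined
y = np.log1p(df["lifetime_giving"])

# Model 1: the familiar 0/1 indicator variables
X1 = sm.add_constant(df[["address_ind", "email_ind", "phone_ind"]])
# Model 2: the archive counts (0 to 7 snapshots present)
X2 = sm.add_constant(df[["address_count", "email_count", "phone_count"]])

for X in (X1, X2):
    print(sm.OLS(y, X).fit().rsquared)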

 

Looking back on my student days, I cannot deny that I enjoyed even the quiet, dull hours spent in the university archives. Fortunately, though, and due in no small part to cool data like this, my work since then has been a lot more interesting. Stay tuned for more from our archives.

 


20 July 2014

Eliminate your duplicate data row problems with simple SQL

Filed under: Data, SQL — kevinmacdonell @ 1:38 pm

We’ve all faced this problem: Duplicate rows in our data. It messes up our reports and causes general confusion and consternation for people we supply data to. The good news is you can take care of this problem fairly easily in an SQL query using a handy function called LISTAGG. I discovered this almost by accident just this week, and I’m delighted to share.

Duplicate rows are a result of what we choose to include in our database query. If we are expecting to get only one row for each constituent ID, then we should be safe in asking for Gender, or Name Prefix, or Preferred Address. There should be only one of each of these things for each ID, and therefore there won’t be duplicate rows. These are examples of one-to-one relationships.

We get into trouble when we ask for something that can have multiple values for a single ID. That’s a one-to-many relationship. Examples: Any address (not just Preferred), Degree (some alumni will have more than one), Category Code (a person can be an alum and a parent and a staff member, all at the same time) … and so on. Below, the first constituent is duplicated on Category, and the second constituent is duplicated on Degree.

[Image: query results showing one ID duplicated on Category and another on Degree]

That’s not the worst of it; when one ID is duplicated on two or more elements, things get really messy. If A0001 above had three Category codes and two degrees, it would appear six times (3 x 2) in the result set. Notice that we are really talking about duplicate IDs — not duplicate rows. Each row is, in fact, unique. This is perfectly correct behaviour on the part of the database and query. Try explaining that to your boss, however! You’ve got to do something to clean this up.

What isn’t going to work is pivoting. In SQL, the PIVOT function creates a new column for each unique category of some data element, plus an aggregation of the values for that category. I’ve written about this here: Really swell SQL: Why you must know PIVOT and WITH. Using PIVOT is going to give you something like the result below. (I’ve put in 1’s and nulls to indicate the presence or absence of data.)

[Image: PIVOT result with one column per category, containing 1s and nulls]

 

What we really want is THIS:

[Image: desired result, one row per ID with categories listed in a single column]

The answer is this: String aggregation. You may already be familiar with aggregating quantitative values, usually by counting or summing. When we query on a donor’s giving, we usually don’t want a row for each gift — we want the sum of all giving. Easy. For strings — categorical data such as Degree or Category Code — it’s a different situation. Sure, we can aggregate by counting. For example, I can ask for a count of Degree, and that will limit the data to one row per ID. But for most reporting scenarios, we want to display the data itself. We want to know what the degrees are, not just how many of them there are.

The following simplified SQL will get you the result you see above for ID and Category — one row per ID, and a series of category codes separated by commas. This code works in Oracle 11g — I can’t speak to other systems’ implementations of SQL. (For other Oracle-based methods, check out this page. To accomplish the same thing in a non-Oracle variant of SQL called T-SQL, scroll down to the Postscript below.)

Obviously you will need to replace the sample text with your own schema name, table names, and field identifiers.

SELECT
SCHEMA.ENTITY.ID,
LISTAGG ( SCHEMA.ENTITY.CATG_CODE, ', ' )
WITHIN GROUP ( ORDER BY SCHEMA.ENTITY.CATG_CODE ) AS CATEGORY

FROM SCHEMA.ENTITY

GROUP BY SCHEMA.ENTITY.ID

The two arguments given to LISTAGG are the field CATG_CODE and a string consisting of a comma and a space. The string, which can be anything you want, is inserted between each Category Code. On the next line, ORDER BY sorts the category codes in alphabetical order.

LISTAGG accepts only two arguments, but if you want to include two or more fields and/or strings in the aggregation, you can just concatenate them inside LISTAGG using “pipe” characters (||). The following is an example using Degree Code and Degree Code Description, with a space between them. To get even fancier, I have replaced the comma separator with the character code for “new line,” which puts each degree on its own line — handy for readability in a list or report. (This works fine when used in Custom SQL in Tableau. For display in Excel, you may have to use a carriage return — char(13) — instead of a line feed.) You will also notice that I have specified a join to a table containing the Degree data.

SELECT
SCHEMA.ENTITY.ID,
LISTAGG ( SCHEMA.DEGREE.DEGC_CODE || ' ' || SCHEMA.DEGREE.DEGC_DESC, chr(10) )
WITHIN GROUP ( ORDER BY SCHEMA.DEGREE.DEGC_CODE, SCHEMA.DEGREE.DEGC_DESC ) AS DEGREES

FROM SCHEMA.ENTITY

INNER JOIN 
SCHEMA.DEGREE 
ON ( SCHEMA.ENTITY.ID = SCHEMA.DEGREE.ID )

GROUP BY SCHEMA.ENTITY.ID

[Image: result with one row per ID and each degree on its own line]

Before I discovered this capability, the only other automated way I knew how to handle it was in Python. All the previous methods I’ve used were at least partially manual or just very cumbersome. I don’t know how long string aggregation in Oracle has been around, but I’m grateful that a data problem I have struggled with for years has finally been dispensed with. Hurrah!
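For anyone curious about that Python route, the pandas version boils down to a groupby plus a string join. This is just a sketch with made-up data, not a snippet from a production script:

import pandas as pd

# Toy data: duplicate IDs, one row per ID per category code
df = pd.DataFrame({
    "ID":   ["A0001", "A0001", "A0001", "A0002", "A0002"],
    "CATG": ["ALUM", "PRNT", "STAF", "ALUM", "FRND"],
})

# Roughly equivalent to LISTAGG(CATG, ', ') WITHIN GROUP (ORDER BY CATG) ... GROUP BY ID
category_lists = (
    df.sort_values("CATG")
      .groupby("ID")["CATG"]
      .agg(", ".join)
      .reset_index(name="CATEGORY")
)
print(category_lists)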

~~~~~~

POSTSCRIPT

After publishing this I had an email response from David Logan, Director of Philanthropic Analytics at Children’s Mercy in Kansas City, Missouri. He uses T-SQL, Microsoft’s proprietary extension of SQL used in SQL Server. Some readers might find this helpful, so I will quote from his email:

I was reading your post this morning about using the LISTAGG function in Oracle SQL to concatenate values from duplicate rows into a list. A couple of months ago I was grappling with this very problem so that I could create a query of Constituents that included a list of all their phone numbers in a single field. LISTAGG is not available for those of us using T-SQL … the solution for T-SQL is to use XML PATH().

Here’s a link to a source that demonstrates how to use it: http://sqlandme.com/2011/04/27/tsql-concatenate-rows-using-for-xml-path/.

One key difference in this method from LISTAGG is that the delimiter character (such as the comma—or characters as in your example where you have comma space) is added before each value returned so the raw string returned will have this character before the first value. You can use the STUFF() function to eliminate this first delimiter.

Here’s my solution using XML PATH() to return a list of phone numbers from our donor database:

SELECT
CA.ID,
STUFF
(
  (
  SELECT
   '; ' + PH.NUM + ' [' + re7_ReportingTools.dbo.GetLongDescription(PH.PHONETYPEID) + ']'
  FROM
   re7_db.dbo.CONSTIT_ADDRESS_PHONES CAP
    INNER JOIN
     re7_db.dbo.PHONES PH
     ON CAP.PHONESID=PH.PHONESID
  WHERE
     CAP.CONSTITADDRESSID=CA.ID
    AND
     PH.DO_NOT_CALL=0
  ORDER BY CAP.SEQUENCE
  FOR XML PATH('')
  ),1,2,''
) PHONENUMS
FROM
re7_db.dbo.CONSTIT_ADDRESS CA
 
GROUP BY CA.ID

I concatenated my list using a semi-colon followed by a space ‘; ‘. I used the STUFF() function to remove the leading semi-colon and space.

Here are a few sample rows that are returned (note I’ve “greeked” the prefixes for donor privacy):

[Image: sample rows showing aggregated phone number lists per constituent]

25 June 2014

How our sector is getting its butt kicked by just about everyone

Filed under: Analytics, Data, Off on a tangent, skeptics — kevinmacdonell @ 8:24 pm

There isn’t a lot to do at my wife’s family summer cottage when it rains, especially if I’ve forgotten to bring a book. I find myself scanning the shelves for something — anything — to read. On one such recent rainy weekend, I picked up a book my niece had left on a table. It was a heavy hardcover textbook, and it contained a mild surprise.

What I found was an introduction to such topics as linear and non-linear relationships, probability, scatterplots, best-fit lines, and correlation — concepts that I’ve come to have a deep interest in, mainly because I have profitably put them to work in the service of fundraising and alumni engagement.

Was this a college textbook? A manual for budding data scientists?

No, not at all. My niece is in Grade 9, and this was her mathematics textbook.

I don’t know if the Nova Scotia math curriculum is typical, nor am I qualified to judge the quality of a textbook. And my niece may not be thrilled about learning statistics. But some group of experts in math education apparently believe these concepts are well within the grasp of young Nova Scotian minds. Power to them.

What does this have to do with you? Yes, plastic young minds may grasp with relative ease what we oldsters struggle with (new languages, for example), but we have one distinct advantage. Where adolescents view these concepts as abstractions without a purpose, we may immediately see how we can use them to advance our causes, and our careers.

Yet, we all know otherwise intelligent people in our field whose eyes glaze over when they see a chart or hear anything that sounds like math — even Grade 9 math. Somehow, we must be failing to demonstrate the connection between analytics and success in fundraising and alumni engagement.

So in what fields is analytics really taking root? Well, every field. Including farmers’ fields.

Food production has been a focus of science and statistics for many decades. But today it’s not confined to experimental farms or the labs of agribusiness companies. Real, honest-to-goodness farmers are enthusiastic quants compared to most of us working in the nonprofit sector.

Jennifer Cunningham @jenlynham is Senior Director, Metrics and Marketing in the Office of Alumni Affairs at Cornell University. In a recent email to me and my “Score!” co-writer Peter Wylie she writes: “Just gave a talk today at the National Agricultural Alumni and Development Association (NAADA) conference. Went on a hayride this afternoon with the group here at Penn State. The farmers here are using data like you wouldn’t believe. Guys have been farming for 30+ years and they’re going on and on about the importance of measuring input vs output … it’s so interesting to hear these old-school guys go on about the importance of it in their worlds. And yet, some people in our industry, raising billions, still don’t get it?!?!”

It’s a fair observation.

I have a lot of time for people who are not enamoured with analytics due to an unfamiliarity with working with numbers. They require explanations and justifications for using analytical methods. That’s fine. I myself didn’t see math as having much to do with my working life until I entered my mid-thirties, and sometimes I still think the right story beats numbers.

But like our friend Jennifer, I feel less sympathy for ignorance when it’s a deliberate choice. There’s a line where lack of interest in data equates to wilful illiteracy. Someday soon, being on the wrong side of that line is going to disqualify a person from working for important causes.

23 February 2014

Really swell SQL: Why you must know PIVOT and WITH

Filed under: Data, SQL — kevinmacdonell @ 2:37 pm

Some data manipulation needs are so common, and their solutions so elusive, that when I find an answer, I just have to share. Today I  will show you two problems and their solutions, wrapped up into one.

The first problem: Given a database query pulling any kind of transactional data (involving dollars, say), how do you go about aggregating the data in more than one way, within the confines of a single query, so that the various aggregations appear on one row of data?

The second problem: Given a database query that returns rows that are duplicates (by constituent ID, for example) due to some category that appears multiple times per ID (category code, gift year or whatever), how do you get rid of those duplicate rows without losing any of the data? In other words, how do you pivot the data so that the categories are added as columns rather than rows, while preserving the structure of all the other data that you don’t want to pivot?

That’s as clear as I can describe it … believe me, if you work with data, you encounter these situations all the time. An illustration might be helpful. Here is some donor gift data, limited to just Donor ID, Fiscal Year Code, and Giving Total, which is the sum of Gift Amount by ID and fiscal year. (For the sake of simplicity, I am using a code for fiscal year instead of extracting the fiscal year from actual gift dates, and I am limiting the data to three fiscal years of donation history.)

[Image: gift data with Donor ID, Fiscal Year Code, and Giving Total]

This is fine, but notice that some IDs appear on multiple rows. What if we want only one row of data per donor? And what if we want an overall giving total for the donor in one column, and then yearly totals for the three fiscal years in our data — one column and total per year? Like this:

[Image: desired result, one row per donor with overall and yearly giving totals]

Today I’ll show you a couple of techniques to combine different types of aggregations into a single, short SQL statement that isn’t difficult to read. (All of this works in Oracle 11g — I can’t speak to other systems’ implementations of SQL.) My primary need for writing such a query is for embedding custom SQL in a Tableau data source for the purpose of reporting. I don’t really require fancy SQL to render this precise example in Tableau. This is just a very simple example which you can adapt to more complex situations.

The SQL statement that produced our first set of results looks like this. Again for simplicity, let’s assume we require only one table, called GIFTS, so there are no joins to consider:

SELECT GIFTS.ID,
 SUM ( GIFTS.GIFT_AMT ) AS GIVING_TOTAL,
 GIFTS.FISC_YR_CODE

 FROM
 GIFTS

 WHERE 
 GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13')

 GROUP BY
 GIFTS.ID,
 GIFTS.FISC_YR_CODE

We end up with one type of aggregation: The SUM of all gifts, grouped by ID and Fiscal Year. But there is no way to include a SUM by ID only, because if we ask for FISC_YR_CODE in the SELECT, we have to include it in the GROUP BY as well. We will deal with that issue a little later on.

First let’s deal with the duplicate IDs. Each donor ID appears once for every fiscal year out of the three that the donor made a gift. We want to pivot the data, so that the rows become columns instead. Introducing … the PIVOT operator! Have a look at the following SQL:

SELECT *

 FROM (

  SELECT GIFTS.ID,
  GIFTS.GIFT_AMT,
  GIFTS.FISC_YR_CODE

  FROM
  GIFTS

  WHERE 
  GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13')

 )

PIVOT (
  SUM ( GIFT_AMT ) AS year FOR ( FISC_YR_CODE )
  IN ( 'FY11' AS FY11, 'FY12' AS FY12, 'FY13' AS FY13 )
  )

The inner SELECT statement gets all the data and hands it to the PIVOT operator, and the outer SELECT asks for all the resulting columns from that pivot. The three categories we want to appear as columns (that is, each fiscal year) are specified with IN, and the word “year” is appended to each column name. (This text can be anything you want.) I have added aliases (using AS) in order to prevent the single quote marks, which are required, from appearing in the results.

The content of the “cells” will be the SUM of gift amounts for each ID and fiscal year. The result will look like this:

[Image: PIVOT result with one giving-total column per fiscal year]

Useful, eh? PIVOT does have some limitations, which I will discuss later. Let’s press on. We’re definitely getting close to realizing our goal. Now we just need to have one other aggregation (total giving by donor) appear on the same line. The problem, noted earlier, is that to SUM on giving by ID only, we are forced to leave out all other columns, in this case the fiscal year code. The SQL for giving by donor looks like this:

SELECT GIFTS.ID,
 SUM ( GIFTS.GIFT_AMT ) AS GIVING_TOTAL

 FROM
 GIFTS

 WHERE 
 GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13')

 GROUP BY
 GIFTS.ID

Clearly we need not one but two queries, the first to get total giving by donor, and the second to get giving by ID and year (pivoted), and then join the two result sets as if they were tables. Normally this would call for creating temporary tables in the database to store the results for further querying, but data analysts are not DBAs: We may not have sufficient permissions to perform this operation on the database. (As far as I am aware, by default Tableau disallows table creation, perhaps to reassure database admins?)

Fortunately there is a solution. It’s not well-known — my fat Oracle 11g SQL reference book doesn’t even mention it — so you’ll need to do some online searches to find out more. Introducing … the WITH clause!

It works like this. Each WITH clause contains a SELECT statement that returns a result set that behaves like a temporary table, and which can be referenced in other SELECT statements. This example uses two clauses, aliased as ‘total_giving’ and ‘yearly_giving’. A final SELECT statement joins the two as if they were tables.

WITH
 total_giving AS
 (

SELECT GIFTS.ID,
 SUM ( GIFTS.GIFT_AMT ) AS GIVING_TOTAL

 FROM
 GIFTS

 WHERE 
 GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13')

 GROUP BY
 GIFTS.ID

 ),

 yearly_giving AS
 (

SELECT *

 FROM (

  SELECT GIFTS.ID,
  GIFTS.GIFT_AMT,
  GIFTS.FISC_YR_CODE

  FROM
  GIFTS

  WHERE 
  GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13')


 )

PIVOT (
  SUM ( GIFT_AMT ) AS year FOR (FISC_YR_CODE)
  IN ('FY11' AS FY11, 'FY12' AS FY12, 'FY13' AS FY13)
  )
 )

SELECT
 total_giving.ID,
 total_giving.GIVING_TOTAL,
 yearly_giving.FY11_YEAR,
 yearly_giving.FY12_YEAR,
 yearly_giving.FY13_YEAR

 FROM
 total_giving,
 yearly_giving

 WHERE
 total_giving.ID = yearly_giving.ID

You can accomplish what I’ve done in this example without using the WITH clause, but as an SQL statement gets more complex, it gets harder to read. As far as I know, you can use as many WITH clauses as you like, allowing you to build complex queries while preserving the neat organization and therefore readability of the SQL statement. Readability is important when you must modify or debug the statement weeks later. Oracle error messages are maddeningly uninformative!

I encourage you to research WITH and PIVOT on your own, but here are a few pointers:

  • Type ‘WITH’ only once. Multiple WITH clauses are separated by a comma. (No comma following the last clause.)
  • A WITH clause can reference any previously-defined WITH clause as if it were an already-existing table. Very useful.
  • When using PIVOT, notice that we must specify all the categories in the pivot (fiscal year code, in this case). Unfortunately, PIVOT is not capable of dynamically pivoting whatever data it happens to find. This is just the way SQL works. If you want to pivot data without knowing in advance what the categories are, you may have to use a true programming language. I use Python, but that is not always an option. (See the pandas sketch just after this list.)
  • You can use any aggregation you like in the PIVOT — COUNT, SUM, MAX, etc.
  • Don’t include the schema and table names in the identifiers named in PIVOT, or you will get an error.
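As promised above, here is a rough pandas sketch of a pivot that discovers the categories on its own. The data and column names are invented for illustration; this is not a drop-in replacement for the SQL:

import pandas as pd

# Toy gift data: one row per donor per fiscal year
gifts = pd.DataFrame({
    "ID":      ["A0001", "A0001", "A0002", "A0003"],
    "FISC_YR": ["FY11", "FY13", "FY12", "FY11"],
    "AMT":     [100.0, 250.0, 50.0, 500.0],
})

# pivot_table picks up whatever fiscal years exist; no need to list them in advance
yearly = gifts.pivot_table(index="ID", columns="FISC_YR", values="AMT",
                           aggfunc="sum", fill_value=0)

# The other aggregation from the example: overall total per donor
yearly["GIVING_TOTAL"] = yearly.sum(axis=1)
print(yearly.reset_index())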

I hope this has been of some help. The nuts and bolts of pulling data for analysis may not be the sexiest part of a data analyst’s job, but brushing up on these skills (whatever your database system is) will pay off in the long run.

30 July 2013

Getting bitten by Python

When I was first learning to build predictive models, preparing the data was part of the adventure. In time, though, many operations on the data became standard instead of exploratory. Eventually they became simply repetitive and tedious. When any task becomes repetitive, I think of ways to automate it. Given that data prep makes up 80 percent of the work of building a model (according to some authors), the benefits of automation are obvious.

I can think of only two ways to replicate the manual operations you need to perform on a large data set to make it ready for modelling: Use software specially designed for the task, or code your own data-handling scripts. I am lazy and drawn to software solutions that make hard things easy, and I’m not a programmer. Yet I have veered away from a ready-made software solution to pursue an interest in the scripting language called Python, and in particular the Python code library called pandas, written specifically for working with data.

Maybe it’s because Python is open-source and free, or because it is powerful, or because it is flexible and widely adaptable to multiple uses on the job. I don’t know. But for the past few months I’ve been obsessed with learning to use it, and that’s what I’d like to talk about today.

I’m guessing very few CoolData readers have experience writing scripts for handling data. I know some people who do most of their stats work in the R language or manipulate data in Excel using VBA. But the majority of readers probably consider themselves severely allergic to coding of any kind. I concede that it isn’t for everyone, but look: Just as we don’t need to be professional statisticians to use statistical tools to create value for the business, we don’t need to be computer scientists to write useful scripts that can free up large chunks of time we now spend on routine tasks that bore us.

(If you work with someone in IT or Advancement Services who pulls and reshapes your data for you, they might be especially interested in the idea of learning how to automate your requests. They might also be familiar with Python already.)

I should say here that my aim is not to automate predictive modelling itself. There are Python modules for modelling, too, from the venerable classics such as regression to the latest advanced techniques. But I’m not so much interested in them, not yet at least. Building predictive models is best done hands-on, guided by a human modeler’s expertise and domain knowledge. My main interest is in eliminating a big chunk of the standard rote work so that I can apply the freshest version of myself to the more interesting and creative elements of data exploration and model creation.

So what is Python (and more specifically, pandas) good for?

  • A script or program can execute a series of database queries and join the results in exactly the way you want, allowing you to build very complex structures and incorporate custom aggregations that might be harder to do using your existing querying/reporting tools. For example, let’s say you want to build a file of donors and include columns for date of first and last gift, amount of highest gift, total cash gifts for the past five fiscal years, and percentage of total giving devoted to student financial assistance. Unless IT has built some advanced views for you from the base tables in your database, many of these variables will require applying some calculations to the raw transactional data. I could certainly build a query to get the results for this modest example, but it would involve a few sub-queries and calculated fields. Multiply that by a hundred and you’ve got an idea of how complex a query you’d have to build to deliver a modelling-ready data set. In fact it may be technically impossible, or at least difficult, to build such a single massive query. In Python, however, you can build your data file up in an orderly series of steps. Adding, removing or editing those steps is not a big deal.
  • Python also makes it simple to read data from .csv and Excel files, and merge it painlessly with the data you’ve extracted from your database. This is important to me because not all of my modelling data comes from our database. I’ve got eight years of call centre data results by alumni ID, wealth-related census data by Canadian postal code, capacity data by American ZIP code, and other standalone data sets. Adding these variables to my file used to be a tedious, manual process. In Python, left-joining 20 columns of census data to a file of 100,000 alumni records using Postal Code as the join key takes a single line of code and executes faster than a knight can say “Ni!” (Inside Python joke.) (A sketch of that join appears just after this list.)
  • Many other common operations also take only one or two lines of code, including conversion of categorical variables to 0/1 dummy variables, performing transformations and mathematical operations on variables, filling in or imputing missing data with constants or calculated values, pivoting data, and creating new variables from existing ones via concatenation (for strings) or math (for numbers).
  • With a script, you can also iterate over the rows of a data file and perform different operations based on conditional statements.
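Before getting to the main example, here is the kind of one-line join mentioned in the second bullet, sketched out with invented file and column names:

import pandas as pd

# Hypothetical files: an alumni extract, and census-style data keyed on postal code
alumni = pd.read_csv("alumni_extract.csv")          # includes a 'postal_code' column
census = pd.read_csv("census_by_postal_code.csv")   # one row per postal code

# The left join itself really is a single line
merged = alumni.merge(census, on="postal_code", how="left")

print(merged.shape)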

I’m not going to provide a Python tutorial today (although I’m tempted to do so in the future), but here is a sample line of code from a script, with a description of what it does. This doesn’t give you enough information to do anything useful, but you’ll at least see how compact and powerful the language is.

Skipping some necessary preliminaries, let’s say you’ve just used Python to query your Oracle database to read into memory a data set containing the variables ID, Constituent Category, Sex, and Age for all living constituent persons. (An operation that itself takes little more than two or three lines of code.) Obviously it depends on your database and code structure, but let’s say “Constituent Category” includes codes for such categories as Alumnus/na (ALUM), Non-degreed alumni (ALND), Parent (PRNT), Friend (FRND), Faculty (FCTY), Staff (STAF), and so on. And let’s further assume that a constituent can belong to multiple categories. Most people will have only one code, but it’s possible that a person can simultaneously be an alum, a parent, and a faculty member.

In our script, the data is read into a structure called a DataFrame (a tool provided by the pandas code library). This should sound familiar to users of R in particular. For the rest of us, a DataFrame is very much like a database table, with named columns and numbered (“indexed”) rows. Had we pasted the data into Excel instead, it might look like this:

[Image: data with ID, Con_Code, Sex, and Age, where some IDs appear on multiple rows]

Right away we see that William and Janet are represented by multiple rows because they have multiple constituent codes. This won’t work for predictive modelling, which requires that we have just one row per individual – otherwise certain individuals would carry more weight in the model than they should. You could say that multiple records for Janet mean that 60-year-old females are over-represented in the data. We could delete the extra rows, but we don’t want to do that because we’d be throwing away important information that is almost certainly informative of our modelling target, e.g. likelihood to make a donation.

In order to keep this information while avoiding duplicate IDs, we need to pivot the data so that each category of Constituent Code (ALUM, PRNT, etc.) becomes its own column. The result we want would look like this in Excel:

[Image: pivoted data with one 0/1 dummy column per constituent code]

The Con_Code column is gone, and replaced with a series of columns, each a former category of Con_Code. In each column is either a 0 or 1, a “dummy variable” indicating whether an individual belongs to that constituency or not.

Getting the data from the first state to the final state requires just three lines of code in Python/pandas:

df = pd.merge(df, pd.crosstab(df.ID, df.Con_Code), how='left', left_on='ID', right_index=True)

df = df.drop(['Con_Code'], axis=1)

df = df.drop_duplicates()

This snippet of code may look invitingly simple or simply terrifying – it depends on your background. Whatever – it doesn’t matter, because my point is only that these three lines are very short, requiring very little typing, yet they elegantly handle a common data prep task that I have spent many hours performing manually.

Here’s a brief summary of what each line does:

Line 1: There’s a lot going on here … First, “df” is just the name of the DataFrame object. I could have called it anything. On the right-hand side, you see “pd” (which is shorthand for pandas, the module of code that is doing the work), then “crosstab,” (a function that performs the actual pivot). In the parentheses after pd.crosstab, we have specified the two columns to use in the pivot: df.ID is the data we want for the rows, and df.Con_Code is the column of categories that we want to expand into as many columns as there are categories. You don’t have to know in advance how many categories exist in your data, or what they are – Python just does it.

Pd.crosstab creates a new table containing only ID and all the new columns. That entity (or “object”) is just sitting out there, invisible, in your computer’s memory. We need to join it back to our original data set so that it is reunited with Age, Sex and whatever other stuff you’ve got. That’s what “pd.merge” does. Again, “pd” is just referencing the pandas module that is providing the “merge” function. The operation is called “merge,” but it’s much the same thing as an SQL-type join, familiar to anyone who queries a database. The merge takes two inputs, our original DataFrame (“df”), and the result from the crosstab operation that I described above. The argument called “how” specifies that we want to perform the equivalent of a left-join. A couple of other optional arguments explicitly tell Python which column to use as a join key (‘ID’).

The crosstab operation is enclosed within the merge operation. I could have separated these into multiple lines, which would have been less confusing, but my point is not to teach Python but to demonstrate how much you can accomplish with a trivial amount of typing. (Or copying-and-pasting, which works too!)

We’re not quite done, though. Our merged data is still full of duplicate IDs, because the Con_Code column is still present in our original data.

Line 2 deletes (“drops”) the entire column named Con_Code, and reassigns the altered DataFrame to “df” – essentially, replacing the original df with the new df created by the drop operation.

Now that Con_Code is gone, the “extra” rows are not just duplicates by ID, they are perfect duplicates across the entire row – there is nothing left to make two rows with the same ID unique. We are ready for the final step …

Line 3 deletes (or “drops”) every row that is a duplicate of a previous row.

Having accomplished this, another couple of lines near the end of the script (not shown) will write the data row by row into a new .csv file, which you can then import into your stats package of choice. If you had two dozen different constituent codes in your data, your new file will be wider by two dozen columns … all in the blink of an eye, without any need for Excel or any manual manipulation of the data.
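Those closing lines are nothing more exotic than a call to the DataFrame’s to_csv method. Continuing with the df from the example above (the output file name is arbitrary):

# Write the finished DataFrame out to a new .csv, ready for your stats package;
# index=False keeps pandas' internal row index out of the file
df.to_csv("prepared_modelling_data.csv", index=False)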

Excel is perfectly capable of pivoting data like we see in the example, but for working with very large data sets and seamlessly merging the pivoted data back into the larger data file, I can’t think of a better tool than Python/pandas. The bigger the data set gets, the more need there is to stop working with it in tools that go to the extra work of DISPLAYING it. I suppose one of the beauties of Excel is that you can see the data as you are working on it. In fact, as I slowly built up my script, I repeatedly opened the .csv file in Excel just to have that visual inspection of the data to see that I was doing the right thing. But I inevitably reached the point at which the file was just too large for Excel to function smoothly. At 120,000 rows and 185 columns in a 90MB file, it was hardly Big Data – Excel could open the file no problem – but it was large enough that I wouldn’t want to do much filtering or messing with formulas.

On a quick first read, the code in the example above may seem impenetrable to a non-programmer (like me), but you don’t need to memorize a lot of functions and methods to write scripts in Python. Combing the Web for examples of what you want to do, using a lot of cut-and-paste, perhaps referring to a good book now and again – that’s all it takes, really.

That said, it does require time and patience. It took me many hours to cobble together my first script. I re-ran it a hundred times before I tracked down all the errors I made. I think it was worth it, though – every working piece of code is a step in the direction of saving untold hours. A script that works for one task often does not require much modification to work for another. (This cartoon says it all: Geeks and repetitive tasks.)

Beyond data preparation for predictive modelling, there are a number of directions I would like to go with Python, some of which I’ve made progress on already:

  • Merging data from multiple sources into data extract files for use in Tableau … With version 8.0 of the software comes the new Tableau API for building .tde files in Python. This was actually my first experiment with Python scripting. Using the TDE module and a combination of database queries and pandas DataFrames, you can achieve a high degree of automation for refreshing the most complex data sets behind your views and dashboards.
  • Exploring other modelling techniques besides my regular mainstay (regression) … I’ve long been intrigued by stuff such as neural networks, Random Forest, and so on, but I’ve been held back by a lack of time as well as some doubt that these techniques offer a significant improvement over what I’m doing now. Python gives ready access to many of these methods, allowing me to indulge my casual interest without investing a great deal of time. I am not a fan of the idea of automated modelling – the analyst should grasp what is going on in that black box. But I don’t see any harm in some quick-and-dirty experimentation, which could lead to solutions for problems I’m not even thinking of yet.
  • Taking advantage of APIs …. I’d like to try tapping into whatever social networking sites offer in the way of interfaces, and also programmatically access web services such as geocoding via Google.
  • Working with data sets that are too large for high-level applications such as Excel … I recently tried playing with two days’ worth of downloaded geocoded Twitter data. That’s MILLIONS of rows. You aren’t going to be using Excel for that.

I hope I’ve been able to transfer to you some of my enthusiasm for the power and potential of Python. I guess now you’ll be wondering how to get started. That’s not an easy question to answer. I could tell you how to download and install Python and an IDE (an integrated development environment, a user interface in which you can write, run, and debug your scripts), but beyond that, so much depends on what you want to do. Python has been extended in a great many directions – pandas for data analysis being just one of them.

However, it wouldn’t hurt to get a feel for how “core Python” works – that is, the central code base of the language along with its data types, object types, and basic operations such as “for” loops. Even before you bother installing anything, go to Codecademy.com and try a couple of the simple tutorials there.

For specific questions Google is your friend, but if you want a reference that covers all the basics in more or less plain English, I like “Learning Python” (4th Edition, but I see there’s a 5th Edition now) by Mark Lutz, published by O’Reilly. Another O’Reilly book, “Python for Data Analysis,” by Wes McKinney, describes how to crunch data with pandas and other related code libraries. (McKinney is the main author of the pandas library.)

I think readers new to programming (like me) will feel some frustration while learning to write their first scripts using any one book or resource. The Lutz book might seem too fine-grained in its survey of the basics for some readers, and McKinney is somewhat terse when offering examples of how various methods work. The problem is not with the books themselves – they’re wonderful. Consider that Python is used in web interfaces, robotics, database programming, gaming, financial markets, GIS, scientific programming, and probably every academic discipline that uses data – you must understand that core texts are perforce very general and abstract. (Think of grammar books for spoken languages.) It’s up to coders themselves to combine the basic building blocks in creative and powerful ways.

That said, after many, many hours spent hopping back and forth between these books, plus online tutorials and Python discussion forums – and just messing around on my own – I have figured out a few useful ways to accomplish some of the more common data preparation tasks that are specific to predictive modelling. Someday I would be happy to share – and, as always, to learn from the experience of others.

20 February 2013

The ‘analytic’ investment

Filed under: Analytics, Data — kevinmacdonell @ 10:49 am

Everyone’s talking about predictive analytics, Big Data, yadda yadda. The good news is, many institutions and organizations in our sector are indeed making investments in analytics and inching towards becoming data-driven. I have to wonder, though, how much of current investment is based on hype, and how much is going to fall away when data is no longer a hot thing.

Becoming a data-driven organization is a journey, not a destination. Forward progress is not inevitable, and it is possible for an office, a department or an institution to slip backward on the path, even when it seems they’ve “arrived”. In order for analytics to mature from a cutting-edge “nice-to-have” into a regular part of operations, the enterprise needs to be aware of its returns to the bottom line.

In my view, current investments in analytics are often done for reasons that are well-intentioned but vague: It seems to be the right thing to do these days … we see others doing it, so we feel we need to as well … we have an agenda for innovation and this fits the bill … and so on. I’m glad to see the investment, but not every promising innovation gets to stick around. Demonstrating ability to generate revenue — either through savings or through identifying new sources of revenue — will carry the day in the long run.

As I write this, I hear the jangle of railway bells at the level crossing in the early-morning dark outside my hotel room on the city’s downtown waterfront. I’m in Seattle today to attend the DRIVE 2013 conference, hosted by the University of Washington. I’ll be speaking on this topic — the “analytic” investment — later today. I have to admit to having struggled with making the session relevant for this group. For one, they don’t need convincing that making the investment is worth it. And second, if they think that I and my employer have figured out how to calculate the return on investment for analytics programs, they may be in for a disappointment. We have not.

In fact, when it comes right down to it, I like to spend my day working on cool things, interesting problems that face our department, and not so much on stuff that sounds like accounting (“ROI”). I’m betting many of the attendees of my session feel the same way. So I’ll be asking them to stop thinking about how they can get their managers, directors and vice presidents to understand the language of data and analytics. They’ll be far more successful if they try to speak the language their bosses respond to: Return on investment.

I may be a little short on answers for you, but I do have some pretty good questions.

