When I was first learning to build predictive models, preparing the data was part of the adventure. In time, though, many operations on the data became standard instead of exploratory. Eventually they became simply repetitive and tedious. When any task becomes repetitive, I think of ways to automate it. Given that data prep makes up 80 percent of the work of building a model (according to some authors), the benefits of automation are obvious.
I can think of only two ways to replicate the manual operations you need to perform on a large data set to make it ready for modelling: Use software specially designed for the task, or code your own data-handling scripts. I am lazy and drawn to software solutions that make hard things easy, and I’m not a programmer. Yet I have veered away from a ready-made software solution to pursue an interest in the scripting language called Python, and in particular the Python code library called pandas, written specifically for working with data.
Maybe it’s because Python is open-source and free, or because it is powerful, or because it is flexible and widely adaptable to multiple uses on the job. I don’t know. But for the past few months I’ve been obsessed with learning to use it, and that’s what I’d like to talk about today.
I’m guessing very few CoolData readers have experience writing scripts for handling data. I know some people who do most of their stats work in the R language or manipulate data in Excel using VBA. But the majority of readers probably consider themselves severely allergic to coding of any kind. I concede that it isn’t for everyone, but look: Just as we don’t need to be professional statisticians to use statistical tools to create value for the business, we don’t need to be computer scientists to write useful scripts that can free up large chunks of time we now spend on routine tasks that bore us.
(If you work with someone in IT or Advancement Services who pulls and reshapes your data for you, they might be especially interested in the idea of learning how to automate your requests. They might also be familiar with Python already.)
I should say here that my aim is not to automate predictive modelling itself. There are Python modules for modelling, too, from the venerable classics such as regression to the latest advanced techniques. But I’m not so much interested in them, not yet at least. Building predictive models is best done hands-on, guided by a human modeler’s expertise and domain knowledge. My main interest is in eliminating a big chunk of the standard rote work so that I can apply the freshest version of myself to the more interesting and creative elements of data exploration and model creation.
So what is Python (and more specifically, pandas) good for?
- A script or program can execute a series of database queries and join the results in exactly the way you want, allowing you to build very complex structures and incorporate custom aggregations that might be harder to do using your existing querying/reporting tools. For example, let’s say you want to build a file of donors and include columns for date of first and last gift, amount of highest gift, total cash gifts for the past five fiscal years, and percentage of total giving devoted to student financial assistance. Unless IT has built some advanced views for you from the base tables in your database, many of these variables will require applying some calculations to the raw transactional data. I could certainly build a query to get the results for this modest example, but it would involve a few sub-queries and calculated fields. Multiply that by a hundred and you’ve got an idea of how complex a query you’d have to build to deliver a modelling-ready data set. In fact it may be technically impossible, or at least difficult, to build such a single massive query. In Python, however, you can build your data file up in an orderly series of steps. Adding, removing or editing those steps is not a big deal.
- Python also makes it simple to read data from .csv and Excel files, and merge it painlessly with the data you’ve extracted from your database. This is important to me because not all of my modelling data comes from our database. I’ve got eight years of call centre data results by alumni ID, wealth-related census data by Canadian postal code, capacity data by American ZIP code, and other standalone data sets. Adding these variables to my file used to be a tedious, manual process. In Python, left-joining 20 columns of census data to a file of 100,000 alumni records using Postal Code as the join key takes a single line of code and executes faster than a knight can say “Ni!” (Inside Python joke.)
- Many other common operations also take only one or two lines of code, including conversion of categorical variables to 0/1 dummy variables, performing transformations and mathematical operations on variables, filling in or imputing missing data with constants or calculated values, pivoting data, and creating new variables from existing ones via concatenation (for strings) or math (for numbers).
- With a script, you can also iterate over the rows of a data file and perform different operations based on conditional statements.
I’m not going to provide a Python tutorial today (although I’m tempted to do so in the future), but here is a sample line of code from a script, with a description of what it does. This doesn’t give you enough information to do anything useful, but you’ll at least see how compact and powerful the language is.
Skipping some necessary preliminaries, let’s say you’ve just used Python to query your Oracle database to read into memory a data set containing the variables ID, Constituent Category, Sex, and Age for all living constituent persons. (An operation that itself takes little more than two or three lines of code.) Obviously it depends on your database and code structure, but let’s say “Constituent Category” includes codes for such categories as Alumnus/na (ALUM), Non-degreed alumni (ALND), Parent (PRNT), Friend (FRND), Faculty (FCTY), Staff (STAF), and so on. And let’s further assume that a constituent can belong to multiple categories. Most people will have only one code, but it’s possible that a person can simultaneously be an alum, a parent, and a faculty member.
In our script, the data is read into a structure called a DataFrame (a tool provided by the pandas code library). This should sound familiar to users of R in particular. For the rest of us, a DataFrame is very much like a database table, with named columns and numbered (“indexed”) rows. Had we pasted the data into Excel instead, it might look like this:
Right away we see that William and Janet are represented by multiple rows because they have multiple constituent codes. This won’t work for predictive modelling, which requires that we have just one row per individual – otherwise certain individuals would carry more weight in the model than they should. You could say that multiple records for Janet means that 60-year-old females are over-represented in the data. We could delete the extra rows, but we don’t want to do that because we’d be throwing away important information that is almost certainly informative of our modelling target, eg. likelihood to make a donation.
In order to keep this information while avoiding duplicate IDs, we need to pivot the data so that each category of Constituent Code (ALUM, PRNT, etc.) becomes its own column. The result we want would look like this in Excel:
The Con_Code column is gone, and replaced with a series of columns, each a former category of Con_Code. In each column is either a 0 or 1, a “dummy variable” indicating whether an individual belongs to that constituency or not.
Getting the data from the first state to the final state requires just three lines of code in Python/pandas:
df = pd.merge(df, pd.crosstab(df.ID, df.Con_Code), how='left', left_on='ID', right_index=True)
df = df.drop(['Con_Code'], axis=1)
df = df.drop_duplicates()
This snippet of code may look invitingly simple or simply terrifying – it depends on your background. Whatever – it doesn’t matter, because my point is only that these three lines are very short, requiring very little typing, yet they elegantly handle a common data prep task that I have spent many hours performing manually.
Here’s a brief summary of what each line does:
Line 1: There’s a lot going on here … First, “df” is just the name of the DataFrame object. I could have called it anything. On the right-hand side, you see “pd” (which is shorthand for pandas, the module of code that is doing the work), then “crosstab,” (a function that performs the actual pivot). In the parentheses after pd.crosstab, we have specified the two columns to use in the pivot: df.ID is the data we want for the rows, and df.Con_Code is the column of categories that we want to expand into as many columns as there are categories. You don’t have to know in advance how many categories exist in your data, or what they are – Python just does it.
Pd.crosstab creates a new table containing only ID and all the new columns. That entity (or “object”) is just sitting out there, invisible, in your computer’s memory. We need to join it back to our original data set so that it is reunited with Age, Sex and whatever other stuff you’ve got. That’s what “pd.merge” does. Again, “pd” is just referencing the pandas module that is providing the “merge” function. The operation is called “merge,” but it’s much the same thing as an SQL-type join, familiar to anyone who queries a database. The merge takes two inputs, our original DataFrame (“df”), and the result from the crosstab operation that I described above. The argument called “how” specifies that we want to perform the equivalent of a left-join. A couple of other optional arguments explicitly tell Python which column to use as a join key (‘ID’).
The crosstab operation is enclosed within the merge operation. I could have separated these into multiple lines, which would have been less confusing, but my point is not to teach Python but to demonstrate how much you can accomplish with a trivial amount of typing. (Or copying-and-pasting, which works too!)
We’re not quite done, though. Our merged data is still full of duplicate IDs, because the Con_Code column is still present in our original data.
Line 2 deletes (“drops”) the entire column named Con_Code, and reassigns the altered DataFrame to “df” – essentially, replacing the original df with the new df created by the drop operation.
Now that Con_Code is gone, the “extra” rows are not just duplicates by ID, they are perfect duplicates across the entire row – there is nothing left to make two rows with the same ID unique. We are ready for the final step …
Line 3 deletes (or “drops”) every row that is a duplicate of a previous row.
Having accomplished this, another couple of lines near the end of the script (not shown) will write the data row by row into a new .csv file, which you can then import into your stats package of choice. If you had two dozen different constituent codes in your data, your new file will be wider by two dozen columns … all in the blink of an eye, without any need for Excel or any manual manipulation of the data.
Excel is perfectly capable of pivoting data like we see in the example, but for working with very large data sets and seamlessly merging the pivoted data back into the larger data file, I can’t think of a better tool than Python/pandas. As the data set gets bigger and bigger, the more need there is to stop working with it in tools that go to the extra work of DISPLAYING it. I suppose one of the beauties of Excel is that you can see the data as you are working on it. In fact, as I slowly built up my script, I repeatedly opened the .csv file in Excel just to have that visual inspection of the data to see that I was doing the right thing. But I inevitably reached the point at which the file was just too large for Excel to function smoothly. At 120,000 rows and 185 columns in a 90MB file, it was hardly Big Data – Excel could open the file no problem – but it was large enough that I wouldn’t want to do much filtering or messing with formulas.
On a quick first read, the code in the example above may seem impenetrable to a non-programmer (like me), but you don’t need to memorize a lot of functions and methods to write scripts in Python. Combing the Web for examples of what you want to do, using a lot of cut-and-paste, perhaps referring to a good book now and again – that’s all it takes, really.
That said, it does require time and patience. It took me many hours to cobble together my first script. I re-ran it a hundred times before I tracked down all the errors I made. I think it was worth it, though – every working piece of code is a step in the direction of saving untold hours. A script that works for one task often does not require much modification to work for another. (This cartoon says it all: Geeks and repetitive tasks.)
Beyond data preparation for predictive modelling, there are a number of directions I would like to go with Python, some of which I’ve made progress on already:
- Merging data from multiple sources into data extract files for use in Tableau … With version 8.0 of the software comes the new Tableau API for building .tde files in Python. This was actually my first experiment with Python scripting. Using the TDE module and a combination of database queries and pandas DataFrames, you can achieve a high degree of automation for refreshing the most complex data sets behind your views and dashboards.
- Exploring other modelling techniques besides my regular mainstay (regression) … I’ve long been intrigued by stuff such as neural networks, Random Forest, and so on, but I’ve been held back by a lack of time as well as some doubt that these techniques offer a significant improvement over what I’m doing now. Python gives ready access to many of these methods, allowing me to indulge my casual interest without investing a great deal of time. I am not a fan of the idea of automated modelling – the analyst should grasp what is going on in that black box. But I don’t see any harm in some quick-and-dirty experimentation, which could lead to solutions for problems I’m not even thinking of yet.
- Taking advantage of APIs …. I’d like to try tapping into whatever social networking sites offer in the way of interfaces, and also programmatically access web services such as geocoding via Google.
- Working with data sets that are too large for high-level applications such as Excel … I recently tried playing with two days’ worth of downloaded geocoded Twitter data. That’s MILLIONS of rows. You aren’t going to be using Excel for that.
I hope I’ve been able to transfer to you some of my enthusiasm for the power and potential of Python. I guess now you’ll be wondering how to get started. That’s not an easy question to answer. I could tell you how to download and install Python and an IDE (an integrated development environment, a user interface in which you may choose write, run, and debug your scripts), but beyond that, so much depends on what you want to do. Python has been extended in a great many directions – pandas for data analysis being just one of them.
However, it wouldn’t hurt to get a feel for how “core Python” works – that is, the central code base of the language along with its data types, object types, and basic operations such as “for” loops. Even before you bother installing anything, go to Codecademy.com and try a couple of the simple tutorials there.
For specific questions Google is your friend, but if you want a reference that covers all the basics in more or less plain English, I like “Learning Python” (4th Edition, but I see there’s a 5th Edition now) by Mark Lutz, published by O’Reilly. Another O’Reilly book, “Python for Data Analysis,” by Wes McKinney, describes how to crunch data with pandas and other related code libraries. (McKinney is the main author of the pandas library.)
I think readers new to programming (like me) will feel some frustration while learning to write their first scripts using any one book or resource. The Lutz book might seem too fine-grained in its survey of the basics for some readers, and McKinney is somewhat terse when offering examples of how various methods work. The problem is not with the books themselves – they’re wonderful. Consider that Python is used in web interfaces, robotics, database programming, gaming, financial markets, GIS, scientific programming, and probably every academic discipline that uses data – you must understand that core texts are perforce very general and abstract. (Think of grammar books for spoken languages.) It’s up to coders themselves to combine the basic building blocks in creative and powerful ways.
That said, after many, many hours spent hopping back and forth between these books, plus online tutorials and Python discussion forums – and just messing around on my own – I have figured out a few useful ways to accomplish some of the more common data preparation tasks that are specific to predictive modelling. Someday I would be happy to share – and, as always, to learn from the experience of others.