Some data manipulation needs are so common, and their solutions so elusive, that when I find an answer, I just have to share. Today I will show you two problems and their solutions, wrapped up into one.
The first problem: Given a database query pulling any kind of transactional data (involving dollars, say), how do you go about aggregating the data in more than one way, within the confines of a single query, so that the various aggregations appear on one row of data?
The second problem: Given a database query that returns rows that are duplicates (by constituent ID, for example) due to some category that appears multiple times per ID (category code, gift year or whatever), how do you get rid of those duplicate rows without losing any of the data? In other words, how do you pivot the data so that the categories are added as columns rather than rows, while preserving the structure of all the other data that you don’t want to pivot?
That’s as clear as I can describe it … believe me, if you work with data, you encounter these situations all the time. An illustration might be helpful. Here is some donor gift data, limited to just Donor ID, Fiscal Year Code, and Giving Total, which is the sum of Gift Amount by ID and fiscal year. (For the sake of simplicity, I am using a code for fiscal year instead of extracting the fiscal year from actual gift dates, and I am limiting the data to three fiscal years of donation history.)
This is fine, but notice that some IDs appear on multiple rows. What if we want only one row of data per donor? And what if we want an overall giving total for the donor in one column, and then yearly totals for the three fiscal years in our data — one column and total per year? Like this:
Today I’ll show you a couple of techniques to combine different types of aggregations into a single, short SQL statement that isn’t difficult to read. (All of this works in Oracle 11g — I can’t speak to other systems’ implementations of SQL.) My primary need for writing such a query is for embedding custom SQL in a Tableau data source for the purpose of reporting. I don’t really require fancy SQL to render this precise example in Tableau. This is just a very simple example which you can adapt to more complex situations.
The SQL statement that produced our first set of results looks like this. Again for simplicity, let’s assume we require only one table, called GIFTS, so there are no joins to consider:
SELECT GIFTS.ID, SUM ( GIFTS.GIFT_AMT ) AS GIVING_TOTAL, GIFTS.FISC_YR_CODE FROM GIFTS WHERE GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13') GROUP BY GIFTS.ID, GIFTS.FISC_YR_CODE
We end up with one type of aggregation: The SUM of all gifts, grouped by ID and Fiscal Year. But there is no way to include a SUM by ID only, because if we ask for FISC_YR_CODE in the SELECT, we have to include it in the GROUP BY as well. We will deal with that issue a little later on.
First let’s deal with the duplicate IDs. Each donor ID appears once for every fiscal year out of the three that the donor made a gift. We want to pivot the data, so that the rows become columns instead. Introducing … the PIVOT operator! Have a look at the following SQL:
SELECT * FROM ( SELECT GIFTS.ID, GIFTS.GIFT_AMT, GIFTS.FISC_YR_CODE FROM GIFTS WHERE GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13') ) PIVOT ( SUM ( GIFT_AMT ) AS year FOR ( FISC_YR_CODE ) IN ( 'FY11' AS FY11, 'FY12' AS FY12, 'FY13' AS FY13 ) )
The inner SELECT statement gets all the data and hands it to the PIVOT operator, and the outer SELECT asks for all the resulting columns from that pivot. The three categories we want to appear as columns (that is, each fiscal year) are specified with IN, and the word “year” is appended to each column name. (This text can be anything you want.) I have added aliases (using AS) in order to prevent the single quote marks, which are required, from appearing in the results.
The content of the “cells” will be the SUM of gift amounts for each ID and fiscal year. The result will look like this:
Useful, eh? PIVOT does have some limitations, which I will discuss later. Let’s press on. We’re definitely getting close to realizing our goal. Now we just need to have one other aggregation (total giving by donor) appear on the same line. The problem, noted earlier, is that to SUM on giving by ID only, we are forced to leave out all other columns, in this case the fiscal year code. The SQL for giving by donor looks like this:
SELECT GIFTS.ID, SUM ( GIFTS.GIFT_AMT ) AS GIVING_TOTAL FROM GIFTS WHERE GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13') GROUP BY GIFTS.ID
Clearly we need not one but two queries, the first to get total giving by donor, and the second to get giving by ID and year (pivoted), and then join the two result sets as if they were tables. Normally this would call for creating temporary tables in the database to store the results for further querying, but data analysts are not DBAs: We may not have sufficient permissions to perform this operation on the database. (As far as I am aware, by default Tableau disallows table creation, perhaps to reassure database admins?)
Fortunately there is a solution. It’s not well-known — my fat Oracle 11g SQL reference book doesn’t even mention it — so you’ll need to do some online searches to find out more. Introducing … the WITH clause!
It works like this. Each WITH clause contains a SELECT statement that returns a result set that behaves like a temporary table, and which can be referenced in other SELECT statements. This example uses two clauses, aliased as ‘total_giving’ and ‘yearly_giving’. A final SELECT statement joins the two as if they were tables.
WITH total_giving AS ( SELECT GIFTS.ID, SUM ( GIFTS.GIFT_AMT ) AS GIVING_TOTAL FROM GIFTS WHERE GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13') GROUP BY GIFTS.ID ), yearly_giving AS ( SELECT * FROM ( SELECT GIFTS.ID, GIFTS.GIFT_AMT, GIFTS.FISC_YR_CODE FROM GIFTS WHERE GIFTS.FISC_YR_CODE IN ('FY11', 'FY12', 'FY13') GROUP BY GIFTS.ID, GIFTS.FISC_YR_CODE ) PIVOT ( SUM ( GIFT_AMT ) AS year FOR (FISC_YR_CODE) IN ('FY11' AS FY11, 'FY12' AS FY12, 'FY13' AS FY13) ) ) SELECT total_giving.ID, total_giving.GIVING_TOTAL, yearly_giving.FY11_YEAR, yearly_giving.FY12_YEAR, yearly_giving.FY13_YEAR FROM total_giving, yearly_giving WHERE total_giving.ID = yearly_giving.ID
You can accomplish what I’ve done in this example without using the WITH clause, but as an SQL statement gets more complex, it gets harder to read. As far as I know, you can use as many WITH clauses as you like, allowing you to build complex queries while preserving the neat organization and therefore readability of the SQL statement. Readability is important when you must modify or debug the statement weeks later. Oracle error messages are maddeningly uninformative!
I encourage you to research WITH and PIVOT on your own, but here are a few pointers:
- Type ‘WITH’ only once. Multiple WITH clauses are separated by a comma. (No comma following the last clause.)
- A WITH clause can reference any previously-defined WITH clause as if it were an already-existing table. Very useful.
- When using PIVOT, notice that we must specify all the categories in the pivot (fiscal year code, in this case). Unfortunately, PIVOT is not capable of dynamically pivoting whatever data it happens to find. This is just the way SQL works. If you want to pivot data without knowing in advance what the categories are, you may have to use a true programming language. I use Python, but that is not always an option.
- You can use any aggregation you like in the PIVOT — COUNT, SUM, MAX, etc.
- Don’t include the schema and table names in the identifiers named in PIVOT, or you will get an error.
I hope this has been of some help. The nuts and bolts of pulling data for analysis may not be the sexiest part of a data analyst’s job, but brushing up on these skills (whatever your database system is) will pay off in the long run.