CoolData blog

28 June 2015

Data mining in the archives

Filed under: Data, Predictor variables — Tags: , , , — kevinmacdonell @ 6:24 pm


When I was a student, I worked in a university archives to earn a little money. I spent many hours penciling consecutive index numbers onto acid-free paper folders, on the ultra-quiet top floor of the library. It was as dull a job as one can imagine.


Today’s post is not about that kind of archive. I’m talking about database archive views, also called snapshots. They’re useful for reporting and business intelligence, but they can also play a role in predictive modelling.


What is an archive view?


Think of a basic stat such as “number of living alumni”. This number changes constantly as new alumni join the fold and others are identified as deceased. A straightforward query will tell you how many living alumni there are, but that number will be out of date tomorrow. What if someone asks you how many living alumni you had a year ago? Then it’s necessary to take grad dates and death dates into account in order to generate an estimate. Or, you look the number up in previously-reported statistics.


A database archive view makes such reporting relatively easy by preserving the exact status of a record at regular points in time. The ideal archive is a materialized view in a data warehouse. On a given schedule (yearly, quarterly, or even monthly), an automated process adds fresh rows to an archive table that keeps getting longer and longer. You’re likely reliant on central IT services to set it up.


“Number of living alumni” is an important denominator for such key ratios as the percentage of alumni for whom you have contact information (mail, phone, email) and participation rates (the proportion of alumni who give). Every gift is entered as an individual transaction record with a specific date, which enables reporting on historical giving activity. This tends not to be true of contact information. Even though mailing addresses may be added one after another, without overwriting older addresses, the key piece of information is whether the address is coded ‘valid’ or ‘invalid’. This changes all the time, and your database may not preserve a history of those changes. Contact information records may have “To” and “From” dates associated with them, but your query will need to do a lot of relative-date calculations to determine if someone was both alive and had a valid address for any given point in time in the past.


An archive table obviates the need for this complex logic, and ours looks like the example below. There’s the unique ID of each individual, the archive date, and a series of binary indicator variables — ‘1’ for “yes, this data is present” or ‘0’ for “this data is absent”.




Here we see three individuals and how their data has changed over three months in 2015. This is sorted by ID rather than by the order in which the records were added to the archive, so that you can see the journey each person has taken in that time:


  • A00001 had no valid email in the database in February and March, but we obtained it in time for the April 1 snapshot.
  • A00002 had no contact information at all until just before March, when a phone append supplied us with a new number. The number proved to be invalid, however, and when we coded it as such in the database, the indicator reverted to zero.
  • A00003 appeared in our data in February and March, but that person was coded deceased in the database before April 1, and was excluded from the April snapshot.


That last bullet point is important. Once someone has died, continuing to include them as a row in the archive every month would be a waste of resources. In your reporting software, a simple count of records by archive date will give you the number of living alumni. A simple count of ‘Address Indicator’ will give you the number of alumni with valid addresses. Dividing the number of valid addresses by the number of living alumni (and multiplying by 100) will give you the percentage of living alumni that are addressable for that month. (Reporting software such as Tableau will make very quick work of all this.)


Because an archive view preserves changing statuses over time at the level of the individual constituent, it can be used for reporting trends along any slice you choose (age bracket, geography, school, etc.), and can play a role in staff activity/performance reporting and alumni engagement scoring.


But enough about archive views themselves. Let’s talk about using them for predictive modelling.


In the archive example above, you see a bunch of 0/1 indicator variables. Indicator variables are common in predictive modelling. For example, “Mailing address present” can have one of two states: Present or not present. It’s binary. A frequency breakdown of my data at this point in time looks like this (in Data Desk):




About 78% of living alumni have a valid address in the database today — the records with an address indicator of ‘1’. As you might expect, alumni with a good address are more likely to have given than alumni without, and they have much higher lifetime giving on average. In the models I build to predict likelihood to give (and give at higher levels), I almost always make use of this association between contact information and giving.


But what about using the archive view data instead? The ‘Address Indicator’ variable breakdown above shows me the current situation, but the archive view adds depth by going back in time. Our own archive has been taking monthly snapshots since December of last year — seven distinct points in time. Summing on “Address Indicator” for each ID shows that large numbers of alumni have either never had a valid address during that time (0 out of 7 months), or always did (7 out of 7). The rest had a change of status during the period, and therefore fall between 0 and 7:




A few hundred alumni (387) had a valid address in one out of seven months, 143 had a valid address in two out of seven — and so on. Our archive is still very young; only about 1% of alumni have a count that is not 0 or 7. A year from now, we can expect to see far more constituents populating the middle ground.


What is most interesting to me is an apparent relationship between “number of months with valid address” (x-axis) and average lifetime giving (y-axis), even with the relative scarcity of data:




My real question, of course, is whether these summed, continuous indicators really make much of a difference in a model over simply using the more familiar binary variables. The answer is “not yet — but someday.” As I noted earlier, only about 1% of living alumni have changed status in the past seven months, so even though this relationship seems linear, the numbers aren’t there to influence the strength of correlation. The Pearson correlation for “Address Indicator” (0/1) and “Lifetime Giving” is 0.186, which is identical to the Pearson correlation for “Address Count” (0 to 7) and “Lifetime Giving.” For all other variables except one, the archive counts have only very slightly higher correlations with Lifetime Giving than the straight indicator variables. (Email is slightly lower.)


It’s early days yet. All I can say is that there is potential. Have a look at this pair of regression analyses, both using Lifetime Giving (log-transformed) as the dependent variable. (Click on image for larger view.) In the window on the left, all the independent variables are the regular binary indicator variables. On the right, the independent variables are counts from our archive view. The difference in R-squared from one model to the other is very slight, but headed in the right direction: From 12.7% to 13.0%.




Looking back on my student days, I cannot deny that I enjoyed even the quiet, dull hours spent in the university archives. Fortunately, though, and due in no small part to cool data like this, my work since then has been a lot more interesting. Stay tuned for more from our archives.


20 July 2014

Eliminate your duplicate data row problems with simple SQL

Filed under: Data, SQL — Tags: , , , , , , , — kevinmacdonell @ 1:38 pm

We’ve all faced this problem: Duplicate rows in our data. It messes up our reports and causes general confusion and consternation for people we supply data to. The good news is you can take care of this problem fairly easily in an SQL query using a handy function called LISTAGG. I discovered this almost by accident just this week, and I’m delighted to share.

Duplicate rows are a result of what we choose to include in our database query. If we are expecting to get only one row for each constituent ID, then we should be safe in asking for Gender, or Name Prefix, or Preferred Address. There should be only one of each of these things for each ID, and therefore there won’t be duplicate rows. These are examples of one-to-one relationships.

We get into trouble when we ask for something that can have multiple values for a single ID. That’s a one-to-many relationship. Examples: Any address (not just Preferred), Degree (some alumni will have more than one), Category Code (a person can be an alum and a parent and a staff member, all at the same time) … and so on. Below, the first constituent is duplicated on Category, and the second constituent is duplicated on Degree.


That’s not the worst of it; when one ID is duplicated on two or more elements, things get really messy. If A0001 above had three Category codes and two degrees, it would be appear six times (3 x 2) in the result set. Notice that we are really talking about duplicate IDs — not duplicate rows. Each row is, in fact, unique. This is perfectly correct behaviour on the part of the database and query. Try explaining that to your boss, however! You’ve got to do something to clean this up.

What isn’t going to work is pivoting. In SQL, the PIVOT function creates a new column for each unique category of some data element, plus an aggregation of the values for that category. I’ve written about this here: Really swell SQL: Why you must know PIVOT and WITH. Using PIVOT is going to give you something like the result below. (I’ve put in 1’s and nulls to indicate the presence or absence of data.)



What we really want is THIS:


The answer is this: String aggregation. You may already be familiar with aggregating quantitative values, usually by counting or summing. When we query on a donor’s giving, we usually don’t want a row for each gift — we want the sum of all giving. Easy. For strings — categorical data such as Degree or Category Code — it’s a different situation. Sure, we can aggregate by using counting. For example, I can ask for a count of Degree, and that will limit the data to one row per ID. But for most reporting scenarios, we want to display the data itself. We want to know what the degrees are, not just how many of them there are.

The following simplified SQL will get you the result you see above for ID and Category — one row per ID, and a series of category codes separated by commas. This code works in Oracle 11g — I can’t speak to other systems’ implementations of SQL. (For other Oracle-based methods, check out this page. To accomplish the same thing in a non-Oracle variant of SQL called T-SQL, scroll down to the Postscript below.)

Obviously you will need to replace the sample text with your own schema name, table names, and field identifiers.




The two arguments given to LISTAGG are the field CATG_CODE and a string consisting of a comma and a space. The string, which can be anything you want, is inserted between each Category Code. On the next line, ORDER BY sorts the category codes in alphabetical order.

LISTAGG accepts only two arguments, but if you want to include two or more fields and/or strings in the aggregation, you can just concatenate them inside LISTAGG using “pipe” characters (||). The following is an example using Degree Code and Degree Code Description, with a space between them. To get even fancier, I have replaced the comma separator with the character code for “new line,” which puts each degree on its own line — handy for readability in a list or report. (This works fine when used in Custom SQL in Tableau. For display in Excel, you may have to use a carriage return — char(13) — instead of a line feed.) You will also notice that I have specified a join to a table containing the Degree data.






Before I discovered this capability, the only other automated way I knew how to handle it was in Python. All the previous methods I’ve used were at least partially manual or just very cumbersome. I don’t know how long string aggregation in Oracle has been around, but I’m grateful that a data problem I have struggled with for years has finally been dispensed with. Hurrah!



After publishing this I had an email response from David Logan, Director of Philanthropic Analytics at Children’s Mercy in Kansas City, Missouri. He uses TSQL, a proprietary procedural language used by Microsoft in SQL Server. Some readers might find this helpful, so I will quote from his email:

I was reading your post this morning about using the LISTAGG function in Oracle SQL to concatenate values from duplicate rows into a list. A couple of months ago I was grappling with this very problem so that I could create a query of Constituents that included a list of all their phone numbers in a single field. LISTAGG is not available for those of us using T-SQL … the solution for T-SQL is to use XML PATH().

Here’s a link to a source that demonstrates how to use it:

One key difference in this method from LISTAGG is that the delimiter character (such as the comma—or characters as in your example where you have comma space) is added before each value returned so the raw string returned will have this character before the first value. You can use the STUFF() function to eliminate this first delimiter.

Here’s my solution using XML PATH() to return a list of phone numbers from our donor database:

   '; ' + PH.NUM + ' [' + re7_ReportingTools.dbo.GetLongDescription(PH.PHONETYPEID) + ']'
     re7_db.dbo.PHONES PH

I concatenated my list using a semi-colon followed by a space ‘; ‘. I used the STUFF() function to remove the leading semi-colon and space.

Here are a few sample rows that are returned (note I’ve “greeked” the prefixes for donor privacy):


25 March 2014

An all-SQL way to automate RFM scoring

Filed under: RFM, SQL — Tags: , , , — kevinmacdonell @ 8:44 pm

In my last post I described how to automate RFM scoring using the scripting language Python and pandas. Automating RFM scoring is a great way to learn a bit of Python — but on its own it’s a lousy reason for learning Python. It was pointed out to me that you don’t really need to call on the power of Python if you’re already comfortable with (or learning to use) SQL, and I agree.

Shortly after publishing the Python post, I got an email from Bruce Higgins, who works in the Alumni Affairs & Development office at Cornell University. His all-SQL solution is shown below. I leave it to readers to figure out how to make it work for them. In the wake of my post there was some great discussion and code-sharing on the Prospect-DMM discussion list, and I encourage you to look up that thread.

You’ll notice that Bruce has hard-coded the dollar values used for scoring “Monetary value” instead of dynamically dividing the file into quintiles. As he points out, if you’re scoring the same database over and over, it’s not likely these dollar thresholds are going to change much over time. You will of course need to make adjustments for your own donor data.

WITH Five_years AS
 SUM((gift_amount)) AS cash,
 MAX(decode(.gift_fiscalyear, 2014, 5, 2013, 4, 2012, 3, 2011, 2, 2010, 1)) AS recency,
 COUNT(DISTINCT(gift_fiscalyear)) AS frequency
(SQL specifying the last five years cash transactions from our warehouse)
WHEN >= 10000 THEN 5
WHEN >= 2500 THEN 4
WHEN >= 1000 THEN 3
WHEN >= 500 THEN 2
END + Five_years.recency + Five_years.frequency AS rfm_score
FROM Five_years

2 March 2010

Fun, creative and lesser-known predictive variables

Filed under: Alumni, Predictor variables — Tags: , , , — kevinmacdonell @ 12:58 pm

Your next predictive variable will be found here. (Creative Commons license. Click for source)

University offices record all kinds of things in their databases simply in order to run their own processes: mailing the alumni magazine, ticketing for events, coding mailing preferences and on and on. Finding novel predictors for your models requires talking to colleagues in your department (and around campus) about the database screens they use, and the things they track. Exploring these avenues can be rewarding and rather social as well!

Here are a few variables I’ve tested which might be lesser-known than the ones I’ve written about earlier. These aren’t likely to appear near the top of your list of variables that are most highly correlated with giving, but it certainly won’t hurt to throw some into a regression analysis. Some variables will be more or less valuable depending on what you’re trying to predict. Some of these are negative predictors; that’s hardly a bad thing, as negative predictors will help to further differentiate the prospect pool, allowing your best prospects to stand out from the crowd.

Here we go:

Does your institution have a records researcher? When mail is returned as undeliverable to the alumni office, this person is busy coding alumni as “lost”, which marks them for later research. These codes may persist in your database after the alum is found, or they might be replaced with another code. In either case, I’ve found that alumni who allow themselves to become lost are less likely to give. A great negative predictor.

Does your alumni magazine have a “green delivery” option? Some alumni opt to access their magazine exclusively by electronic means, as a PDF download perhaps. Mailing preferences are tracked in your database, and often any sort of stated preference is a predictor.

You may already be using ‘number of phonathon refusals’ as a variable, but does your calling program record the reasons for refusal? “Financial reasons” might be a negative predictor, but not all reasons have to be negative. I’ve found that alumni who refuse because they want to handle the donation on their own (for example, mail a cheque when and if they feel like it) are excellent donors. They’re just rather phone-averse.

What about cross-references? We record family relationships among alumni – even grandparent/grandchild and in-laws. I’ve found ‘number of cross-references’ to be a significant predictor.

Alumni who want to be excluded from affinity programs (credit cards, insurance etc.) may be coded in your database so they do not receive unwanted mailings for those products. A negative predictor.

There might be a weird variable or two lurking in people’s names. For certain models, I’ve found that having a first or middle name that consists of a single initial is a positive predictor. This is somewhat correlated with age, but even after adding ‘class year’ to my regression, this variable will still improve the fit of the model. As well, Peter Wylie has written about the character length of an entire name (Prefix, First, Middle, Last, Suffix) being a predictor. Try it.

A year or so ago, I figured out how to query the database to easily retrieve the number of address updates for each alum. This only works when your records personnel create a new address record every time, instead of replacing the previous record. If an alum keeps their alma mater informed of their whereabouts, they’re probably more engaged – and more likely to give (and attend events). Ditto for number of phone updates and number of employment updates.

The previous idea is related to “class notes” for the alumni magazine. Some universities enter alumni submissions into their database so they can run their notes as a report. We don’t, but I wish we did, because I know ‘number of notes’ would be a predictor.

This might be the tip of the iceberg. Think of all the other great sources of variables that result from normal daily processes (gift processing data, online social networking data, automated call centre data, survey data …), have those conversations with your colleagues, and figure out how to get your hands on those variables for testing.

Create a free website or blog at